QA - The Challenge of Testing AI

AI testing and quality assurance have become unusually difficult challenges in modern software engineering. Unlike traditional software systems, AI models cannot have their behavior fully specified in advance, and that difference calls into question how quality should be measured.

In fields such as mechanical engineering, a bridge can be tested against well-defined safety standards. In software engineering, applications can be validated by confirming that they perform their intended functions correctly. These classical testing practices originated from systems whose behavior could be specified first and verified later. AI systems invert that relationship. They behave first, and we judge the results afterward. The difficulty in AI quality assurance is not a failure of engineering discipline, but a consequence of testing a fundamentally different kind of system.

The success of traditional software rests on contractual correctness. A system is designed to do B when given A. If it does C instead, something is wrong. The failure is clear, local, and fixable. Unit tests exist to enforce these explicit promises. Logic errors can be isolated, patched, and verified. This entire framework depends on the system exposing a deterministic contract between inputs and outputs.
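To make the contrast concrete, here is a minimal sketch of that kind of contract. The add() function and its test are made up for illustration; the point is that the expected output is exact and the assertion is binary.

```python
# A minimal sketch of the kind of contract a unit test enforces. The add()
# function and the test are hypothetical, purely to illustrate a deterministic
# input/output promise: given A, the system must produce B, every time.

def add(a, b):
    return a + b

def test_add_returns_exact_sum():
    # Any result other than the specified output is a bug - clear, local, fixable.
    assert add(2, 3) == 5
    assert add(-1, 1) == 0

if __name__ == "__main__":
    test_add_returns_exact_sum()
    print("contract holds: exact outputs verified")
```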

But AI systems just don't work that way. At inference time, there are no if–then rules to validate against, and no single “correct” answer can be guaranteed. A model produces outputs based on learned probability distributions shaped by training data and optimization objectives. In many cases, multiple outputs are reasonable or even preferred, so the user always gets some kind of answer to the prompt. No response from the model can be considered 'strictly incorrect' in the classical sense.

This represents a change in the form of the contract itself. Instead of guaranteeing exact outputs, AI systems can only guarantee statistical behavior within acceptable bounds. The system provides tendencies, not precise outcomes.
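As a sketch of what "statistical behavior within acceptable bounds" might look like in practice: the model call, the acceptability check, the trial count, and the 95% threshold below are all stand-ins I've made up, but the shape of the test is the point - many samples, one aggregate claim.

```python
# A rough sketch of testing tendencies instead of exact outputs. model_respond()
# and is_acceptable() are placeholders, and the 95% threshold and trial count are
# arbitrary examples - the assertion here is statistical, not exact.
import random

def model_respond(prompt):
    # Stand-in for a nondeterministic model that is acceptable ~97% of the time.
    return "good" if random.random() < 0.97 else "bad"

def is_acceptable(output):
    return output == "good"

def check_acceptance_rate(prompt, trials=200, required_rate=0.95):
    passes = sum(is_acceptable(model_respond(prompt)) for _ in range(trials))
    observed = passes / trials
    # The claim being tested is about behavior across many samples,
    # not about any single response.
    return observed, observed >= required_rate

rate, within_bounds = check_acceptance_rate("Summarize this paragraph.")
print(f"observed acceptance rate: {rate:.2%}, within bounds: {within_bounds}")
```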

As a result, the concept of a “bug” becomes ambiguous. When a model produces a bad answer, the inference process can't really be blamed - it likely worked exactly as designed. The math executed correctly. The weights were applied as intended. Any failure, if it can even be called that, happened upstream, in the training data and optimization choices. The error is not a faulty branch or an incorrect variable; it is one outcome of the model's learned behavior. Debugging, in the traditional sense, doesn't really apply, since the model didn't necessarily 'fail'.

Because of this, AI quality assurance can't really ask, “Is this correct?” Instead, it asks, “Is this acceptable?” Rather than pass/fail assertions, it relies on comparisons against reference datasets, human preference judgments, and statistical evaluations. To complicate things further, one AI model is often used to approximate how humans would evaluate another. The outputs of these processes are scores and confidence intervals, not precise verdicts.
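Here's one way that kind of evaluation can be sketched in code. Everything in it - the reference set, the word-overlap judge() heuristic standing in for a human rater or a judge model, the confidence-interval math - is an assumption for illustration, not a description of any particular evaluation framework.

```python
# A hedged sketch of acceptability scoring against a small reference set.
# The judge() heuristic stands in for a human rater or a judge model, and the
# reference data is invented; real evaluations use far larger samples.
import math
import statistics

reference_set = [
    {"prompt": "Define QA.", "reference": "quality assurance verifies that software meets expectations"},
    {"prompt": "What is a unit test?", "reference": "a unit test checks one small piece of code in isolation"},
]

def judge(candidate, reference):
    # Placeholder heuristic: crude word overlap with the reference, returning a
    # score in [0, 1] rather than a pass/fail verdict.
    cand_words = set(candidate.lower().split())
    ref_words = set(reference.lower().split())
    return len(cand_words & ref_words) / max(len(ref_words), 1)

def evaluate(model_fn):
    scores = [judge(model_fn(ex["prompt"]), ex["reference"]) for ex in reference_set]
    mean = statistics.mean(scores)
    # Rough 95% confidence interval (normal approximation) around the mean score.
    stderr = statistics.stdev(scores) / math.sqrt(len(scores)) if len(scores) > 1 else 0.0
    return mean, (mean - 1.96 * stderr, mean + 1.96 * stderr)

mean_score, ci = evaluate(lambda prompt: "a unit test checks a small piece of code")
print(f"mean acceptability score: {mean_score:.2f}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```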

This reflects a genuine shift in what can be measured. Traditional QA enforces correctness against a known specification. AI QA measures acceptability against human-aligned expectations that cannot be fully formalized in advance. Evaluation happens after behavior emerges, not before it is defined.

And really - if I ask an LLM to 'create a website' and I don't like the final result, did the model fail? No - it was my failure to give it more precise information so that the result would land closer to what I was expecting.

I've actually run into this recently while using and testing SLMs (small language models) after thousands of hours with the larger online models. You have to be very specific with an SLM - it simply doesn't have the bandwidth or onboard training data to 'figure out' what it is that you want. This has forced me to be more focused and precise when I ask it for something. A good habit - it reminded me that my part in the process was important, right from the start.

Ben Santora - January 2026
