The three AI tools we tried for QA and the one we kept

#testing #automation #security #ai

The first time I watched an engineer try to use ML to generate test cases for a client project, she had been at it for about six hours. She was using Diffblue Cover on a Java monolith, which had been pitched to her as an automatic unit test generator. It had produced something like 2,400 tests. Every single one of them passed. None of them would have caught the bug we were actually chasing, which was a session expiry race condition in the checkout flow.

I still remember her sitting there staring at the coverage report, which was showing 87 percent, and asking me whether anyone had ever actually shipped software on the back of those numbers. The tests weren't wrong, exactly. They just verified that the code did what the code did. A method that multiplied two numbers had a test confirming it multiplied two numbers. A controller that returned a 200 had a test confirming it returned a 200. The checkout race condition lived in the gap between two services, and no amount of per-method unit tests were going to find it.

That was maybe two years ago. Since then we've tried three different AI-flavoured tools for QA work at BetterQA, and I want to be honest about which ones wasted our time and which one earned a place in our workflow. Because the marketing around "AI in QA" has gotten so loud that you can't find an honest post about any of it, and I'm tired of reading whitepapers that explain the benefits without admitting the failures.

The visual regression disaster

The second tool was an AI-powered visual regression service. I'm not going to name it, because the team behind it is probably working hard and the category has moved on, but the pitch was beautiful. You give it a URL, it takes baseline screenshots, then on every deploy it compares the new render against the baseline and tells you what actually changed visually, not just pixel-by-pixel.

The promise was: no more flaky screenshot diffs. The AI understands when an animated element is just in a different frame. It knows the difference between "the header font changed" and "the ad network served a different creative."

We ran it against a client e-commerce site for two weeks. On one deploy it flagged 400 differences. Four hundred. Most of them were product carousel images that had rotated since the baseline was captured. Some were the cookie banner appearing in a slightly different state. A handful were genuine CSS regressions that we did need to know about, and those were buried under the noise like needles in a haystack made of other needles.

We tuned it. We set thresholds. We excluded regions. We trained it on what to ignore. After two weeks our engineer Mihai said the exact phrase "I would rather write Cypress assertions by hand than tune this thing for another day," and we killed the pilot. The problem wasn't that the AI was bad at image comparison. It was genuinely impressive at that. The problem was that "visual regression" is not actually a question an AI can answer without understanding intent, and intent is exactly what these tools don't have.

The bug triage bot that tried to be helpful

The third experiment was an LLM-based bug triage assistant. We set it up to read incoming bug reports from one client's Jira, classify them by severity, route them to the right team, and draft a first-response message to the reporter. This one wasn't a total failure. The classification was decent. The routing worked most of the time. But the drafted responses had this confident, slightly off quality where they would reference things the reporter hadn't said, or reassure them about a fix timeline that hadn't been agreed. One of our clients got back a response that promised a hotfix within 24 hours for a bug that was clearly a feature request.

We pulled the drafting feature. Kept the classification. It's still running, and it saves our triage lead maybe 20 minutes a day. Useful, but not the 10x productivity boost the vendor's landing page promised.

The one that earned its keep

Here's the one that worked. We built it ourselves. Inside BugBoard, our test management platform, we use the Anthropic API (specifically Claude) to help generate test cases from user stories and requirements documents. You paste in a feature description and it produces a draft set of test cases covering happy paths, edge cases, negative scenarios, and some permission-matrix stuff that humans tend to forget.

The honest version of this story: it's not magic. The draft is rarely shippable as-is. Our engineers review every single generated case, rewrite about 40 percent of them, delete maybe 15 percent as nonsense, and keep the rest. The cases that get kept are usually the boring ones that a tired QA engineer would have missed on a Friday afternoon. The nonsense cases are the ones where Claude hallucinates a feature that doesn't exist because it pattern-matched against something in the requirement doc that looked familiar.

But even with that 40 percent rewrite rate, it's saved us hours per project. A senior QA engineer can produce a first draft of a test plan in 30 minutes instead of a day. The 30 minutes still includes review time. The day used to be mostly typing.

The key thing is that we never let it skip the human. There is no auto-publish. There is no "generate test plan and run it unattended." Every draft goes through a human who understands the product, the client, and what the test is actually supposed to prove. If we stripped out that review step, we'd be shipping the same tautological nonsense that Diffblue produced for us in the first story.

What I actually think is happening

After doing this for long enough, I've stopped believing the framing of "AI for QA." I think AI is good at certain subtasks that QA engineers do, and it's bad at the thing that actually makes QA valuable, which is adversarial creativity. A good tester looks at a form and thinks, what if I paste 10,000 characters in here. What if I open two tabs and submit from both. What if my connection drops halfway through. What if the admin user and the regular user both hit this endpoint at the same time. None of that is pattern matching. It's imagination, and specifically the pessimistic kind of imagination that comes from having been burned by production incidents.

Tudor, who founded BetterQA back in 2018 in Cluj-Napoca, has this line he uses in almost every talk: "AI will replace development before it replaces QA." When I first heard it I thought it was a sales line. After the Diffblue experiment, and the visual regression disaster, and the triage bot that invented hotfix timelines, I think it's just true. Development is about turning intent into code, which LLMs are increasingly good at. QA is about figuring out what the intent missed, what the developer didn't think of, what the product manager didn't write down, what the end user is going to do that nobody predicted. That's a very different job, and current AI is not close to doing it.

Also: the chef should not certify his own dish. An AI trained on your codebase is just a very fast chef tasting its own cooking. It will tell you the food is fine. It will be confidently wrong. You need someone standing outside the kitchen who doesn't care about the deadline and isn't related to the sous-chef.

ML systems are not traditional software, and that matters

One thing the original version of this post got right, and I want to keep, is that testing ML models is a fundamentally different job from testing deterministic software. If you're shipping a model that makes recommendations, classifies images, or scores loan applications, your QA process has to deal with things traditional testing doesn't touch.

Data quality is the first one. I've seen teams train models on data that had a bias baked in from the sampling process nobody documented, and then wonder why the model kept producing weird outputs for one demographic group. The model was working correctly. It was learning exactly what it had been shown. The problem was upstream.

Model drift is the second one. A model that's 94 percent accurate on Monday can be 81 percent accurate on Thursday if the real-world inputs have shifted. You can't catch this with a traditional regression test suite because there isn't a fixed expected output. You need production monitoring that flags when the distribution of predictions starts to look different, and you need someone qualified to look at that and decide whether it's a retraining problem or a data source problem.

And then there's the fairness and compliance layer, which is where things get interesting if you're shipping into regulated industries. A hiring model that disadvantages a protected class isn't just technically incorrect, it's illegal in most jurisdictions. GDPR's right to explanation means you need to be able to tell a user why the model made the decision it made, which is a fun constraint to impose on a neural network. Auditing this stuff requires QA engineers who understand the legal frame, not just the code.

What we actually do for clients shipping ML features

When a client asks us to QA an ML product, we don't run a test suite and sign off. We do the boring deterministic testing around it, because the rest of the product still matters. Then we build a validation harness that checks the model against held-out data, measures precision and recall on the specific subgroups the client cares about, and flags drift from the baseline. Then we do adversarial testing, which means trying to get the model to produce bad outputs on purpose. If it's a content classifier, we try to sneak things past it. If it's a chatbot, we try to get it to say things it shouldn't. If it's a recommendation engine, we look for echo chambers.

The last piece, and this is the one I care about most after the last year of AI adoption, is prompt injection testing for anything LLM-powered. If the product includes a model that can be talked to, we try to talk it into leaking things it shouldn't leak. This is a new category of testing that basically didn't exist three years ago, and I think it's going to be a huge part of QA for the next decade. The attack surface of "what can a user type into a text box that makes the AI do something it wasn't supposed to do" is enormous, and most teams shipping AI features haven't thought about it at all.

The short version

We tried Diffblue-style unit test generation. It produced tests that proved the code did what the code did, and missed the actual bug. We tried AI visual regression. It generated 400 false positives on a single deploy and we killed the pilot after two weeks. We tried LLM bug triage. We kept the classification part and pulled the drafting part after it promised a hotfix nobody had agreed to.

The one that worked is the one where AI drafts and humans decide. Inside BugBoard, Claude helps us generate test case drafts that still get reviewed by a human QA engineer who knows the product. It saves real hours. It does not replace the human. If we ever remove that review step, we will end up with 2,400 passing tests and a broken checkout flow, and we'll deserve it.

That's the thing I wish more people would say out loud. AI can make QA faster. It cannot make QA better on its own. The better part still comes from a person who's been burned enough times to know what to look for, and who is paid to care when everyone else is trying to ship.

Top comments (0)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.