DEV Community

Tudor Brad

Posted on • Originally published at betterqa.co

AI isn't killing creativity in QA, but it's not saving it either

When ChatGPT showed up, the QA world split into two camps overnight. Half the people I talked to were terrified they'd be automated out of a job within a year. The other half were already pasting requirements into Claude and calling the output a "test plan." Both groups were wrong, but in interesting ways.

At BetterQA, we have 50+ engineers spread across 24 countries. We built our own test management platform, BugBoard, and a browser-based test recorder called Flows. So when AI tools started flooding into QA workflows in 2024, we had a front-row seat to what actually happened. Not what the LinkedIn posts said happened. What actually happened.

The part where everyone resisted

The first reaction from most of our engineers was skepticism, and honestly, it was justified. We'd been through hype cycles before. Remember when everyone said codeless automation would replace test engineers? That didn't exactly pan out.

So when we started integrating AI test case generation into BugBoard, about a third of the team genuinely engaged with it, another third tried it once and went back to writing cases manually, and the remaining third just ignored it. Nobody was fired. Nobody was replaced. The work kept going.

What shifted things wasn't a mandate from management. It was one engineer in Bucharest who figured out that AI was genuinely good at one narrow thing: generating the first draft of repetitive functional test cases for CRUD operations. The boring ones. Create a record, read it back, update a field, delete it, confirm deletion. Nobody enjoys writing those. The AI could spit out 40 of them in a minute, and maybe 30 would be usable after a human reviewed them.
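Those boring CRUD drafts look roughly like this minimal sketch. The `RecordStore` class here is hypothetical, a stand-in for whatever API is actually under test; the point is the shape of the cases, not the system:

```python
# Sketch of the kind of CRUD test case an AI first draft produces,
# against a hypothetical in-memory RecordStore standing in for the
# real system under test.

class RecordStore:
    def __init__(self):
        self._data = {}
        self._next_id = 1

    def create(self, fields):
        record_id = self._next_id
        self._next_id += 1
        self._data[record_id] = dict(fields)
        return record_id

    def read(self, record_id):
        return self._data.get(record_id)

    def update(self, record_id, fields):
        self._data[record_id].update(fields)

    def delete(self, record_id):
        del self._data[record_id]


def test_crud_roundtrip():
    store = RecordStore()
    # Create a record and read it back.
    rid = store.create({"name": "Ada"})
    assert store.read(rid) == {"name": "Ada"}
    # Update a field and confirm the change.
    store.update(rid, {"name": "Grace"})
    assert store.read(rid)["name"] == "Grace"
    # Delete it and confirm deletion.
    store.delete(rid)
    assert store.read(rid) is None
```

Nobody's career is advanced by writing that by hand 40 times. Reviewing 40 of them, though, still takes a human.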

That spread organically. People started using AI for the tedious baseline work and spending their freed-up hours on the stuff that actually requires a brain: edge cases, race conditions, what happens when the user does three things simultaneously that the spec never anticipated.

The part where it went wrong

Here's what the optimistic AI articles leave out.

AI-generated test cases drift hard toward happy paths. You ask it to generate tests for a login form, and you get beautiful coverage of valid emails and correct passwords. What you don't get is: what happens when someone pastes a 10,000-character string into the email field? What about authentication when the session token expires mid-request? What about the user who opens the app in two tabs and logs out of one?
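Written out as tests, the gap is obvious. Here's a sketch of the long-input case the AI drafts skip; `validate_email` is a hypothetical validator, and the inputs matter more than the implementation:

```python
# Sketch of an edge case AI drafts typically skip, against a
# hypothetical email validator. The interesting part is the input,
# not the validation logic.

MAX_EMAIL_LENGTH = 254  # practical limit from RFC 5321

def validate_email(value):
    # Stand-in validator: length-bounded, exactly one "@",
    # not at either end of the string.
    if len(value) > MAX_EMAIL_LENGTH:
        return False
    return (value.count("@") == 1
            and not value.startswith("@")
            and not value.endswith("@"))

def test_happy_path():
    # This is what AI drafts cover beautifully.
    assert validate_email("user@example.com") is True

def test_pathologically_long_input():
    # A 10,000-character paste should be rejected, not crash the form.
    # This is the case that never shows up in the generated suite.
    assert validate_email("a" * 10_000 + "@example.com") is False
```

The session-expiry and two-tab cases are harder to sketch in ten lines, which is exactly why they never appear in generated suites either.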

We caught this pattern about two months in. Engineers who leaned too heavily on AI-generated cases were shipping test suites that looked comprehensive on paper but missed the kinds of bugs that actually make it to production. The coverage numbers looked great. The bug escape rate told a different story.

Then there's the hallucination problem. We had AI-generated bug reproduction steps that sounded perfectly plausible, referenced UI elements that existed, described a logical sequence of actions, and were completely wrong. The bug was real, but the reproduction path was fabricated. An engineer spent half a day following AI-generated steps before realizing the AI had basically made up a story that fit the symptoms.

That's the core trust problem with AI in QA: it sounds confident whether it's right or wrong. A junior engineer writing sloppy repro steps at least knows they're guessing. The AI doesn't hedge. It presents fiction with the same tone as fact.

Where the time actually went

The honest accounting of time savings looks nothing like what the pitch decks suggest.

Before AI tools, a typical engineer on our team spent maybe 40% of their day writing test cases and documentation, 20% executing tests, 20% investigating and reproducing bugs, and 20% in meetings, syncs, and reporting. The fantasy was that AI would eliminate that first 40% and everyone would spend the extra time on deep exploratory testing.

What actually happened: AI cut the test case writing time roughly in half, so about 20% of the day freed up. But a chunk of that time went straight into reviewing and fixing what the AI generated. You can't just accept AI output without checking it, not if you care about quality. So the net gain was more like 10-12% of the day back. Real, but not revolutionary.
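The back-of-the-envelope math behind that net figure, with the review overhead as my rough estimate rather than a measured number:

```python
# Back-of-the-envelope accounting for the numbers above.
writing_share = 0.40   # share of the day spent writing cases pre-AI
ai_reduction = 0.50    # AI roughly halves raw writing time

freed = writing_share * ai_reduction   # 0.20 of the day freed up

review_overhead = 0.08  # assumed: time spent reviewing/fixing AI output

net_gain = freed - review_overhead     # ~0.12 of the day
print(f"net gain: {net_gain:.0%} of the day")
```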

Where that time went varied by person. Some engineers used it for exploratory testing, which is where it should go. Others used it to take on more projects, which is fine but isn't the "creativity unleashed" narrative. A few used it to leave work earlier. I'm not judging that either.

The thing AI is actually bad at

Our founder, Tudor Brad, has a line I keep coming back to: "AI will replace development before it replaces QA."

It sounds provocative, but the logic holds up. Development is increasingly about translating specifications into code, and AI is getting scarily good at that. QA is about figuring out what the specifications missed. It's about understanding how a real person, distracted and impatient and on a bad network connection, will interact with software that was designed for an ideal user in ideal conditions.

That requires a kind of adversarial creativity that AI doesn't have. AI can generate test cases from requirements. It cannot look at a feature and think, "I bet someone will try to use this in a way nobody intended." It cannot notice that the checkout flow feels sluggish on a Thursday afternoon because that's when the payment provider's API slows down. It cannot feel the frustration of a screen reader user encountering a modal that traps focus.

Tudor also says something that sticks with me: "You don't want your first clients to be the first humans utilizing your product." That's the argument for QA in general, but it's even more relevant now. If you're shipping features faster because AI helps developers write code faster, you need QA that can keep pace. Speed without quality is just producing bugs faster.

What we actually recommend now

After a year of watching this play out across dozens of client projects, here's where we landed.

Use AI for first-draft test cases on well-understood, repetitive functionality. Review everything it produces. Don't trust coverage metrics generated from AI test suites without verifying the cases actually test what they claim to test.

Don't use AI for security testing logic, complex business rule validation, or anything where a wrong answer looks identical to a right answer. Those are human problems. We've found AI useful for suggesting areas to investigate during security testing, but the actual testing needs a person who understands what they're looking at.

Use AI for documentation and bug report formatting. This is honestly where the biggest time savings come from, and it's the least glamorous. Nobody writes blog posts about how AI made their Jira tickets more readable. But the engineers on our team who use BugBoard's AI formatting consistently file clearer bug reports, which means developers fix bugs faster, which means the whole cycle improves. Boring? Yes. Useful? Also yes.

Don't use AI as a replacement for understanding the system you're testing. The temptation is real: paste the requirements into AI, get test cases, execute them, move on. But if you don't understand why those test cases exist, you can't adapt when the requirements change or when something unexpected happens during testing.

Where this is actually heading

I don't think AI is killing creativity in QA. I also don't think it's unleashing some hidden creative potential that was always there, waiting to be freed from the shackles of documentation. The reality is messier and less interesting as a headline.

AI is a tool that's good at some things, bad at others, and dangerous when you don't know which is which. The teams doing well with it are the ones that figured out the boundaries through trial and error, not the ones who read a whitepaper and went all-in.

At BetterQA, we're still figuring it out too. We keep adjusting what we recommend to clients based on what we're seeing in practice. That's the honest answer: nobody has this fully solved yet, and anyone claiming otherwise is probably selling something.

More on what we're learning as we go at betterqa.co/blog.
