Every few months someone publishes a blog post claiming AI has made QA engineers obsolete. The argument usually goes like this: AI can generate test cases, run them faster than a human, and find patterns in logs, so why are you still paying people to click buttons?
I work at a company that builds AI testing tools. We sell them. We use them daily. And I'm going to tell you, honestly, that the argument for keeping humans in QA has almost nothing to do with the reasons people usually give.
It's not about "human intuition" or "the human touch" or whatever vague phrase gets dropped into these discussions to make testers feel better. The real gap is simpler and harder to fix: AI cannot decide what to test. It can only test what you already told it about.
The green build that shipped broken
I watched a developer wire Playwright up to an MCP server last year. The idea was solid. Record user flows, generate assertions automatically, let the AI maintain the selectors when the DOM changed. The whole suite ran overnight. Everything green. Ship it.
The product was broken. Users couldn't complete the checkout flow because a loading spinner hung indefinitely on slow connections. The tests all passed because none of them tested slow connections. None of them tested what happens when a user pauses mid-flow to check their bank balance on another tab and comes back. None of them asked "what does a confused person do here?"
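The failure pattern is easy to model: the suite only varied the inputs it was told about, and network speed was never one of them. A toy Python sketch (all names and the timeout are hypothetical, not the actual product's code) of how every assertion can pass while the slow-connection path stays broken:

```python
SPINNER_TIMEOUT_S = 2.0  # hypothetical client-side timeout

def checkout_state(network_delay_s: float) -> str:
    """Toy stand-in for the checkout step: the spinner clears only
    if the payment script loads before the client gives up."""
    if network_delay_s <= SPINNER_TIMEOUT_S:
        return "ready"
    return "spinner_stuck"  # what real users on slow links actually saw

# What the generated suite checked: fast connections only.
assert checkout_state(0.05) == "ready"
assert checkout_state(0.10) == "ready"

# The question nobody told the AI to ask:
slow_result = checkout_state(5.0)  # -> "spinner_stuck"
```

The green suite and the broken product coexist happily because `network_delay_s` was never a dimension the generated tests knew to vary.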
The tests were technically correct. They answered every question they were given. They just weren't asking the right questions. And that's the gap nobody talks about.
Tudor Brad, BetterQA's founder, puts it bluntly: "AI will replace development before it replaces QA." The reasoning is counterintuitive until you think about it. Development is about producing something that works according to a specification. AI is getting very good at that. QA is about figuring out all the ways something might not work, for people who haven't read the specification, on devices you didn't plan for, in situations nobody documented. That's a fundamentally different problem.
What AI is actually bad at
Let me be specific, because "AI lacks judgment" is too vague to be useful.
AI doesn't question requirements. Hand a model a spec and it will generate tests for that spec. If the spec is wrong, the tests will faithfully validate the wrong behavior. I've seen AI-generated test suites where every test passed and the feature was still unusable, because the acceptance criteria described something no user would actually want. A human tester reads a spec and says "wait, this doesn't make sense." An AI reads the same spec and says "here are 47 test cases."
AI can't simulate confusion. A first-time user doesn't know what your buttons do. They don't know the difference between "Save" and "Submit." They hover over things, they click back, they try entering their phone number in the email field. An AI testing agent follows the flow it was given. It doesn't get lost. It doesn't misunderstand. That makes it useless for catching the class of bugs that only appear when someone doesn't know what they're doing.
AI hallucinates confidence. Ask a model to check for duplicates among your test cases and it will say "no duplicates found" while quietly duplicating half of them. Ask it to verify a screenshot against expected behavior and it will describe elements that aren't in the image. Ask it for edge cases and it will generate five variations of the happy path and present them as though they're adversarial. The failure mode isn't "AI gets it wrong." It's "AI gets it wrong and presents the result with the same confidence as when it's right."
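The duplicate check is a good example of work that should never be delegated to a model in the first place: a few lines of deterministic code do it exactly, every time, and never report "no duplicates found" while duplicating things. A minimal sketch (the test-case shape and normalization rules are assumptions, not any particular tool's schema):

```python
import re
from collections import defaultdict

def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial rewordings still match."""
    return re.sub(r"\s+", " ", text.strip().lower())

def find_duplicates(cases: list[dict]) -> dict[str, list[str]]:
    """Group test-case IDs by normalized title + steps.
    Returns only the groups with more than one member."""
    groups = defaultdict(list)
    for case in cases:
        key = normalize(case["title"]) + "|" + normalize(case["steps"])
        groups[key].append(case["id"])
    return {k: ids for k, ids in groups.items() if len(ids) > 1}

cases = [
    {"id": "TC-1", "title": "Checkout happy path", "steps": "add item; pay"},
    {"id": "TC-2", "title": "checkout  happy path", "steps": "add item;  pay"},
    {"id": "TC-3", "title": "Login with bad password", "steps": "enter wrong pw"},
]
dupes = find_duplicates(cases)
# TC-1 and TC-2 differ only in spacing and case, so they land in one group
```

The point isn't that this code is clever. It's that a deterministic check gives the same answer every run, which is exactly the property a hallucinating model can't offer.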
AI drifts toward success. This is the one that keeps me up at night. Left to its own devices, an AI test generator produces tests that pass. Not because it's cheating, but because the training data is biased toward working software. It generates the tests that look like the tests it's seen before. Those tests were written for software that mostly worked. So you get beautiful coverage of the paths that work and almost nothing for the paths that break.
Where AI actually helps (and we use it heavily)
I'm not arguing against AI in testing. That would be hypocritical, because we built an entire product around it.
BugBoard, our test management platform at bugboard.co, uses AI to generate test cases from bug history, create regression suites from closed issues, and analyze screenshots for visual defects. It does this faster than any human could. An engineer used to spend a week writing test cases for a new module. The AI draft takes about 30 seconds.
But here's the part people skip: a human reviews everything before it's saved. Every generated test case, every proposed regression scenario, every screenshot analysis result goes through a person who can say "this is wrong" or "this is right but irrelevant" or "you missed the actual risk." Nothing is written to the project without someone accepting it.
This isn't a formality. In our early versions, we let the AI write directly to the test suite and the results were a mess. Duplicate tests everywhere. Tests for features that didn't exist. Test IDs that collided with existing ones. The model was fast and productive and wrong often enough to erode trust in the entire suite. We learned that an AI that's wrong five percent of the time in a QA context is worse than no AI at all, because QA exists specifically to catch the five percent that's wrong.
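The fix that worked was structural, not smarter prompting: generated cases land in a pending queue, ID collisions are rejected mechanically, and only an explicit human acceptance writes anything to the suite. A simplified sketch of that gate (class and field names are illustrative, not BugBoard's actual code):

```python
class ReviewGate:
    """AI drafts go to `pending`; only accept() writes to the suite."""

    def __init__(self):
        self.suite = {}    # id -> test case; the source of truth
        self.pending = {}  # id -> draft awaiting human review

    def propose(self, case_id: str, case: dict) -> bool:
        # Reject ID collisions mechanically, before a human ever sees them.
        if case_id in self.suite or case_id in self.pending:
            return False
        self.pending[case_id] = case
        return True

    def accept(self, case_id: str, reviewer: str) -> None:
        # The reviewer's name is recorded: accountability, not formality.
        case = self.pending.pop(case_id)
        case["accepted_by"] = reviewer
        self.suite[case_id] = case

    def reject(self, case_id: str) -> None:
        self.pending.pop(case_id)

gate = ReviewGate()
gate.propose("TC-101", {"title": "Checkout on slow connection"})
gate.propose("TC-101", {"title": "Colliding ID"})  # returns False, dropped
gate.accept("TC-101", reviewer="ana")
# Only the reviewed case reached the suite; the collision never touched it.
```

The design choice worth copying is that the model has no write path to the suite at all. It can only propose, and rejection is the default.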
The vibe coding problem
Here's what changed in the last two years. Development got dramatically faster. Features that used to take a team three months now take a solo developer three hours with Copilot or Cursor or Claude. That's real. That's happening.
But faster development means more surface area to test, more features shipping per sprint, more code generated by models that are confident but not always correct. The testing bottleneck didn't shrink. It grew. And it grew in exactly the direction where AI testing is weakest: novel flows, new edge cases, features that didn't exist last week and don't have historical test data to learn from.
This is why the "AI replaces testers" narrative is backwards. The more AI writes code, the more you need humans reviewing what that code actually does when real people touch it. You don't want your first users to be the first humans who interact with your product. That's not a testing strategy. That's a prayer.
Regulations aren't going anywhere
There's a pragmatic angle too. If you're building healthcare software, fintech products, or anything that touches personal data, a regulator will eventually ask you who verified that the system works correctly. "Our AI ran the tests" is not an answer anyone will accept. Compliance requires accountability, and accountability requires a person who understood what they were testing and why.
This isn't theoretical. We work with clients across regulated industries and the conversation always lands in the same place: AI can help you test faster, but someone with a name and a job title needs to sign off on the results.
The right framing
The question isn't "will AI replace testers?" The question is "what parts of testing can AI do well, and what parts does it do badly?"
AI is good at: generating test case drafts from existing data, maintaining selectors when the DOM changes, running thousands of identical checks overnight, scanning logs for known patterns, flagging visual differences between screenshots.
AI is bad at: deciding what to test in the first place, simulating a confused or malicious user, questioning whether the feature should exist at all, noticing that the spinner never stops, knowing when something feels wrong even though every assertion passes.
The first list is labor. The second list is judgment. You can automate labor. Judgment is the part that makes QA a profession and not a script.
Where this leaves us
I don't think AI will replace testers. I also don't think the argument should be framed defensively. This isn't about protecting jobs. It's about building software that works for actual people in actual conditions.
The companies that will do testing well in the next few years are the ones that use AI for the repetitive, high-volume, pattern-matching work and keep humans on the questions that require asking "should this exist?" and "what would go wrong for someone who isn't me?"
That division of labor is not a compromise. It's the only approach I've seen that actually works.
More at betterqa.co/blog.