A few years ago our QA engineers started doing something strange. They'd keep two browser tabs open all day. One was our test management tool. The other was ChatGPT.
They'd paste a requirement in, ask for test cases, copy the output back, reformat it, rename things to match our conventions, realize it duplicated tests we already had, delete half of it, try again. Repeat for bugs. Repeat for log analysis. Repeat for screenshots. Every day, for hours.
We actually measured it. Our senior engineers were spending 20 to 30 percent of their time moving text between an LLM and the tools that held their real work. The AI was fast. The humans were slow. The bottleneck was the gap between them.
So we tried to close it. The first two attempts were bad.
First attempt: stuffing context into the prompt
The obvious move was to paste more context into ChatGPT. "Here's our naming convention. Here are the existing test cases. Here's the bug history. Now generate me new tests."
It kind of worked for ten minutes, then fell apart. Context windows were too small, pastes were too big, and engineers kept forgetting which project's context they'd loaded into which chat. We had one case where a payment flow got regression tests written against a completely different product's bug history. The tests looked reasonable. They were all wrong.
We also had a trust problem. The model would cheerfully invent test cases for features that didn't exist. It would reference requirements nobody gave it. It would generate test IDs that collided with existing ones and, when asked to check for duplicates, say "no duplicates found" while duplicating half of them.
An AI that is wrong five percent of the time in a QA context is unusable. QA exists to catch the five percent that's wrong. If the tool generating your tests is also the source of the bugs, you've built a loop.
What we ended up building
We stopped trying to make a generic chatbot smart about our projects and built tools where the model runs inside the data instead. The AI in BugBoard doesn't get pasted a project summary. It queries the project directly, through a filtered slice it's allowed to see.
When an engineer asks for regression tests, the model reads the actual closed bugs in that project, which ones were recurring, what modules they hit, what the root causes were. When someone uploads a screenshot of a broken UI, it's read against the project's known components and compared to past screenshots, and if the model can't reproduce the finding clearly in text, it flags low confidence instead of inventing detail. Nothing gets saved without a human accepting it. No silent writes.
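The shape of that setup can be sketched in a few lines. This is an illustrative sketch, not BugBoard's actual code: `Bug`, `project_slice`, and `save_if_accepted` are hypothetical names standing in for the two rules described above, the model reads only a filtered slice of one project, and nothing is written back without a human accepting it.

```python
from dataclasses import dataclass

@dataclass
class Bug:
    project_id: str
    module: str
    status: str
    title: str

# The "filtered slice" idea: the model never sees the whole database,
# only the rows the current project is allowed to expose.
def project_slice(bugs, project_id, statuses=("closed",)):
    return [b for b in bugs if b.project_id == project_id and b.status in statuses]

# Human-in-the-loop gate: drafts are persisted only on explicit acceptance.
def save_if_accepted(store, draft_tests, accepted):
    if accepted:
        store.extend(draft_tests)
    return store
```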
The hard part turned out to be deduplication. The naive approach is cosine similarity on titles, which gives you garbage, because "Login button unresponsive on iOS" and "iPhone login broken" look nothing alike to an embedding but describe the same bug. We ended up running a second pass where the model reads both candidates in full and explains in plain text whether they're the same thing. Slower. More expensive. Also the only thing that worked.
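The two-pass structure is roughly this. A sketch under stated assumptions: `embed` and `judge` are placeholders for whatever embedding model and full-read LLM comparison you plug in, and the threshold is arbitrary. The cheap cosine pass only shortlists; the expensive judge makes the actual call.

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dedupe_candidates(new_item, existing, embed, judge, shortlist_threshold=0.3):
    """First pass: cheap embedding similarity to shortlist candidates.
    Second pass: `judge` reads both reports in full and decides whether
    they describe the same bug. Slower, more expensive, more accurate."""
    new_vec = embed(new_item)
    shortlist = [e for e in existing
                 if cosine(new_vec, embed(e)) >= shortlist_threshold]
    return [e for e in shortlist if judge(new_item, e)]
```

Note the failure mode in the text still applies to the first pass: two reports of the same bug with no shared vocabulary can fall below the threshold and never reach the judge, which is why the threshold has to be set permissively.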
Flows, and the time self-healing made everything worse
The browser automation side was messier.
We have a Chrome extension called Flows that records browser actions and replays them. Everyone has built one of these. The interesting question is what you do when the DOM changes, because the DOM always changes.
Our first self-healer was aggressive. When a selector broke, it would look for anything close: nearby text, similar roles, near-identical structure. Tests stopped failing. Engineers loved it. And then we noticed tests had also stopped catching regressions, because "healing" kept silently clicking the wrong button and the steps downstream were passing by accident.
So we pulled it back. Now when a selector breaks, we show the proposed replacement and mark the run yellow, not green, until a human confirms. Fewer red builds. More yellow ones. The engineers hated this for about a week, then admitted it was right.
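The pulled-back policy fits in one function. Again a sketch, not the extension's real code: `find_element` and `propose_replacement` are hypothetical hooks for the replay engine and the healer, and the point is the status logic, a heal proposal never produces green on its own.

```python
from enum import Enum

class RunStatus(Enum):
    GREEN = "passed"
    YELLOW = "needs_review"   # healed step awaiting human confirmation
    RED = "failed"

def replay_step(find_element, selector, propose_replacement):
    """When the recorded selector breaks, don't silently click the closest
    match; surface the proposed replacement and mark the run yellow."""
    el = find_element(selector)
    if el is not None:
        return RunStatus.GREEN, el, None
    candidate = propose_replacement(selector)
    if candidate is None:
        return RunStatus.RED, None, None
    return RunStatus.YELLOW, None, candidate
```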
A test that passes when it shouldn't is worse than a test that fails when it shouldn't. We learned this one the expensive way.
What still doesn't work

Three things we'd rather not advertise, all of them true.
AI-generated tests still drift toward happy paths. If you don't explicitly ask for negative cases, the model quietly pretends users only ever enter valid data. We've started enforcing this at the prompt level and it helps, but not enough.
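One way to enforce it, sketched with hypothetical names (`generate` stands in for whatever LLM call produces the test cases, and the marker list is illustrative): check the output for negative cases and regenerate with an explicit reminder when they're missing.

```python
# Illustrative markers for "this test exercises a bad input".
NEGATIVE_MARKERS = ("invalid", "empty", "too long", "malformed", "unauthorized")

def has_negative_coverage(test_cases, min_negative=1):
    negatives = [t for t in test_cases
                 if any(m in t.lower() for m in NEGATIVE_MARKERS)]
    return len(negatives) >= min_negative

def enforce_negative_cases(generate, requirement, max_retries=2):
    """Regenerate with an explicit negative-case reminder until the
    output stops pretending users only enter valid data."""
    prompt = requirement
    for _ in range(max_retries + 1):
        cases = generate(prompt)
        if has_negative_coverage(cases):
            return cases
        prompt = (requirement +
                  "\nInclude negative cases: invalid, empty, and malformed inputs.")
    return cases
```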
Screenshot analysis falls apart on dense data tables. It hallucinates values that aren't in the image. We strip table findings out of the output unless the table is fully visible and readable in the original.
And the "assistant that remembers your project" thing is kind of an illusion. The models don't remember between sessions. We persist context to our own storage and rebuild it on every call. A lot of serialization code is holding that illusion together, and when it breaks, the model suddenly doesn't know your project anymore. We're not sure there's a clean fix.
The part I keep coming back to
The reason QA still needs humans isn't sentimental. It's that an agent can run a thousand tests against your app overnight and report everything green while an actual user can't complete a purchase because the spinner won't stop spinning. Nobody told the tests to check the spinner. Nobody thought to.
I've watched a developer wire Playwright up to an MCP server, get green all the way down, and ship something broken. The tests weren't wrong. They just weren't asking the right questions. Agents are getting pretty good at answering questions. They're bad at deciding which questions are worth asking in the first place.
That's the piece we're not trying to automate.
BugBoard is free to start at bugboard.co. Flows is in the Chrome store. If you try them and something breaks, please tell us. We learn more from the runs where the illusion falls apart than the ones where it holds.
More of this at betterqa.co/blog.