I was spending hours every week manually fixing the same flaky Cypress tests. The failures had patterns. So I built a bot to recognize them.
The problem
In a large frontend monorepo, flaky tests are a constant drain. They block merges, slow down every pipeline run, and hold up the whole team. And if you just skip them instead of fixing them, which is the easy short-term answer, you're slowly opening the door to production failures nobody saw coming.
What I noticed after months of fixing them: most failures aren't random. They're patterns. The same handful of root causes kept showing up in different tests, in different files, written by different engineers at different times: hardcoded cy.wait() calls, race conditions on async state, improper use of selectors, and tests that assume backend services will respond the same way they do in production (they don't, not reliably).
The knowledge to fix them existed. It just wasn't encoded anywhere you could access while staring at a failing pipeline at 4pm.
The idea
What if I encoded that knowledge into a retrieval system and let an AI reason over it when analyzing a failing test?
That's the core of what I built — the Pipeline Pathologist. Give it a knowledge base of known failure patterns, feed it a failing test, and it retrieves the closest match and reasons over the actual code. Not "here's a general best practice." Specific: "This test uses cy.wait(2000) on line 47. Replace with cy.intercept() + cy.wait('@alias'). Here's why and here's the code."
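To make the "cy.wait(2000) on line 47" kind of finding concrete, here is a minimal Python sketch of how a hardcoded wait could be located with its line number. It is an illustration only, not the Pipeline Pathologist's actual code: the regex, the function name find_hardcoded_waits, and the sample test are all assumptions.

```python
import re

# Illustrative pattern (an assumption): matches cy.wait(<number>) but not
# alias waits such as cy.wait('@getUser').
HARDCODED_WAIT = re.compile(r"cy\.wait\(\s*(\d+)\s*\)")


def find_hardcoded_waits(test_source: str) -> list[tuple[int, int]]:
    """Return (line_number, milliseconds) for every numeric cy.wait() call."""
    findings = []
    for line_number, line in enumerate(test_source.splitlines(), start=1):
        for match in HARDCODED_WAIT.finditer(line):
            findings.append((line_number, int(match.group(1))))
    return findings


# Hypothetical Cypress test used only as input for the sketch.
sample_test = """\
it('loads the dashboard', () => {
  cy.visit('/dashboard');
  cy.wait(2000);
  cy.wait('@getUser');
  cy.get('[data-testid=welcome]').should('be.visible');
});
"""

for line_number, ms in find_hardcoded_waits(sample_test):
    print(
        f"line {line_number}: cy.wait({ms}) is a fixed delay; "
        "consider cy.intercept() plus cy.wait('@alias')"
    )
```

A check like this only finds the symptom. Deciding which request to intercept and what alias to wait on is the part that needs the retrieved pattern and the model's reasoning over the real code.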
The build
The stack is deliberately simple:
- Knowledge base: A JSON file (known_issues.json). Each entry has an ID, description, symptoms, root cause, and a fix recipe
- RAG layer: Python, using embeddings to match incoming test code against the pattern library
- LLM reasoning: Claude (via API) generates the recommendation in context
- Integration: MCP connectors to Jira and GitLab — the bot can pull the ticket, understand the context, and comment on a merge request
The whole thing runs locally — intentional for the pilot. No infrastructure, no approval process. Run it, point it at a failing test, get a recommendation.
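Here is a rough sketch of how the knowledge base and the retrieval step could fit together. Everything in it is an assumption made for illustration: the JSON field names, the two sample entries, and especially the similarity function, which uses a plain bag-of-words cosine as a stand-in for a real embedding model so the example runs with the standard library alone.

```python
import json
import math
import re
from collections import Counter

# Hypothetical shape of known_issues.json; field names and entries are illustrative.
KNOWN_ISSUES_JSON = """
[
  {
    "id": "hardcoded-wait",
    "description": "Test pauses for a fixed time instead of waiting on a network call",
    "symptoms": ["cy.wait(2000)", "element not found after timeout"],
    "root_cause": "Fixed delay races against a slow backend response",
    "fix_recipe": "Register cy.intercept() with an alias and wait on cy.wait('@alias')"
  },
  {
    "id": "brittle-selector",
    "description": "Test selects elements by generated CSS class names",
    "symptoms": ["cy.get('.css-1x2y3z')", "selector matches nothing after a rebuild"],
    "root_cause": "Class names change between builds",
    "fix_recipe": "Select by a data-testid attribute"
  }
]
"""


def tokenize(text: str) -> Counter:
    """Toy 'embedding': token counts. A real setup would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9@]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def entry_text(entry: dict) -> str:
    """Text used for matching: description, symptoms, and root cause."""
    return " ".join([entry["description"], *entry["symptoms"], entry["root_cause"]])


def retrieve(test_code: str, issues: list[dict]) -> dict:
    """Return the known issue whose text is most similar to the failing test."""
    query = tokenize(test_code)
    return max(issues, key=lambda entry: cosine(query, tokenize(entry_text(entry))))


issues = json.loads(KNOWN_ISSUES_JSON)
failing_test = (
    "cy.visit('/orders'); cy.wait(2000); cy.contains('Order #1').should('be.visible');"
)
match = retrieve(failing_test, issues)
print(match["id"], "->", match["fix_recipe"])
```

The retrieved entry, together with the failing test's source, is what would then be handed to the LLM so the recommendation is grounded in a known pattern instead of general advice.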
What happened
I used it on my own flaky test fixes first. The first attempt was rough — the recommendations were too generic to be useful, and I realized quickly that the problem wasn't the RAG setup. It was what I had put in the knowledge base.
The first version had all the standard Cypress recommendations: avoid cy.wait(), use cy.intercept(), prefer data-testid over CSS selectors. All correct. All useless for retrieval. When a failing test came in, the bot returned advice that didn't match the actual problem.
What worked was specificity: our particular QA environment quirks, our custom test helpers, the exact error signatures we kept seeing. The moment the patterns reflected our codebase instead of the internet's generic best practices, the recommendations became immediately useful.
Then a colleague tried it on a completely different ticket. It worked for her too.
That was the moment it stopped being a personal script and started being something worth sharing.
It worked. Now what?
Once the team started using it, the question changed. It went from "does this work?" to "how do we make this the default?"
The immediate work was reliability — getting it to install cleanly on any machine, tightening the prompts, making the integrations more solid. After that, the goal is to shift left: instead of fixing flaky tests after they merge, catch the patterns during code review before they ever hit main.
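As a sketch of what shifting left could look like, the snippet below scans a unified diff and flags hardcoded waits on newly added lines, which is the information a review-time check would need. It is an assumption-laden illustration, not the tool's implementation: the function name flag_added_waits, the single regex, and the sample diff are hypothetical, and the diff parsing is deliberately simplified (it does not handle renames, binary files, or added lines that themselves begin with "++ ").

```python
import re

# Hunk header such as "@@ -10,4 +10,5 @@"; captures the new-file start line.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")
# Illustrative pattern for a fixed delay: cy.wait(<number>).
HARDCODED_WAIT = re.compile(r"cy\.wait\(\s*\d+\s*\)")


def flag_added_waits(diff_text: str) -> list[tuple[str, int]]:
    """Return (file_path, new_line_number) for added lines with a numeric cy.wait()."""
    findings = []
    current_file = ""
    new_line = 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # Path of the new version of the file, without git's "b/" prefix.
            current_file = line[4:].removeprefix("b/")
        elif line.startswith("--- "):
            continue
        elif hunk := HUNK_HEADER.match(line):
            new_line = int(hunk.group(1))
        elif line.startswith("+"):
            if HARDCODED_WAIT.search(line):
                findings.append((current_file, new_line))
            new_line += 1
        elif line.startswith(" "):
            # Context lines exist in the new file, so they advance the counter.
            new_line += 1
        # Removed lines ("-") do not exist in the new file and are skipped.
    return findings


# Hypothetical merge request diff used only as input for the sketch.
sample_diff = """\
diff --git a/cypress/e2e/orders.cy.js b/cypress/e2e/orders.cy.js
--- a/cypress/e2e/orders.cy.js
+++ b/cypress/e2e/orders.cy.js
@@ -10,4 +10,5 @@ describe('orders', () => {
   it('shows the order list', () => {
     cy.visit('/orders');
+    cy.wait(3000);
     cy.get('[data-testid=order-row]').should('exist');
   });
"""

for path, line_number in flag_added_waits(sample_diff):
    print(f"{path}:{line_number}: new hardcoded cy.wait() introduced in this change")
```

Looking only at added lines keeps the check focused on what the author is introducing, so a reviewer is not flooded with findings from code that was already on main.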
Where I want to take it eventually: the system opens the merge request itself. It detects the pattern, applies the recipe, writes the fix, and opens the merge request. You review and approve. The investigation and the code are automated. I don't know exactly when we get there, but the architecture already supports it.
A few things I didn't expect
The technical build was the easy part. Sitting down and documenting every root cause I'd seen over two years of fixing these things — that took longer than writing the Python. But that investment compounds. Every new fix we document makes it more useful for everyone, and the knowledge base is now a shared team resource.
The real insight here isn't about RAG. It's about the knowledge base. The AI layer is the easy part to get right — retrieval, embeddings, LLM reasoning, that's all solvable. What you put in the knowledge base is where most of the work actually lives. Garbage in, garbage out, except with RAG it's less obvious because the outputs still sound confident.
Prompts need version control. I started out improvising prompts at runtime, and the output quality was inconsistent in ways I couldn't debug. Once I started versioning prompts the same way I version code (committing changes, noting what improved), quality got predictable fast.
The bot didn't sell itself. I spent a week assuming it would. Then I wrote one document explaining how to install and run it, and people started using it.
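One way to make prompt versions traceable, sketched here under assumptions: keep each prompt as a named, versioned template in the repository and record its version and content fingerprint alongside every recommendation. The VersionedPrompt class, its field names, and the template text are hypothetical and only illustrate the idea.

```python
import hashlib
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class VersionedPrompt:
    """A prompt template that lives in version control next to the code."""

    name: str
    version: str
    template: str

    @property
    def fingerprint(self) -> str:
        """Short content hash, so an output can be traced to the exact prompt text."""
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]

    def render(self, **fields: str) -> str:
        """Fill in the template; a missing field raises KeyError instead of passing silently."""
        return Template(self.template).substitute(**fields)


# Hypothetical prompt; the wording is illustrative only.
ANALYZE_V2 = VersionedPrompt(
    name="analyze_failure",
    version="2",
    template=(
        "You are reviewing a flaky Cypress test.\n"
        "Known pattern: $pattern\n"
        "Test code:\n$test_code\n"
        "Recommend a specific fix and cite the line."
    ),
)

prompt_text = ANALYZE_V2.render(pattern="hardcoded-wait", test_code="cy.wait(2000);")

# Stored with each recommendation so a quality change can be tied to a prompt change.
record = {
    "prompt": ANALYZE_V2.name,
    "version": ANALYZE_V2.version,
    "fingerprint": ANALYZE_V2.fingerprint,
}
print(record)
```

With the fingerprint stored next to each output, a drop in recommendation quality can be compared against the commit that changed the template, which is what makes the inconsistency debuggable.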
Where this is going
The question I keep coming back to: at what point does a system like this generate more value per hour than a human doing the same work? I don't have the answer yet. But I think it's closer than most teams assume.
If you've built something similar or dealt with this at scale, I'd genuinely like to know how you approached the knowledge base problem. That's the part I've found the least written about.