Albert Alov

Stop Guessing Why Your Tests Flake. Build a Knowledge Graph Instead.

Flaky tests are the silent killers of engineering velocity. One day your CI is green, the next it's a "random" red, and by next week your team is ignoring 30% of the test suite and hitting Retry on everything.

The typical response is reactive: look at the last trace, try a fix, hope. But the trace only tells you what failed in this one run. It doesn't tell you whether this test has been silently flaking for two weeks, or only fails on Firefox in CI, or whether a recent deploy made things worse.

What if you treated flakiness as a data problem?

Enter flakiness-knowledge-graph-mcp.


🕸️ What is a Test Knowledge Graph?

Most reporters give you a snapshot of a single run. A knowledge graph gives you the accumulated history, trends, and environmental context of every test over time.

This MCP server pairs a custom Playwright Reporter with a SQLite backend to build a persistent memory of your test suite's behavior. It doesn't just know that a test failed — it knows it fails 15% of the time, exclusively on Firefox, and that the rate has been climbing steadily for the past two weeks.


🛠 Architecture: Three Moving Parts

1. The Playwright Reporter

Add it to playwright.config.ts once. Every time a test finishes, the reporter captures:

  • Test ID, title, and suite
  • Outcome — passed, failed, flaky, or skipped
  • Duration in milliseconds
  • Environment — browser and OS
  • Error message (first 1000 characters)
  • Error message (first 1000 characters)
  • Retry attempt number

The reporter is designed to survive parallel workers: writes are serialized through a per-path promise queue, and each write re-reads from disk first to pick up changes from other workers. No data gets lost or overwritten.
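
To make that concrete, here is a minimal sketch of the shape such a reporter takes. The outcome mapping, field names, and the persistRun helper are illustrative stand-ins, not the project's actual source:

import type { Reporter, TestCase, TestResult } from '@playwright/test/reporter';

// Normalize Playwright's outcome() values to the vocabulary used above.
const OUTCOME = {
  expected: 'passed',
  unexpected: 'failed',
  flaky: 'flaky',
  skipped: 'skipped',
} as const;

// Hypothetical persistence helper; a sql.js version is sketched in the next section.
declare function persistRun(dbPath: string, record: object): Promise<void>;

class FlakinessReporter implements Reporter {
  // Writes chain onto this promise, so they execute strictly one at a time
  // (the real reporter keys its queue per DB path).
  private queue: Promise<void> = Promise.resolve();

  constructor(private options: { dbPath: string }) {}

  onTestEnd(test: TestCase, result: TestResult) {
    const record = {
      testId: test.id,
      title: test.title,
      suite: test.parent.title,
      outcome: OUTCOME[test.outcome()],
      durationMs: result.duration,
      browser: test.parent.project()?.name ?? 'unknown',
      os: process.platform,
      error: result.error?.message?.slice(0, 1000),
      retry: result.retry,
    };
    this.queue = this.queue.then(() => persistRun(this.options.dbPath, record));
  }
}

export default FlakinessReporter;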

2. The SQLite Backend

Instead of heavy infrastructure, everything goes into a local .db file via sql.js — SQLite compiled to WebAssembly, so it runs as pure JavaScript. No native compilation, no Docker, no server. The file travels with your project.
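
A single write with sql.js looks roughly like this (the test_runs schema here is an assumption for illustration, not necessarily the project's real one):

import fs from 'node:fs';
import initSqlJs from 'sql.js';

// Read-modify-write a single run record.
async function persistRun(dbPath: string, run: {
  testId: string; title: string; suite: string; outcome: string;
  durationMs: number; browser: string; os: string; error?: string; retry: number;
}): Promise<void> {
  const SQL = await initSqlJs();
  // Re-read the file so writes from other workers are picked up first.
  const db = fs.existsSync(dbPath)
    ? new SQL.Database(fs.readFileSync(dbPath))
    : new SQL.Database();

  db.run(`CREATE TABLE IF NOT EXISTS test_runs (
    test_id TEXT, title TEXT, suite TEXT, outcome TEXT, duration_ms INTEGER,
    browser TEXT, os TEXT, error TEXT, retry INTEGER,
    timestamp TEXT DEFAULT CURRENT_TIMESTAMP
  )`);
  db.run(
    `INSERT INTO test_runs
       (test_id, title, suite, outcome, duration_ms, browser, os, error, retry)
     VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)`,
    [run.testId, run.title, run.suite, run.outcome, run.durationMs,
     run.browser, run.os, run.error ?? null, run.retry],
  );

  // sql.js is fully in-memory; export() serializes the database back to bytes.
  fs.writeFileSync(dbPath, Buffer.from(db.export()));
  db.close();
}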

3. The MCP Interface

By exposing the database through the Model Context Protocol, your AI assistant (Claude, Cursor) can query your full test history conversationally. You ask a question; the AI calls the right tool, reads the data, and synthesizes an answer.
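
Here is a sketch of what registering one of those tools looks like with the TypeScript MCP SDK. The tool name matches the server's, but the handler body and the queryFlakyTests helper are hypothetical:

import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { z } from 'zod';

// Hypothetical helper; it could run the GET_FLAKY_TESTS query sketched below.
declare function queryFlakyTests(dbPath: string, minRuns: number): Promise<object[]>;

const server = new McpServer({ name: 'flakiness-knowledge-graph', version: '1.0.0' });

// One tool registration; the real server exposes six of these.
server.tool(
  'get_flaky_tests',
  { dbPath: z.string(), minRuns: z.number().default(5) },
  async ({ dbPath, minRuns }) => ({
    content: [{ type: 'text', text: JSON.stringify(await queryFlakyTests(dbPath, minRuns)) }],
  }),
);

await server.connect(new StdioServerTransport());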


🧠 Six Analytical Tools

get_flaky_tests

Returns tests ranked by flakiness rate — the percentage of runs that ended as failed or flaky. Filters out one-off failures with a minimum run threshold so you see signal, not noise.
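
Conceptually that's a single aggregation. Against the illustrative test_runs schema from earlier, the query might look like:

// Illustrative: rank tests by the share of runs that were failed or flaky.
const GET_FLAKY_TESTS = `
  SELECT test_id, title,
         COUNT(*) AS runs,
         ROUND(100.0 * SUM(outcome IN ('failed', 'flaky')) / COUNT(*), 1) AS flakiness_pct
  FROM test_runs
  GROUP BY test_id
  HAVING runs >= ?            -- minimum run threshold
  ORDER BY flakiness_pct DESC
`;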

get_test_history

Full run-by-run history for a specific test: status, duration, error, retry count, browser, OS. Lets you spot exactly when a test started misbehaving.
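
Against the same assumed schema, this is little more than an ordered select:

// Illustrative: every recorded run for one test, newest first.
const GET_TEST_HISTORY = `
  SELECT timestamp, outcome, duration_ms, error, retry, browser, os
  FROM test_runs
  WHERE test_id = ?
  ORDER BY timestamp DESC
`;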

get_failure_patterns

Breaks down failure rates by browser × OS combination. If a test has a 100% failure rate on WebKit and 0% on Chromium, the answer isn't "fix the test" — it's "fix the Safari-specific behavior."
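
In SQL terms (same assumed schema), that's the flakiness aggregation grouped by environment:

// Illustrative: failure rate per browser/OS combination for one test.
const GET_FAILURE_PATTERNS = `
  SELECT browser, os,
         COUNT(*) AS runs,
         ROUND(100.0 * SUM(outcome IN ('failed', 'flaky')) / COUNT(*), 1) AS failure_pct
  FROM test_runs
  WHERE test_id = ?
  GROUP BY browser, os
  ORDER BY failure_pct DESC
`;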

get_slow_tests

Ranks tests by average and max duration. Tells you which tests are the bottleneck in your CI pipeline and where optimization effort will have the most impact.
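
A plausible version of that query, again on the assumed schema:

// Illustrative: rank tests by average duration, with max as a secondary signal.
const GET_SLOW_TESTS = `
  SELECT test_id, title,
         ROUND(AVG(duration_ms)) AS avg_ms,
         MAX(duration_ms) AS max_ms
  FROM test_runs
  GROUP BY test_id
  ORDER BY avg_ms DESC
`;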

get_error_groups ⭐ new

Groups failing tests by error message signature (first 200 characters). When 10 tests fail with the exact same Error: connect ECONNREFUSED 127.0.0.1:3000, that's one broken backend service — not 10 separate bugs. This tool surfaces that signal immediately.
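
Because the grouping key is just a prefix of the error text, the query stays simple (illustrative, same assumed schema):

// Illustrative: cluster failures by the first 200 chars of their error message.
const GET_ERROR_GROUPS = `
  SELECT substr(error, 1, 200) AS signature,
         COUNT(DISTINCT test_id) AS affected_tests,
         COUNT(*) AS failures
  FROM test_runs
  WHERE outcome IN ('failed', 'flaky')
    AND error IS NOT NULL
    AND timestamp >= datetime('now', '-24 hours')
  GROUP BY signature
  ORDER BY failures DESC
`;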

get_flakiness_trend ⭐ new

Returns daily flakiness rate for a specific test over the last N days. Instead of a single number, you get a timeline: 10% → 12% → 15% → 40% → 80%. An AI can read that curve and tell you exactly when things started deteriorating and whether a recent deploy correlates with the spike.
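
Bucketing runs by day produces that timeline. A sketch on the assumed schema, with the window hardcoded to 30 days:

// Illustrative: daily flakiness rate for one test over the last 30 days.
const GET_FLAKINESS_TREND = `
  SELECT date(timestamp) AS day,
         COUNT(*) AS runs,
         ROUND(100.0 * SUM(outcome IN ('failed', 'flaky')) / COUNT(*), 1) AS flakiness_pct
  FROM test_runs
  WHERE test_id = ?
    AND timestamp >= date('now', '-30 days')
  GROUP BY day
  ORDER BY day
`;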


🕵️‍♂️ Case Study 1: The "Heisenbug" Hunt

A checkout test fails once every 10 runs. You open the trace — everything looks fine. You run it locally — it passes.

You ask the AI:

"Analyze the flakiness of the 'Checkout Flow' test."

The AI calls get_failure_patterns:

"This test has failed 5 times in the last 48 hours. 100% of failures occurred on WebKit. Chromium and Firefox have a 100% pass rate. This points to a Safari-specific timing or rendering issue, not a backend bug."

90% of the guesswork, gone.


🔥 Case Study 2: The Systemic Failure

Monday morning, CI shows 12 tests failing. Your first instinct: 12 separate problems to investigate.

You ask:

"Are these failures related? Check error groups for the last 24 hours."

The AI calls get_error_groups:

"All 12 tests share the same error signature: Error: Timeout 30000ms exceeded waiting for locator. 12 affected tests, 47 total failures in 24 hours. This is almost certainly a shared infrastructure issue — slow test environment or a missing wait, not 12 individual test bugs."

One root cause. One fix.


📉 Case Study 3: The Quiet Regression

A test has been flaky "forever" — the team just lives with it. But this week, something feels worse.

You ask:

"Show me the flakiness trend for the 'User Login' test over the last 30 days."

The AI calls get_flakiness_trend:

"The test was stable at 8–12% failure rate for three weeks. Starting May 10th, the rate jumped to 65%. That date aligns with the auth middleware refactor in PR #347."

Now you have a cause, a date, and a PR to bisect.


🚀 The Full Pipeline: Trace Decoder + Knowledge Graph

The real power comes from combining this with Playwright Trace Decoder MCP:

  1. Knowledge Graph — which test is flaky, on what environment, and is it getting worse?
  2. Trace Decoder — what exactly happened in the latest failure at the network/DOM level?

Together they form a complete AI-driven debugging pipeline: historical context from the knowledge graph, precise failure anatomy from the trace decoder. Minutes of manual investigation become seconds of automated insight.


⚡️ Quick Start

1. Build from source:

git clone https://github.com/vola-trebla/flakiness-knowledge-graph-mcp.git
cd flakiness-knowledge-graph-mcp
npm install && npm run build

2. Add the reporter to playwright.config.ts:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  reporter: [
    ['html'],
    [
      '/absolute/path/to/flakiness-knowledge-graph-mcp/dist/reporter.js',
      { dbPath: './flakiness.db' }
    ],
  ],
});

Run your tests — the reporter writes every result automatically.

3. Connect the MCP server:

Claude Code:

claude mcp add flakiness-knowledge-graph \
  node /absolute/path/to/flakiness-knowledge-graph-mcp/dist/index.js

Cursor / VS Code (.cursor/mcp.json):

{
  "mcpServers": {
    "flakiness-knowledge-graph": {
      "command": "node",
      "args": ["/absolute/path/to/flakiness-knowledge-graph-mcp/dist/index.js"]
    }
  }
}

4. Ask your AI agent:

The DB is at /my-project/flakiness.db.

1. get_flaky_tests — what's most unreliable this week?
2. get_flakiness_trend for the top result — is it getting worse?
3. get_error_groups — do any failures share a root cause?
4. get_failure_patterns — browser or OS specific?

Stop retrying. Start analyzing. 🐸🕸️✨
