If you're an SDET or a frontend engineer, you know the drill.
You're sipping your morning coffee when a Slack alert pops up: "CI Build Failed: E2E Tests". You open GitHub Actions, download the 150MB trace.zip artifact, run npx playwright show-trace, wait for the UI to load, click through the timeline, dig into the network tabs, and finally spot the 500 error or the missing DOM element.
Then — because it's 2026 and we use AI for everything — you copy the error log, grab a snippet of the HTML, and paste it into Claude or ChatGPT: "Why did this fail?"
It's tedious. It's manual. It's exactly the kind of repetitive work AI was supposed to eliminate.
The problem? LLMs can't natively read a binary trace.zip. And even if you extract the raw JSON, it's massive — often exceeding the context window with useless static assets and bloated DOM dumps.
This article walks through how I built an open-source MCP server that solves this.
What is MCP?
The Model Context Protocol is an open standard that lets AI agents securely call external tools from within a conversation. Instead of pasting data into a chat window, the agent calls a tool, gets a structured response, and reasons over it — all autonomously.
Think of it as a type-safe API layer between your AI and your local environment.
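To make that concrete, here's a minimal sketch of what a tool registration looks like with the official MCP TypeScript SDK. The tool name and response body below are placeholders for illustration, not this server's actual code:

```typescript
// Minimal MCP tool sketch using the official TypeScript SDK.
// The tool name and payload are illustrative, not the real server's implementation.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "trace-decoder-demo", version: "0.1.0" });

// The agent sees a typed input schema; the handler returns structured
// text the agent can reason over — no copy-pasting into a chat window.
server.tool(
  "get_trace_summary",
  { trace_path: z.string().describe("Absolute path to trace.zip") },
  async ({ trace_path }) => ({
    content: [{ type: "text", text: `Summary for ${trace_path}: ...` }],
  })
);

await server.connect(new StdioServerTransport());
```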
The Architecture
The server — playwright-trace-decoder-mcp — unpacks the trace.zip, parses its internal JSONL files, and exposes 13 focused tools the agent can call to investigate a failure.
trace.zip
├── *.trace     → JSONL: before/after action pairs, console events, DOM snapshots
├── *.network   → JSONL: HAR resource-snapshot entries
└── resources/
    ├── page@*.jpeg → screenshots taken during the run
    └── ...         → fonts, stylesheets, other captured resources
Three design decisions keep it practical (see the sketch after the list):
1. In-process LRU cache. The parser caches results keyed by path + mtime. Re-reading the same unmodified trace costs zero I/O. The cache is capped at 50 entries so a long-running CI server doesn't leak memory.
2. Streaming JSONL parser. Trace files are read line-by-line, never fully buffered. No OOM on 500MB traces.
3. Pagination everywhere. Every list-returning tool takes limit and offset with a has_more flag. The agent never accidentally dumps 800 actions into its context window.
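Here's roughly what those three decisions look like together — a minimal sketch using only Node built-ins, with illustrative names rather than the server's actual internals:

```typescript
// Sketch of the parsing core: mtime-keyed LRU cache, streaming JSONL
// parse, and pagination. Names here are illustrative assumptions.
import { createReadStream, statSync } from "node:fs";
import { createInterface } from "node:readline";

type TraceEvent = Record<string, unknown>;

// 1. LRU cache keyed by path + mtime: an unmodified file is a guaranteed hit.
const MAX_ENTRIES = 50;
const cache = new Map<string, TraceEvent[]>();

async function parseJsonl(path: string): Promise<TraceEvent[]> {
  const key = `${path}:${statSync(path).mtimeMs}`;
  const hit = cache.get(key);
  if (hit) {
    cache.delete(key); // re-insert to mark as most recently used
    cache.set(key, hit);
    return hit;
  }

  // 2. Streaming parse: one line at a time, never the whole file in memory.
  const events: TraceEvent[] = [];
  const lines = createInterface({ input: createReadStream(path) });
  for await (const line of lines) {
    if (line.trim()) events.push(JSON.parse(line));
  }

  if (cache.size >= MAX_ENTRIES) {
    cache.delete(cache.keys().next().value!); // evict the oldest entry
  }
  cache.set(key, events);
  return events;
}

// 3. Pagination: every list-returning tool slices and reports has_more.
function paginate<T>(items: T[], limit = 50, offset = 0) {
  return {
    items: items.slice(offset, offset + limit),
    has_more: offset + limit < items.length,
  };
}
```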
The Killer Feature: ARIA-as-YAML
Raw HTML snapshots are enormous. A typical DOM snapshot from a real app easily runs 50,000+ tokens — blowing up the context window before the agent can even reason about it.
The get_aria_accessibility_tree tool translates the raw snapshot into a compact YAML representation of the accessibility tree:
Instead of:
<div class="flex flex-col gap-4 p-8 bg-white rounded-xl shadow-lg">
<h1 class="text-2xl font-bold text-gray-900">Dashboard</h1>
<button class="px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700">
Submit
</button>
</div>
The agent sees:
- document
  - main
    - heading "Dashboard"
    - button "Submit"
~90% token reduction. The agent gets exactly what it needs to know: Was the element actually rendered? Was it visible? Was it a button or a div?
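The flattening itself is straightforward. Here's a minimal sketch, assuming the parsed snapshot exposes role, name, and children (the AriaNode shape is illustrative, not the server's real type):

```typescript
// Sketch of the snapshot-to-YAML flattening. The AriaNode shape is an
// assumption about what the parsed accessibility tree looks like.
interface AriaNode {
  role: string;          // e.g. "button", "heading"
  name?: string;         // accessible name, e.g. "Submit"
  children?: AriaNode[];
}

function toYaml(node: AriaNode, depth = 0): string {
  const indent = "  ".repeat(depth);
  const label = node.name ? `${node.role} "${node.name}"` : node.role;
  const lines = [`${indent}- ${label}`];
  for (const child of node.children ?? []) {
    lines.push(toYaml(child, depth + 1));
  }
  return lines.join("\n");
}

// The <div>/<h1>/<button> markup above collapses to four short lines:
console.log(
  toYaml({
    role: "document",
    children: [
      {
        role: "main",
        children: [
          { role: "heading", name: "Dashboard" },
          { role: "button", name: "Submit" },
        ],
      },
    ],
  })
);
```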
The Full Toolset
Tools are grouped by how an agent should sequence them:
Inspection
- get_test_metadata — browser, platform, viewport, test title
- get_trace_summary — failing action + error message + total action count
- get_action_timeline — paginated timeline of all actions with locators and timings
- get_filtered_network_logs — only 4xx/5xx responses, static assets stripped
- get_console_errors — JS exceptions and warnings
- get_element_state_at_failure — the exact locator and raw metadata at failure time
DOM / UI Analysis
- get_aria_accessibility_tree — compact YAML DOM at any action, defaults to failure
- get_dom_mutation_delta — what was added/removed from the DOM by a specific action
- get_screenshot_at_failure — base64 JPEG from the trace, closest to the failure timestamp
- analyze_race_conditions — network requests still in-flight when an interaction fired
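The race-condition check, for instance, reduces to an interval-overlap test. A sketch, with field names that are assumptions about the parsed entries rather than the server's real types:

```typescript
// Rough sketch of the overlap check behind analyze_race_conditions.
// NetworkEntry/ActionEntry field names are illustrative assumptions.
interface NetworkEntry { url: string; startMs: number; endMs: number }
interface ActionEntry  { title: string; startMs: number }

function inFlightAt(action: ActionEntry, requests: NetworkEntry[]): NetworkEntry[] {
  // A request "races" the interaction if it started before the action
  // fired but had not yet finished at that moment.
  return requests.filter(
    (r) => r.startMs < action.startMs && r.endMs > action.startMs
  );
}
```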
Root-Cause Analysis
- get_causal_chain_for_failure — chronological chain of actions, network errors, and console errors in the N seconds before failure
- generate_error_signature — stable 12-char SHA-1 hash for grouping duplicate failures across parallel CI runs (see the sketch after this list)
- compare_traces — LCS-aligned diff between a passing and failing run of the same test
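The error signature is conceptually simple: normalize away the volatile parts of the message, hash, truncate. A minimal sketch — the normalization rules here are illustrative, not the server's exact ones:

```typescript
// Sketch of a stable failure signature for deduplicating CI failures.
import { createHash } from "node:crypto";

function errorSignature(testTitle: string, errorMessage: string): string {
  const normalized = errorMessage
    .replace(/\d+/g, "N")               // timeouts, line numbers, ports
    .replace(/https?:\/\/\S+/g, "URL"); // environment-specific hosts
  return createHash("sha1")
    .update(`${testTitle}\n${normalized}`)
    .digest("hex")
    .slice(0, 12); // 12 hex chars: short, yet collision-safe enough for grouping
}
```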
Case Study: Google vs. The Bots
Here's a real example of the server in action.
We had a simple Page Object Model setup testing Google Search. Our test suddenly failed with:
Error: expect(page).toHaveTitle(/жабы/i) failed
Generic. Useless. Normally I'd have to download the trace, scrub through the network tab, and look at screenshots to figure out what happened.
Instead, I pointed the AI at the failing trace.zip and a previous passing trace.zip.
The agent called compare_traces. Here's what came back (simplified):
{
  "first_structural_divergence": {
    "failing_index": 7,
    "description": "after matched action 'page.goto': 0 unmatched in passing, 3 unmatched in failing"
  },
  "network_summary": {
    "only_in_failing": [
      "https://www.google.com/recaptcha/enterprise/...",
      "https://www.google.com/sorry/index?..."
    ]
  }
}
The agent's response:
"The test failed because Google detected browser automation and presented a CAPTCHA. In the failing trace, there are network requests to
recaptcha/enterpriseand a redirect to/sorry/indexthat do not exist in the passing trace. The redirect changed the page title, causing thetoHaveTitleassertion to fail."
Root cause identified in under 5 seconds. No UI opened. No manual log parsing.
When the ARIA Tree Is Empty
Sometimes the page is blank, redirected, or shows an error page — and the ARIA tree comes back as just - document. That's where get_screenshot_at_failure saves you.
The tool scans the zip for resources/page@*.jpeg, parses timestamps from filenames, and returns the screenshot closest to the moment of failure as base64 — along with how many milliseconds before the failure it was taken.
The agent can see exactly what was on screen. In the captcha case, it would have shown the CAPTCHA challenge page directly.
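The selection logic is simple enough to sketch, assuming the filenames embed a millisecond timestamp (an illustrative assumption; the real filename format may differ):

```typescript
// Sketch of the "closest screenshot" selection, given the zip's entry names.
// Assumes names like "resources/page@1234.jpeg", where 1234 is a timestamp.
function closestScreenshot(entryNames: string[], failureTs: number) {
  const shots = entryNames
    .map((name) => {
      const m = name.match(/page@(\d+(?:\.\d+)?)\.jpeg$/);
      return m ? { name, ts: Number(m[1]) } : null;
    })
    .filter((s): s is { name: string; ts: number } => s !== null);

  if (shots.length === 0) return null;
  // Pick the frame with the smallest time distance to the failure.
  return shots.reduce((best, s) =>
    Math.abs(s.ts - failureTs) < Math.abs(best.ts - failureTs) ? s : best
  );
}
```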
Setup
Clone and build:
git clone https://github.com/vola-trebla/playwright-trace-decoder-mcp.git
cd playwright-trace-decoder-mcp
npm install && npm run build
Add to Cursor (.cursor/mcp.json) or VS Code (.vscode/mcp.json):
{
  "mcpServers": {
    "playwright-trace-decoder": {
      "command": "node",
      "args": ["/absolute/path/to/playwright-trace-decoder-mcp/dist/index.js"]
    }
  }
}
Or Claude Code:
claude mcp add playwright-trace-decoder \
node /absolute/path/to/playwright-trace-decoder-mcp/dist/index.js
Restart your editor, and the tools will appear in the agent's available toolset.
The Prompt
Once connected, drop this into your agent:
The CI run failed. Analyze this trace:
/path/to/test-results/my-test/trace.zip
1. get_trace_summary — what failed?
2. get_causal_chain_for_failure — what led up to it?
3. get_aria_accessibility_tree — what did the page look like?
4. get_screenshot_at_failure — if the ARIA tree is unhelpful, show me the screenshot
5. analyze_race_conditions — was a network request still pending?
Give me a root cause analysis.
The Impact
Old flow (10–15 min):
Download trace → npx playwright show-trace → inspect timeline → inspect network → inspect DOM → formulate hypothesis → fix code.
New flow (< 1 min):
Drop trace path into chat → read the root cause analysis → approve the fix.
For flaky tests, compare_traces makes it even more powerful. Instead of manually diffing two runs, the agent aligns the action sequences using LCS and tells you exactly where the two runs diverged — structurally, by timing, or by network activity.
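Under the hood, that alignment is the classic longest-common-subsequence dynamic program, run over action titles. A sketch of the core:

```typescript
// Sketch of the LCS alignment behind compare_traces: match action titles
// between two runs; anything outside the LCS is an unmatched (divergent) action.
function lcsTable(a: string[], b: string[]): number[][] {
  // dp[i][j] = LCS length of a[0..i) and b[0..j)
  const dp = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0)
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] =
        a[i - 1] === b[j - 1]
          ? dp[i - 1][j - 1] + 1
          : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  return dp;
}

// Walk the table back to recover which actions matched in both runs.
function alignedPairs(passing: string[], failing: string[]): Array<[number, number]> {
  const dp = lcsTable(passing, failing);
  const pairs: Array<[number, number]> = [];
  let i = passing.length, j = failing.length;
  while (i > 0 && j > 0) {
    if (passing[i - 1] === failing[j - 1]) {
      pairs.unshift([i - 1, j - 1]);
      i--; j--;
    } else if (dp[i - 1][j] >= dp[i][j - 1]) i--;
    else j--;
  }
  return pairs; // gaps between consecutive pairs are the structural divergences
}
```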
The server is open source and under active development. Contributions welcome.
GitHub: vola-trebla/playwright-trace-decoder-mcp
Happy testing. 🎭