If you're an SDET or a frontend engineer, you know the drill.
You're sipping your morning coffee when a Slack alert pops up: "CI Build Failed: E2E Tests". You open GitHub Actions, download the 150MB trace.zip artifact, run npx playwright show-trace, wait for the UI to load, click through the timeline, dig into the network tabs, and finally spot the 500 error or the missing DOM element.
Then — because it's 2026 and we use AI for everything — you copy the error log, grab a snippet of the HTML, and paste it into Claude or ChatGPT: "Why did this fail?"
It's tedious. It's manual. It's exactly the kind of repetitive work AI was supposed to eliminate.
The problem? LLMs can't natively read a binary trace.zip. And even if you extract the raw JSON, it's massive — often exceeding the context window with useless static assets and bloated DOM dumps.
This article walks through how I built an open-source MCP server that solves this.
What is MCP?
The Model Context Protocol is an open standard that lets AI agents securely call external tools from within a conversation. Instead of pasting data into a chat window, the agent calls a tool, gets a structured response, and reasons over it — all autonomously.
Think of it as a type-safe API layer between your AI and your local environment.
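To make that concrete, here's a minimal sketch of what a tool registration looks like with the official MCP TypeScript SDK. The tool name and response body below are placeholders for illustration, not this server's actual code:

```typescript
// Minimal MCP tool sketch using the official TypeScript SDK.
// The tool name and payload are illustrative, not the real server's implementation.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "trace-decoder-demo", version: "0.1.0" });

// The agent sees a typed input schema; the handler returns structured
// text the agent can reason over — no copy-pasting into a chat window.
server.tool(
  "get_trace_summary",
  { trace_path: z.string().describe("Absolute path to trace.zip") },
  async ({ trace_path }) => ({
    content: [{ type: "text", text: `Summary for ${trace_path}: ...` }],
  })
);

await server.connect(new StdioServerTransport());
```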
The Architecture
The server — playwright-trace-decoder-mcp — unpacks the trace.zip, parses its internal JSONL files, and exposes 13 focused tools the agent can call to investigate a failure.
trace.zip
├── *.trace     → JSONL: before/after action pairs, console events, DOM snapshots
├── *.network   → JSONL: HAR resource-snapshot entries
└── resources/
    ├── page@*.jpeg → screenshots taken during the run
    └── ...         → fonts, stylesheets, other captured resources
Three design decisions keep it practical (see the sketch after the list):
1. In-process LRU cache. The parser caches results keyed by path + mtime. Re-reading the same unmodified trace costs zero I/O. The cache is capped at 50 entries so a long-running CI server doesn't leak memory.
2. Streaming JSONL parser. Trace files are read line-by-line, never fully buffered. No OOM on 500MB traces.
3. Pagination everywhere. Every list-returning tool takes limit and offset with a has_more flag. The agent never accidentally dumps 800 actions into its context window.
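Here's roughly what those three decisions look like together — a minimal sketch using only Node built-ins, with illustrative names rather than the server's actual internals:

```typescript
// Sketch of the parsing core: mtime-keyed LRU cache, streaming JSONL
// parse, and pagination. Names here are illustrative assumptions.
import { createReadStream, statSync } from "node:fs";
import { createInterface } from "node:readline";

type TraceEvent = Record<string, unknown>;

// 1. LRU cache keyed by path + mtime: an unmodified file is a guaranteed hit.
const MAX_ENTRIES = 50;
const cache = new Map<string, TraceEvent[]>();

async function parseJsonl(path: string): Promise<TraceEvent[]> {
  const key = `${path}:${statSync(path).mtimeMs}`;
  const hit = cache.get(key);
  if (hit) {
    cache.delete(key); // re-insert to mark as most recently used
    cache.set(key, hit);
    return hit;
  }

  // 2. Streaming parse: one line at a time, never the whole file in memory.
  const events: TraceEvent[] = [];
  const lines = createInterface({ input: createReadStream(path) });
  for await (const line of lines) {
    if (line.trim()) events.push(JSON.parse(line));
  }

  if (cache.size >= MAX_ENTRIES) {
    cache.delete(cache.keys().next().value!); // evict the oldest entry
  }
  cache.set(key, events);
  return events;
}

// 3. Pagination: every list-returning tool slices and reports has_more.
function paginate<T>(items: T[], limit = 50, offset = 0) {
  return {
    items: items.slice(offset, offset + limit),
    has_more: offset + limit < items.length,
  };
}
```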
The Killer Feature: ARIA-as-YAML
Raw HTML snapshots are enormous. A typical DOM snapshot from a real app easily runs 50,000+ tokens — blowing up the context window before the agent can even reason about it.
The get_aria_accessibility_tree tool translates the raw snapshot into a compact YAML representation of the accessibility tree:
Instead of:
<div class="flex flex-col gap-4 p-8 bg-white rounded-xl shadow-lg">
<h1 class="text-2xl font-bold text-gray-900">Dashboard</h1>
<button class="px-4 py-2 bg-blue-600 text-white rounded-lg hover:bg-blue-700">
Submit
</button>
</div>
The agent sees:
- document
  - main
    - heading "Dashboard"
    - button "Submit"
~90% token reduction. The agent gets exactly what it needs to know: Was the element actually rendered? Was it visible? Was it a button or a div?
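The flattening itself is straightforward. Here's a minimal sketch, assuming the parsed snapshot exposes role, name, and children (the AriaNode shape is illustrative, not the server's real type):

```typescript
// Sketch of the snapshot-to-YAML flattening. The AriaNode shape is an
// assumption about what the parsed accessibility tree looks like.
interface AriaNode {
  role: string;          // e.g. "button", "heading"
  name?: string;         // accessible name, e.g. "Submit"
  children?: AriaNode[];
}

function toYaml(node: AriaNode, depth = 0): string {
  const indent = "  ".repeat(depth);
  const label = node.name ? `${node.role} "${node.name}"` : node.role;
  const lines = [`${indent}- ${label}`];
  for (const child of node.children ?? []) {
    lines.push(toYaml(child, depth + 1));
  }
  return lines.join("\n");
}

// The <div>/<h1>/<button> markup above collapses to four short lines:
console.log(
  toYaml({
    role: "document",
    children: [
      {
        role: "main",
        children: [
          { role: "heading", name: "Dashboard" },
          { role: "button", name: "Submit" },
        ],
      },
    ],
  })
);
```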
The Full Toolset
Tools are grouped by how an agent should sequence them:
Inspection
- get_test_metadata — browser, platform, viewport, test title
- get_trace_summary — failing action + error message + total action count
- get_action_timeline — paginated timeline of all actions with locators and timings
- get_filtered_network_logs — only 4xx/5xx responses, static assets stripped
- get_console_errors — JS exceptions and warnings
- get_element_state_at_failure — the exact locator and raw metadata at failure time
DOM / UI Analysis
- get_aria_accessibility_tree — compact YAML DOM at any action, defaults to failure
- get_dom_mutation_delta — what was added/removed from the DOM by a specific action
- get_screenshot_at_failure — base64 JPEG from the trace, closest to the failure timestamp
- analyze_race_conditions — network requests still in-flight when an interaction fired
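The race-condition check, for instance, reduces to an interval-overlap test. A sketch, with field names that are assumptions about the parsed entries rather than the server's real types:

```typescript
// Rough sketch of the overlap check behind analyze_race_conditions.
// NetworkEntry/ActionEntry field names are illustrative assumptions.
interface NetworkEntry { url: string; startMs: number; endMs: number }
interface ActionEntry  { title: string; startMs: number }

function inFlightAt(action: ActionEntry, requests: NetworkEntry[]): NetworkEntry[] {
  // A request "races" the interaction if it started before the action
  // fired but had not yet finished at that moment.
  return requests.filter(
    (r) => r.startMs < action.startMs && r.endMs > action.startMs
  );
}
```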
Root-Cause Analysis
- get_causal_chain_for_failure — chronological chain of actions, network errors, and console errors in the N seconds before failure
- generate_error_signature — stable 12-char SHA-1 hash for grouping duplicate failures across parallel CI runs (see the sketch after this list)
- compare_traces — LCS-aligned diff between a passing and failing run of the same test
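The error signature is conceptually simple: normalize away the volatile parts of the message, hash, truncate. A minimal sketch — the normalization rules here are illustrative, not the server's exact ones:

```typescript
// Sketch of a stable failure signature for deduplicating CI failures.
import { createHash } from "node:crypto";

function errorSignature(testTitle: string, errorMessage: string): string {
  const normalized = errorMessage
    .replace(/\d+/g, "N")               // timeouts, line numbers, ports
    .replace(/https?:\/\/\S+/g, "URL"); // environment-specific hosts
  return createHash("sha1")
    .update(`${testTitle}\n${normalized}`)
    .digest("hex")
    .slice(0, 12); // 12 hex chars: short, yet collision-safe enough for grouping
}
```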
Case Study: Google vs. The Bots
Here's a real example of the server in action.
We had a simple Page Object Model setup testing Google Search. Our test suddenly failed with:
Error: expect(page).toHaveTitle(/жабы/i) failed
Generic. Useless. Normally I'd have to download the trace, scrub through the network tab, and look at screenshots to figure out what happened.
Instead, I pointed the AI at the failing trace.zip and a previous passing trace.zip.
The agent called compare_traces. Here's what came back (simplified):
{
  "first_structural_divergence": {
    "failing_index": 7,
    "description": "after matched action 'page.goto': 0 unmatched in passing, 3 unmatched in failing"
  },
  "network_summary": {
    "only_in_failing": [
      "https://www.google.com/recaptcha/enterprise/...",
      "https://www.google.com/sorry/index?..."
    ]
  }
}
The agent's response:
"The test failed because Google detected browser automation and presented a CAPTCHA. In the failing trace, there are network requests to
recaptcha/enterpriseand a redirect to/sorry/indexthat do not exist in the passing trace. The redirect changed the page title, causing thetoHaveTitleassertion to fail."
Root cause identified in under 5 seconds. No UI opened. No manual log parsing.
When the ARIA Tree Is Empty
Sometimes the page is blank, redirected, or shows an error page — and the ARIA tree comes back as just - document. That's where get_screenshot_at_failure saves you.
The tool scans the zip for resources/page@*.jpeg, parses timestamps from filenames, and returns the screenshot closest to the moment of failure as base64 — along with how many milliseconds before the failure it was taken.
The agent can see exactly what was on screen. In the captcha case, it would have shown the CAPTCHA challenge page directly.
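The selection logic is simple enough to sketch, assuming the filenames embed a millisecond timestamp (an illustrative assumption; the real filename format may differ):

```typescript
// Sketch of the "closest screenshot" selection, given the zip's entry names.
// Assumes names like "resources/page@1234.jpeg", where 1234 is a timestamp.
function closestScreenshot(entryNames: string[], failureTs: number) {
  const shots = entryNames
    .map((name) => {
      const m = name.match(/page@(\d+(?:\.\d+)?)\.jpeg$/);
      return m ? { name, ts: Number(m[1]) } : null;
    })
    .filter((s): s is { name: string; ts: number } => s !== null);

  if (shots.length === 0) return null;
  // Pick the frame with the smallest time distance to the failure.
  return shots.reduce((best, s) =>
    Math.abs(s.ts - failureTs) < Math.abs(best.ts - failureTs) ? s : best
  );
}
```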
Setup
Clone and build:
git clone https://github.com/vola-trebla/playwright-trace-decoder-mcp.git
cd playwright-trace-decoder-mcp
npm install && npm run build
Add to Cursor (.cursor/mcp.json) or VS Code (.vscode/mcp.json):
{
  "mcpServers": {
    "playwright-trace-decoder": {
      "command": "node",
      "args": ["/absolute/path/to/playwright-trace-decoder-mcp/dist/index.js"]
    }
  }
}
Or Claude Code:
claude mcp add playwright-trace-decoder \
node /absolute/path/to/playwright-trace-decoder-mcp/dist/index.js
Restart your editor, and the tools will appear in the agent's available toolset.
The Prompt
Once connected, drop this into your agent:
The CI run failed. Analyze this trace:
/path/to/test-results/my-test/trace.zip
1. get_trace_summary — what failed?
2. get_causal_chain_for_failure — what led up to it?
3. get_aria_accessibility_tree — what did the page look like?
4. get_screenshot_at_failure — if the ARIA tree is unhelpful, show me the screenshot
5. analyze_race_conditions — was a network request still pending?
Give me a root cause analysis.
The Impact
Old flow (10–15 min):
Download trace → npx playwright show-trace → inspect timeline → inspect network → inspect DOM → formulate hypothesis → fix code.
New flow (< 1 min):
Drop trace path into chat → read the root cause analysis → approve the fix.
For flaky tests, compare_traces makes it even more powerful. Instead of manually diffing two runs, the agent aligns the action sequences using LCS and tells you exactly where the two runs diverged — structurally, by timing, or by network activity.
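Under the hood, that alignment is the classic longest-common-subsequence dynamic program, run over action titles. A sketch of the core:

```typescript
// Sketch of the LCS alignment behind compare_traces: match action titles
// between two runs; anything outside the LCS is an unmatched (divergent) action.
function lcsTable(a: string[], b: string[]): number[][] {
  // dp[i][j] = LCS length of a[0..i) and b[0..j)
  const dp = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0)
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] =
        a[i - 1] === b[j - 1]
          ? dp[i - 1][j - 1] + 1
          : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  return dp;
}

// Walk the table back to recover which actions matched in both runs.
function alignedPairs(passing: string[], failing: string[]): Array<[number, number]> {
  const dp = lcsTable(passing, failing);
  const pairs: Array<[number, number]> = [];
  let i = passing.length, j = failing.length;
  while (i > 0 && j > 0) {
    if (passing[i - 1] === failing[j - 1]) {
      pairs.unshift([i - 1, j - 1]);
      i--; j--;
    } else if (dp[i - 1][j] >= dp[i][j - 1]) i--;
    else j--;
  }
  return pairs; // gaps between consecutive pairs are the structural divergences
}
```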
The server is open source and under active development. Contributions welcome.
GitHub: vola-trebla/playwright-trace-decoder-mcp
Happy testing. 🎭