Takayuki Kawazoe

Stop AI from hallucinating E2E test selectors — code analysis + live browser exploration via Claude Agent SDK and 2 MCP servers

Generating E2E tests with an LLM sounds great in a demo. You hand a Playwright test spec to Claude, ask it to produce TypeScript code, and the toy app passes.

Plug it into a real codebase and the wheels come off immediately. The AI confidently generates await page.click('#login-button') for a project where the actual element is <button data-testid="auth-submit">Sign in</button>. Selectors are invented from common patterns ("most projects use #login-button") rather than read from your code or your DOM. Result: nearly every generated test fails on first run, because the selectors are pure hallucination.

This post describes the architecture I shipped to fix that: how I make the AI read the codebase and drive an actual browser before it writes a single line of test code, and why the implementation choices look the way they do.

The architecture in one diagram

The agent gets two MCP (Model Context Protocol) servers wired in parallel:

              ┌──────────────────────────────────┐
              │     Claude Agent SDK Client      │
              │                                  │
   ┌──────────┤   mcp_servers = {                │
   │          │     "local":      <SDK MCP>,     │
   │          │     "playwright": <SDK MCP>      │
   │          │   }                              │
   │          └──────┬───────────────────────────┘
   │                 │
   ▼                 ▼
[ workspace        [ Playwright API
  file read ]        (Chromium running) ]

The "local" MCP server exposes file-read / list-directory tools backed by the developer's repo. The "playwright" MCP server exposes browser-control tools — navigate, snapshot, click, type — backed by an actual chromium.launch() instance.

Critically, both MCP servers run inside the same Python process as the agent client. Microsoft publishes a Docker image for the official Playwright MCP, but I deliberately don't use it; more on that below.

Here's the wiring (Python, with the Anthropic claude-agent-sdk):

# infrastructure/external_apis/exploratory_agent_client.py
from typing import Any

from claude_agent_sdk import (
    ClaudeAgentOptions,
    ClaudeSDKClient,
    create_sdk_mcp_server,
    tool,
)
from playwright.async_api import async_playwright

@tool(
    "browser_navigate",
    "Navigate to a URL in the browser.",
    {"url": str},
)
async def browser_navigate_tool(args: dict[str, Any]) -> dict[str, Any]:
    # `page` is the persistent Playwright Page created at startup; the @tool
    # closure captures it, so every call drives the same browser session
    url = args["url"]
    await page.goto(url, wait_until="domcontentloaded", timeout=30000)
    return {"status": "success", "url": page.url, "title": await page.title()}

Each @tool decorator turns a Python function into an MCP tool the agent can call. The Playwright page object is captured in the closure, so the tool drives a single, persistent browser session that the agent navigates step by step.
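
To complete the picture, here's roughly how the persistent page and the two MCP servers get assembled and handed to the SDK client. This is a condensed sketch rather than the production module: the startup helper, the prompt string, and the workspace tool names are illustrative.

from playwright.async_api import Browser, Page

page: Page | None = None  # module-level so the @tool closures above can see it


async def run_exploratory_agent(start_url: str) -> str:
    global page
    pw = await async_playwright().start()
    browser: Browser = await pw.chromium.launch(headless=True)
    page = await browser.new_page()  # one persistent session for the whole run

    playwright_server = create_sdk_mcp_server(
        name="playwright",
        version="1.0.0",
        tools=[browser_navigate_tool],  # plus snapshot/click/type/console tools
    )
    local_server = create_sdk_mcp_server(
        name="local",
        version="1.0.0",
        tools=[read_file_tool, list_directory_tool],  # workspace tools, sketched in Phase 1 below
    )

    options = ClaudeAgentOptions(
        mcp_servers={"local": local_server, "playwright": playwright_server},
        allowed_tools=[
            "mcp__local__read_file",
            "mcp__local__list_directory",
            "mcp__playwright__browser_navigate",
            "mcp__playwright__browser_snapshot",
        ],
    )

    chunks: list[str] = []
    async with ClaudeSDKClient(options=options) as client:
        await client.query(f"Explore {start_url} and generate Playwright tests.")
        async for message in client.receive_response():
            chunks.append(str(message))  # stream/collect the agent's output

    await browser.close()
    await pw.stop()
    return "\n".join(chunks)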

Why in-process MCP, not Docker MCP

The official Playwright MCP ships as a Docker image, mcr.microsoft.com/playwright/mcp. The "right" thing on paper is to spawn that container as a subprocess of the agent and let stdio do the talking.

In practice, my agent runs inside a Celery worker on ECS Fargate. Docker-in-Docker on Fargate is a tax: Fargate doesn't expose /var/run/docker.sock at all, the workarounds (Sysbox, rootless Docker inside a container, sidecar containers) all add operational complexity, and you lose the ability to attach event listeners to the Playwright browser from the agent process.

In-process MCP via create_sdk_mcp_server() sidesteps all of this. The Playwright page is just a Python variable. I can attach page.on("console", ...) and page.on("request", ...) listeners outside the tools, and have the tools query that captured state:

# listeners attached once, outside any tool; this only works because the
# browser and the MCP tools live in the same process
console_messages: list[dict] = []
network_requests: list[dict] = []

page.on("console", lambda msg: console_messages.append({
    "type": msg.type, "text": msg.text,
}))
page.on("request", lambda req: network_requests.append({
    "url": req.url, "method": req.method, "resource_type": req.resource_type,
}))

@tool(
    "browser_console_messages",
    "Get console messages captured from the browser (errors, warnings, logs).",
    {},
)
async def browser_console_messages_tool(args: dict[str, Any]) -> dict[str, Any]:
    # return only the most recent 50 messages to keep the tool response small
    return {"status": "success", "messages": console_messages[-50:]}

When the agent generates a test and the test causes a JS console error, the agent can call browser_console_messages to see it and write an expect(consoleErrors).toEqual([]) assertion guarding against it. With Docker MCP this is much harder — the listener state lives in a different process from the tool implementation.
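
For completeness, the captured network_requests list is exposed the same way; a minimal sketch mirroring the console tool (this one isn't shown in the wiring above):

@tool(
    "browser_network_requests",
    "Get network requests captured from the browser during exploration.",
    {},
)
async def browser_network_requests_tool(args: dict[str, Any]) -> dict[str, Any]:
    # return only the most recent requests to keep the tool response small
    return {"status": "success", "requests": network_requests[-50:]}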

The 3-phase flow

Just letting the agent loose on "URL + workspace, generate tests" produces sloppy results. I prompt it through three explicit phases.

Phase 1: Code analysis

Tools available: read_file, list_directory (the local MCP).

The system prompt tells the agent to:

  1. Identify the framework (Next.js, React, Vue, etc.) and routing pattern
  2. Locate page components, forms, key UI flows
  3. Extract data-testid / aria-label patterns this project uses
  4. Output a JSON summary of detected_pages, detected_routes, detected_forms

The point of this phase is to make the agent discover this project's selector conventions before it generates any test code. If the project uses data-testid="page-login-submit" consistently, the agent learns the pattern; later, when it sees the actual DOM, it gravitates toward selectors that match the convention.
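
For reference, the local MCP tools are thin wrappers over the workspace. A minimal sketch, where workspace_root, the size cap, and the error shapes are illustrative (tool and Any are imported as in the earlier snippet):

from pathlib import Path

workspace_root = Path("/workspace/repo").resolve()  # illustrative; injected at startup in production


@tool(
    "read_file",
    "Read a file from the project workspace.",
    {"path": str},
)
async def read_file_tool(args: dict[str, Any]) -> dict[str, Any]:
    target = (workspace_root / args["path"]).resolve()
    # refuse paths that escape the workspace
    if not target.is_relative_to(workspace_root):
        return {"status": "error", "message": "path outside workspace"}
    if not target.is_file():
        return {"status": "error", "message": f"not a file: {args['path']}"}
    # cap the payload so one giant bundle doesn't blow the context window
    return {"status": "success", "content": target.read_text(errors="replace")[:50_000]}


@tool(
    "list_directory",
    "List files and directories under a workspace path.",
    {"path": str},
)
async def list_directory_tool(args: dict[str, Any]) -> dict[str, Any]:
    target = (workspace_root / args.get("path", ".")).resolve()
    if not target.is_relative_to(workspace_root) or not target.is_dir():
        return {"status": "error", "message": f"not a directory: {args.get('path')}"}
    entries = sorted(p.name + ("/" if p.is_dir() else "") for p in target.iterdir())
    return {"status": "success", "entries": entries}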

Phase 2: Browser exploration

Tools available: browser_navigate, browser_snapshot, browser_click, browser_type, browser_console_messages, etc. (the Playwright in-process MCP).

browser_snapshot is the linchpin. Returning the raw HTML of the page would burn hundreds of thousands of tokens for a single SPA. Instead, the tool returns the accessibility tree as a list of refs:

@tool("browser_snapshot", "Get accessibility tree snapshot ...", {"full_page": bool})
async def browser_snapshot_tool(args):
    snapshot = await page.accessibility.snapshot()
    elements = []
    ref_counter = [0]

    def extract_elements(node, path=""):
        if node is None: return
        role = node.get("role", "")
        if role in {"button", "link", "textbox", "checkbox", "radio",
                    "combobox", "listbox", "option", "menuitem", "tab"}:
            ref_counter[0] += 1
            elements.append({
                "ref": f"e{ref_counter[0]}",
                "role": role,
                "name": node.get("name", ""),
                "path": path,
            })
        for i, child in enumerate(node.get("children", [])):
            extract_elements(child, f"{path}/{i}")

    if snapshot:
        extract_elements(snapshot)

    return {
        "status": "success",
        "elements": elements[:100],
        "element_count": len(elements),
    }

The agent sees something like:

[
  {"ref": "e1", "role": "textbox", "name": "Email"},
  {"ref": "e2", "role": "textbox", "name": "Password"},
  {"ref": "e3", "role": "button", "name": "Sign in"}
]

Then it can issue browser_type {ref: "e1", text: "test@example.com"} and the tool resolves the ref back to the live element. This is the same pattern the official Playwright MCP uses, and it's the right tradeoff: dramatically less token usage, and the agent operates on stable, semantic refs instead of pixel coordinates or fragile CSS.
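
One way to resolve a ref back to a live element is to keep the role and name from the last snapshot and turn them into a role-based locator. A minimal sketch (the last_snapshot cache, which browser_snapshot would populate as it assigns refs, is an illustrative detail rather than the production implementation):

# most recent snapshot, ref -> {"role": ..., "name": ...}
last_snapshot: dict[str, dict[str, str]] = {}


@tool(
    "browser_type",
    "Type text into an element identified by a snapshot ref.",
    {"ref": str, "text": str},
)
async def browser_type_tool(args: dict[str, Any]) -> dict[str, Any]:
    element = last_snapshot.get(args["ref"])
    if element is None:
        return {"status": "error", "message": f"unknown ref {args['ref']}; take a new snapshot"}
    # resolve the ref with a role-based locator, the same signals the snapshot exposed
    locator = page.get_by_role(element["role"], name=element["name"])
    await locator.fill(args["text"])
    return {"status": "success", "ref": args["ref"], "typed": args["text"]}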

Phase 3: Test synthesis

Now the agent has:

  • (from Phase 1) The repo's selector convention (e.g. data-testid="page-{name}-{element}")
  • (from Phase 2) The actual ref/role/name for every element on every page it explored
  • (also from Phase 2) The console-error history during exploration

I hand it all of that and ask for Playwright TypeScript code (a sketch of how those inputs are packed into the synthesis prompt follows the list below). The output is qualitatively different from a one-shot generation:

  • Selectors match the project's convention instead of generic #button
  • Selectors target real DOM that exists, because the agent saw them in browser_snapshot
  • Console-error checkpoints get encoded as expect assertions
  • Steps that produced visible network errors during exploration become explicit await page.waitFor* calls
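
Here is that assembly as a rough sketch, assuming the earlier phases returned JSON-serializable summaries (the variable names and prompt wording are illustrative):

import json


def build_synthesis_prompt(code_summary: dict, explored_pages: list[dict],
                           console_errors: list[dict]) -> str:
    # code_summary: Phase 1 output (detected_pages / detected_routes / detected_forms)
    # explored_pages: per-URL lists of {ref, role, name} from Phase 2 snapshots
    # console_errors: console messages of type "error" captured during exploration
    return "\n\n".join([
        "Generate Playwright TypeScript tests for the flows below.",
        "Use only selectors that match the project's conventions and the explored DOM.",
        "Project analysis:\n" + json.dumps(code_summary, indent=2),
        "Explored pages:\n" + json.dumps(explored_pages, indent=2),
        "Console errors seen during exploration:\n" + json.dumps(console_errors, indent=2),
        "Add expect assertions that fail if any of these console errors reappear.",
    ])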

Lessons after running this in production

Three things that aren't obvious from the diagram:

1. Show the AI both the source code AND the live DOM. Spec-only prompts will hallucinate selectors. Code-analysis-only will guess at runtime behavior. Browser-exploration-only will ignore project conventions. You need both, fed in as separate phases so the agent has enough context to bridge them.

2. In-process MCP beats Docker MCP for a cloud-deployed agent. It's an operational convenience as much as a technical one — you keep the Playwright page object accessible from the same Python process that runs the agent, you can attach event listeners outside tools, and you don't fight Fargate over Docker-in-Docker.

3. Accessibility-tree refs are the right token-economy move. Returning raw HTML in browser_snapshot would 10x the cost per exploration and isn't even the most useful representation. Returning a flat list of {ref, role, name} is faster, cheaper, and aligns the agent's "click this element" mental model with what's stable across DOM changes.

What I'm still working on

The 100-element cap in browser_snapshot is a heuristic — pages with massive product grids overflow it and the agent can miss interactive elements past element 100. I'd like to make it pagination-aware (let the agent ask for "elements 100–200" if it suspects something is missing) without making the prompt longer. Suggestions welcome.
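
One possible shape, not built yet: give browser_snapshot an optional offset so the agent can page through long element lists. Everything below is speculative; collect_interactive_elements stands in for the tree walk shown earlier:

@tool(
    "browser_snapshot",
    "Get accessibility tree snapshot; pass offset to page past the first 100 elements.",
    {"full_page": bool, "offset": int},
)
async def browser_snapshot_tool(args: dict[str, Any]) -> dict[str, Any]:
    offset = int(args.get("offset", 0) or 0)
    elements = await collect_interactive_elements(page)  # same walk as today, factored out
    window = elements[offset:offset + 100]
    return {
        "status": "success",
        "elements": window,
        "element_count": len(elements),
        "offset": offset,
        "has_more": offset + 100 < len(elements),
    }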

The agent is also still bad at flows that span an email round-trip (signup → magic link → continue). MCP-driven email-inbox access could close that gap, but I haven't built it yet.

This pattern ships in a SaaS I've been building solo for the last 6 months: Codens, an AI development suite where 5 specialized agents share a credit pool. The exploratory E2E generator above is the "Blue Codens" service. I'm submitting Codens to Show HN tonight (2026-04-29 23:00 JST). If you're interested in the pricing-transparency angle of the launch, the per-token rate card is on the landing page: the full credit math is published, so customers can compute their own task cost before running anything.

Feedback on the architecture, especially from anyone who's shipped the Claude Agent SDK to production, is very welcome.
