This article is for anyone writing Playwright tests with AI assistance
TL;DR: AI agents using Playwright MCP often write positional selectors like getByRole('button', { name: 'Add to Cart' }).nth(8) or chain parent traversals like locator('..').locator('..') because accessibility snapshots omit non-semantic containers. Verdex is an experimental MCP Server that solves this through progressive disclosure at two layers:
Layer 1 - Three DOM exploration primitives (resolve_container, inspect_pattern, extract_anchors) reveal structural information incrementally, using ~3k tokens per exploration instead of 50k+ token DOM dumps.
Layer 2 - AI assistant configuration (cursor rules, Claude skills, or similar) teaches LLMs how to compose these tools correctly, loading instructions progressively.
Together, these layers enable LLMs to generate stable, role-first, container-scoped selectors without information overload or manual guidance.
Verdex was built to solve two core problems:
• Brittle order-based selectors that break when DOM structure changes due to code refactors, hydration mismatches, or dynamic content loading
• Multi-role test authoring for complex E2E workflows (e.g. admin ↔ user ↔ customer)
Tech: TypeScript, Puppeteer (CDP), isolated JavaScript execution contexts.
Outputs Playwright Code: Verdex is an authoring-time intelligence layer, not a runtime. Verdex's output is pure Playwright code that you own and run anywhere.
GitHub: https://github.com/verdexhq/verdex-mcp
Why did I build Verdex on CDP? Comprehensive answer here.
Article Roadmap
Part 1: The Problem (4 min read)
- Why accessibility snapshots aren't enough for AI test authoring
- The brittle selector problem: .nth(8) and parent traversal chains
- How accessibility trees hide structural anchors by design
Part 2: The Foundation (3 min read)
- Accessibility snapshots with DOM mapping
- How refs become entry points for structural exploration
- The hybrid approach: semantic view + structural queries
Part 3: Progressive Disclosure in Action (8 min read)
- The four-level workflow: overview → containers → patterns → anchors
- Complete product card example with code and token counts
- Why 3k tokens beats 50k+ DOM dumps
Part 4: Teaching LLMs to Compose (5 min read)
- Why tools alone aren't enough (the compositional knowledge gap)
- AI assistant configuration as progressive knowledge delivery
- The parallel architecture: capability + strategy
Part 5: Implementation Deep Dive (6 min read)
- CDP, Puppeteer, and isolated JavaScript execution contexts
- Bridge architecture and the ref-to-DOM mapping
- Multi-role browser isolation for complex E2E flows
Part 6: Lessons and Open Questions (4 min read)
- Design principles: facts not interpretations, bounded output, composition
- Limitations and trade-offs
- What we still don't know about primitive decomposition
The Architecture in Brief
Before diving into the details, it's important to understand the design principle that runs through every part of the system: progressive disclosure.
Progressive disclosure is the practice of revealing information incrementally—showing only what's needed for the next decision instead of overwhelming with everything upfront. Verdex applies this principle at two critical layers:
Layer 1: Progressive Structure Discovery (DOM Exploration): Instead of dumping the entire DOM (50k+ tokens), Verdex reveals structure incrementally through three primitives: resolve_container → inspect_pattern → extract_anchors. Each returns targeted structural facts (~3k tokens), allowing LLMs to build understanding step-by-step without information overload.
Layer 2: Progressive Knowledge Loading (AI Configuration): Instead of loading all guidance upfront, AI assistant configuration files (cursor rules, Claude skills, or similar) reveal instructions incrementally: metadata (always loaded, ~100 tokens) → core instructions (when triggered, ~5k tokens) → detailed guides (as needed, ~15k+ tokens). This teaches LLMs how to compose the tools correctly while keeping token costs low.
Both layers solve the same fundamental problem: information overload degrades LLM performance. Whether it's raw DOM data or procedural knowledge, dumping everything at once reduces accuracy and increases costs. Progressive disclosure keeps LLMs focused on what matters for each specific task.
This article explains how both layers work together to enable LLMs to generate stable, role-first, container-scoped Playwright selectors.
Part 1: The Problem
1.1 Brittle Selectors and Missing Structure
Nine months ago, I started building a comprehensive Playwright test suite for a large SaaS application (~500k LOC, thousands of components). I used the standard tooling: Playwright's codegen, inspector, and when it was released, the Playwright MCP server for AI-assisted test authoring.
As much as I initially loved Playwright MCP, I noticed I was spending a lot of time context switching to fix selectors that coding agents generated, which consistently looked like this:
getByRole('button', { name: 'Add to Cart' }).nth(8)
Instead of what I actually needed—a role-first selector that's properly scoped to the right container:
getByTestId("product-card")
.filter({ hasText: "iPhone 15 Pro" })
.getByRole("button", { name: "Add to Cart" })
This follows Playwright's documented best practices: prioritize user-facing attributes (roles), scope to stable containers (test IDs or landmarks), and use content filters for disambiguation.
The first selector is brittle—if the order of page elements changes at all (e.g., due to refactors, client-side hydration differences, or dynamic content loading), the test breaks.
Initially, this seemed like a prompting problem. But after digging into the Playwright source code, I realized it wasn't. It's a structural information problem: the accessibility tree focuses on semantically meaningful nodes. Generic structural containers—<div>s used purely for layout, even those with data-testid or class names—are either collapsed away or appear as bare "generic" nodes stripped of their structural attributes. This keeps the view clean and standards-compliant, but it removes the structural anchors agents need for scoping.
1.2 Accessibility-First APIs
Modern test tooling (Playwright, Testing Library) intentionally exposes an accessibility-focused view of the DOM. This is brilliant for helping humans and LLMs retain a semantic view of the page:
- You think like a user, not a DOM archaeologist
- Tests stay resilient to implementation changes
- Forces you to build accessible UIs
Here's what an example accessibility snapshot shows for product cards:
generic
  heading "iPhone 15 Pro"
  button "Add to Cart" [ref=e3]
generic
  heading "MacBook Pro"
  button "Add to Cart" [ref=e4]
This tells you the elements are semantically grouped, but that single generic container could represent any of these actual DOM structures:
<!-- Possibility 1: Simple flat structure -->
<div data-testid="product-card">
<h3>iPhone 15 Pro</h3>
<button>Add to Cart</button>
</div>
<!-- Possibility 2: Deep nesting with component boundary at top -->
<div data-testid="product-card">
<div class="card-header">
<div class="title-wrapper">
<h3>iPhone 15 Pro</h3>
</div>
</div>
<div class="card-body">
<div class="actions">
<button>Add to Cart</button>
</div>
</div>
</div>
<!-- Possibility 3: Multiple potential component boundaries -->
<article data-testid="product-card">
<header class="product-header">
<h3>iPhone 15 Pro</h3>
</header>
<footer class="product-footer">
<button>Add to Cart</button>
</footer>
</article>
How normalization hides structural anchors:
The accessibility tree intentionally collapses all three structures into the same generic node because:
- Elements without semantic ARIA roles (<div>, <article>, <header>, <footer>) → generic
- Non-ARIA attributes (data-testid, class, id) → stripped from the tree
- Layout wrapper elements → flattened away
This normalization is correct per W3C accessibility specifications—it ensures screen readers and accessibility tools see a clean semantic structure. But it can also remove the structural anchors that test authors need for writing stable selectors.
This isn't a matter of prompt engineering or model capability. It's information-theoretically impossible to generate getByTestId("product-card") when data-testid="product-card" is not present in the input. The accessibility tree—by design and specification—omits these attributes.
The Playwright team would reasonably argue that the solution is refactoring your HTML to be more semantic—ensuring proper ARIA roles and accessible structure. While this is the ideal approach, it's often impossible or impractical in real-world scenarios: legacy codebases with thousands of components, third-party dependencies you don't control, resource constraints that prioritize features over refactoring, and tight timelines. Verdex bridges the gap between best practice and current reality, enabling stable test authoring against imperfect DOM structures.
1.3 The Gap: Component Boundaries Aren't Visible
Without seeing the actual DOM structure, LLMs cannot determine:
Where component boundaries exist: Is the stable container one level up? Four levels up? At different levels for heading vs button?
What anchors are available: Does that boundary have a data-testid, id, or semantic role? Or is it a bare <div>?
How to scope reliably: Which scoping patterns will actually work for this structure?
This creates a difficult situation for AI agents:
// Option 1: Try parent traversal
page.getByRole('heading', { name: 'iPhone 15 Pro' })
.locator('..').locator('..') // How many levels?
.getByRole('button', { name: 'Add to Cart' })
// Option 2: Use nth() as a safe fallback
page.getByRole('button', { name: 'Add to Cart' }).nth(8)
The LLM often falls back to nth() because it's deterministic when component boundaries aren't visible.
1.4 Failed Fix: Dumping the DOM
My first attempt was straightforward: give the LLM the full DOM and let it figure it out.
// Serialize the entire DOM
const domTree = document.documentElement.outerHTML;
// Send to LLM (~10k-100k tokens for a complex page)
This failed for two reasons:
Token Cost
A complex page in the production application I was testing consumed 50k+ tokens per query. This meant:
- Each selector exploration required 2-3 queries (finding element → validating → refining)
- That's 150k+ tokens per selector
In practice (in my environment), those multi-query explorations pushed both cost and latency up significantly on complex pages.
Signal-to-Noise Ratio
Most of the DOM is irrelevant for any given selector. The LLM had to wade through:
- Styling markup
- Script tags
- Hidden UI elements
- Unrelated page sections
Even when using strong coding models, accuracy degraded when working with massive context windows. The model would hallucinate elements or generate selectors based on irrelevant DOM sections.
I needed a better approach: a way to progressively reveal structural information without overwhelming the LLM with 50k+ token DOM dumps.
The key insight was that I didn't need to choose between "clean but incomplete accessibility tree" or "complete but overwhelming DOM dump." There was a third option—but first, I needed to build the right foundation.
Part 2: The Foundation
2.1 Accessibility Snapshots with DOM Mapping
Before diving into how Verdex enables progressive structural exploration, it's important to understand the technical foundation that makes it possible.
Verdex generates an accessibility tree similar to Playwright MCP's snapshots—showing the semantic structure of the page with roles, names, and ARIA properties. Like Playwright MCP, every interactive element gets a reference ID (e.g., e1, e2) that maps to its underlying DOM node.
The key difference:
- Playwright MCP treats refs as opaque handles for direct interactions—click, type, and evaluate operations.
- Verdex uses those same refs as entry points for structural exploration, exposing DOM hierarchy queries that enable LLMs to discover container boundaries, identify repeating patterns, and locate unique anchors.
Here's what a Verdex snapshot looks like:
navigation
  link "Home" [ref=e1]
  link "Products" [ref=e2]
main
  heading "Featured Products"
  generic
    heading "iPhone 15 Pro"
    button "Add to Cart" [ref=e3]
  generic
    heading "MacBook Pro"
    button "Add to Cart" [ref=e4]
The snapshot presents the clean semantic structure visible to users, excluding implementation details such as data-testid attributes or class names. This keeps the view focused on semantically meaningful interactive elements.
When structural context is needed for selector construction, Verdex's DOM exploration tools become relevant. Using the same e3 reference, you can inspect the corresponding DOM structure to trace elements back to their rendered source.
Calling resolve_container(e3) returns:
Level 1 (div):
Attributes: {"data-testid": "product-card"}
Level 2 (div):
Attributes: {"data-testid": "product-grid"}
Level 3 (section):
Attributes: {"class": "products"}
This hybrid approach gives you the best of both worlds:
From accessibility snapshots: Clean semantic view of the page that mirrors how users (and screen readers) experience it. No noise from implementation details.
From structural exploration tools: The ability to efficiently explore DOM relationships using the stored mapping. When you see button "Add to Cart" [ref=e3], the LLM can immediately call resolve_container(e3) to discover it's inside a product-card container.
Like Playwright MCP, the mapping happens through a bridge running in an isolated JavaScript execution context. During snapshot generation, the bridge:
- Traverses the DOM following W3C ARIA specifications
- Identifies which elements contribute to the accessibility tree
- Assigns reference IDs to interactive elements
- Stores the mapping in a Map<string, ElementInfo>, where each entry contains both the element reference and its DOM node (a minimal sketch of this shape follows below)
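As a rough illustration of that mapping, here's a minimal TypeScript sketch. The ElementInfo shape and the registerElement helper are assumptions based on the description above, not the actual Verdex types:
// Hypothetical sketch of the ref-to-DOM mapping (field names are illustrative).
interface ElementInfo {
  element: Element;                    // live DOM node reference
  tagName: string;                     // e.g. "BUTTON"
  role: string | null;                 // ARIA role (simplified here)
  name: string | null;                 // accessible name (simplified here)
  attributes: Record<string, string>;  // data-testid, class, id, ...
}
// One map per snapshot: "e1" -> ElementInfo, "e2" -> ElementInfo, ...
const elements = new Map<string, ElementInfo>();
function registerElement(counter: number, element: Element): string {
  const ref = `e${counter}`;
  elements.set(ref, {
    element,
    tagName: element.tagName,
    role: element.getAttribute("role"),       // real code computes implicit roles per ARIA
    name: element.getAttribute("aria-label"), // real code follows the accessible-name algorithm
    attributes: Object.fromEntries(
      Array.from(element.attributes).map((a) => [a.name, a.value])
    ),
  });
  return ref;
}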
Verdex's innovation is exposing structural exploration primitives on top of this foundation, rather than limiting refs to direct element interactions.
2.2 Token Efficiency Example
Without DOM mapping, you'd need to send the full DOM context to identify containers:
<div class="app">
<section class="products">
<div data-testid="product-grid">
<div data-testid="product-card">
<h3>iPhone 15 Pro</h3>
<span class="price">$999</span>
<button>Add to Cart</button> <!-- Which button? -->
With Verdex's mapping, you get exactly what you need:
Level 1 (div):
Attributes: {"data-testid": "product-card"}
Level 2 (div):
Attributes: {"data-testid": "product-grid"}
Level 3 (section):
Attributes: {"class": "products"}
With this foundation in place—accessibility snapshots that provide clean semantic views AND ref-to-DOM mappings that enable structural queries—Verdex can expose three primitives that use these refs as entry points for progressive exploration.
Part 3: Progressive Disclosure in Action
3.1 The Verdex Approach: Progressive Disclosure of DOM Structure
Building on the accessibility snapshot foundation, Verdex implements progressive disclosure through three exploration primitives that compose into a disclosure hierarchy.
Instead of forcing a binary choice between "clean but incomplete accessibility tree" or "complete but overwhelming DOM dump," Verdex provides a third option: start with the semantic overview, then progressively disclose structural details through targeted queries.
Each primitive uses the ref IDs from accessibility snapshots as entry points, then returns just enough structural information for the LLM to decide whether it needs more detail—keeping total exploration under 3k tokens instead of 50k+ for full DOM dumps.
The primitives return deterministic, low-level structural facts that the model can compose to keep role-first selectors properly container-scoped and resilient to layout or hydration changes.
Here's the disclosure hierarchy:
Level 1: Semantic Overview (browser_snapshot): Shows what users see: roles, names, interactive elements. Hides implementation details like class names and test IDs.
Level 2: Structural Containers (resolve_container): Reveals the containment hierarchy: which divs, sections, or articles wrap the target element, along with their stable attributes.
Level 3: Repeating Patterns (inspect_pattern): Exposes sibling elements at a specific depth to identify lists, grids, or card layouts—making component boundaries deducible.
Level 4: Unique Anchors (extract_anchors): Mines distinctive content within a container: headings, labels, test IDs, or unique text that can differentiate one card from another.
Let's examine each primitive in detail:
3.2 resolve_container(ref)
Returns the full ancestor chain (up to document.body) with minimal metadata the model can use to pick stable scoping containers—such as data-testid, id, semantic landmarks (main, nav, article, section), and class-name patterns.
Example output:
{
"target": { "ref": "e3", "tagName": "button", "text": "Add to Cart" },
"ancestors": [
{ "level": 1, "tagName": "div", "attributes": { "data-testid": "product-card" }, "childElements": 5, "containsRefs": [] },
{ "level": 2, "tagName": "div", "attributes": { "class": "product-grid" }, "childElements": 12, "containsRefs": [] },
{ "level": 3, "tagName": "section", "attributes": { "id": "app-content" }, "childElements": 3, "containsRefs": [] }
]
}
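A minimal sketch of what a bridge-side implementation of this walk might look like, assuming it runs in the browser and starts from the DOM node stored for the ref (the real tool adds ref detection, attribute filtering, and tighter limits):
// Illustrative ancestor walk (not the actual Verdex source).
interface AncestorFact {
  level: number;
  tagName: string;
  attributes: Record<string, string>;
  childElements: number;
}
function resolveContainer(target: Element, maxLevels = 10): AncestorFact[] {
  const ancestors: AncestorFact[] = [];
  let node = target.parentElement;
  let level = 1;
  while (node && node !== document.body && level <= maxLevels) {
    ancestors.push({
      level,
      tagName: node.tagName.toLowerCase(),
      // Keep only attributes that are useful as scoping anchors.
      attributes: Object.fromEntries(
        Array.from(node.attributes)
          .filter((a) => ["data-testid", "id", "class", "role"].includes(a.name))
          .map((a) => [a.name, a.value])
      ),
      childElements: node.childElementCount,
    });
    node = node.parentElement;
    level++;
  }
  return ancestors;
}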
3.3 inspect_pattern(ref, level)
Climbs level ancestors from the target to identify a container, then returns that container's children. This makes repeating patterns at that depth visible—useful for spotting lists, grids, cards, or table rows.
If level climbs past the root (document.body), the call returns null.
The response also includes a targetSiblingIndex (which direct child of the container contains the target) and an intentionally shallow, token-cheap outline for each sibling (e.g., top-level headings, buttons, and test IDs), which helps the model choose the right level for the next step.
Example output:
{
"ancestorLevel": 1,
"containerAt": {
"tagName": "div",
"attributes": { "data-testid": "product-grid" }
},
"targetSiblingIndex": 1,
"siblings": [
{
"index": 0,
"tagName": "div",
"attributes": { "data-testid": "product-card" },
"containsText": ["iPhone 15 Pro", "$999", "Add to Cart"],
"outline": [
{ "tag": "h3", "text": "iPhone 15 Pro" },
{ "role": "button", "text": "Add to Cart" }
]
},
{
"index": 1,
"tagName": "div",
"attributes": { "data-testid": "product-card" },
"containsText": ["MacBook Pro", "$1,999", "Add to Cart"],
"outline": [
{ "tag": "h3", "text": "MacBook Pro" },
{ "role": "button", "text": "Add to Cart" }
]
}
]
}
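As a sketch of the behavior described above (climb level ancestors, then list that container's children with a shallow outline), here's what a simplified bridge-side version might look like; the heuristics and output shape are assumptions, not the Verdex source:
// Illustrative pattern inspection (simplified).
function inspectPattern(target: Element, level: number) {
  // Climb `level` ancestors to find the container; bail out past document.body.
  let container: Element = target;
  for (let i = 0; i < level; i++) {
    const parent = container.parentElement;
    if (!parent || parent === document.documentElement) return null; // walked past the root
    container = parent;
  }
  const children = Array.from(container.children);
  return {
    ancestorLevel: level,
    containerAt: { tagName: container.tagName.toLowerCase() },
    targetSiblingIndex: children.findIndex((c) => c.contains(target)),
    siblings: children.map((child, index) => ({
      index,
      tagName: child.tagName.toLowerCase(),
      attributes: child.hasAttribute("data-testid")
        ? { "data-testid": child.getAttribute("data-testid")! }
        : {},
      // Shallow, token-cheap outline: headings and buttons only.
      outline: Array.from(child.querySelectorAll("h1,h2,h3,button")).map((el) => ({
        tag: el.tagName.toLowerCase(),
        text: el.textContent?.trim() ?? "",
      })),
    })),
  };
}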
3.4 extract_anchors(ref, level)
Performs a bounded deep scan at the chosen level to surface unique anchors when the shallow outline from inspect_pattern isn't sufficient:
- Headings (h1-h6)
- Labels
- Meaningful text nodes
- Interactive elements with names
- Elements with data-testid
Example output:
{
"descendants": [
{"tag": "h3", "text": "iPhone 15 Pro", "depth": 1},
{"tag": "button", "attrs": {"aria-label": "Add to Cart"}, "depth": 3},
{"tag": "span", "attrs": {"data-testid": "price"}, "text": "$999", "depth": 2}
]
}
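Conceptually, the scan is a depth- and count-bounded walk that collects anchor candidates. Here's a simplified TypeScript sketch; the filtering heuristics and limits are assumptions, not the Verdex source:
// Illustrative bounded deep scan for anchor candidates.
interface AnchorFact {
  tag: string;
  depth: number;
  text?: string;
  attrs?: Record<string, string>;
}
function extractAnchors(container: Element, maxDepth = 4, maxResults = 50): AnchorFact[] {
  const anchors: AnchorFact[] = [];
  const walk = (node: Element, depth: number) => {
    if (depth > maxDepth || anchors.length >= maxResults) return;
    const tag = node.tagName.toLowerCase();
    const isHeading = /^h[1-6]$/.test(tag);
    const isLabel = tag === "label";
    const hasTestId = node.hasAttribute("data-testid");
    const isNamedInteractive =
      (tag === "button" || tag === "a" || tag === "input") &&
      Boolean(node.textContent?.trim() || node.getAttribute("aria-label"));
    if (isHeading || isLabel || hasTestId || isNamedInteractive) {
      anchors.push({
        tag,
        depth,
        text: node.textContent?.trim() || undefined,
        attrs: hasTestId ? { "data-testid": node.getAttribute("data-testid")! } : undefined,
      });
    }
    for (const child of Array.from(node.children)) walk(child, depth + 1);
  };
  for (const child of Array.from(container.children)) walk(child, 1);
  return anchors;
}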
3.5 How the LLM Uses These Primitives: Progressive Disclosure in Action
Here's how progressive disclosure works in practice. The LLM starts with minimal information and requests more only when needed:
Disclosure Level 1: Initial Snapshot
LLM calls browser_snapshot which contains a target element that we want to container scope. In this case we want to scope the "Add to Cart" button for the iPhone 15 Pro specifically:
navigation
  link "Home" [ref=e1]
  link "Products" [ref=e2]
main
  heading "Featured Products"
  generic
    heading "iPhone 15 Pro"
    button "Add to Cart" [ref=e3] // <-- Target Button
  generic
    heading "MacBook Pro"
    button "Add to Cart" [ref=e4]
At this point, the LLM sees there are multiple "Add to Cart" buttons but doesn't know their DOM structure. The accessibility tree shows they're in separate generic containers, but that could represent any of dozens of actual HTML structures. The LLM needs more information—but only about the specific button it cares about.
Disclosure Level 2: Container Discovery
The LLM calls resolve_container(e3) to reveal the structural containers around the target button:
{
"target": { "ref": "e3", "tagName": "button", "text": "Add to Cart" },
"ancestors": [
{ "level": 1, "tagName": "div", "attributes": { "data-testid": "product-card" } },
{ "level": 2, "tagName": "div", "attributes": { "data-testid": "product-grid" } },
{ "level": 3, "tagName": "section", "attributes": { "class": "products" } }
]
}
Now the LLM knows the button lives inside nested divs with data-testid="product-card" at level 1 and data-testid="product-grid" at level 2. But it doesn't know if this is one card or many. Are there other product cards? The LLM needs to check siblings.
Disclosure Level 3: Pattern Recognition
The LLM calls inspect_pattern(e3, 2) to see if there are multiple cards at the grid level:
{
"ancestorLevel": 2,
"containerAt": {
"tagName": "div",
"attributes": { "data-testid": "product-grid" }
},
"targetSiblingIndex": 1,
"siblings": [
{
"index": 0,
"tagName": "div",
"attributes": { "data-testid": "product-card" },
"containsText": ["MacBook Air", "$1,499", "Add to Cart"],
"outline": [
{ "tag": "h3", "text": "MacBook Air" },
{ "role": "button", "text": "Add to Cart" }
]
},
{
"index": 1,
"tagName": "div",
"attributes": { "data-testid": "product-card" },
"containsText": ["iPhone 15 Pro", "$999", "Add to Cart"],
"outline": [
{ "tag": "h3", "text": "iPhone 15 Pro" },
{ "role": "button", "text": "Add to Cart" }
]
}
// ... ~10 more cards
]
}
Confirmed: this is a repeating grid of product cards. The siblings array shows ~12 cards, each with a heading and an "Add to Cart" button. But which card has the iPhone 15 Pro? The shallow outline shows "iPhone 15 Pro" text exists in the target's card, but the LLM needs to verify it can use this as a unique anchor.
Disclosure Level 4: Unique Identification
The LLM calls extract_anchors(e3, 1) to mine unique anchors within the target's card:
{
"ancestorAt": {
"level": 1,
"tagName": "div",
"attributes": { "data-testid": "product-card" }
},
"descendants": [
{
"depth": 1,
"index": 0,
"tagName": "h3",
"attributes": {},
"fullText": "iPhone 15 Pro"
},
{
"depth": 1,
"index": 1,
"tagName": "span",
"attributes": { "class": "price" },
"directText": "$999"
},
{
"depth": 1,
"index": 2,
"tagName": "button",
"attributes": {},
"fullText": "Add to Cart"
}
]
}
Perfect. The LLM now has exactly what it needs:
- A stable container: data-testid="product-card" (from ancestors)
- A unique identifier: heading with text "iPhone 15 Pro" (from descendants)
- The target action: button with role and name "Add to Cart" (from initial snapshot)
Total token cost: ~2,100 tokens instead of 50,000+ for a full DOM dump.
Workflow Complete: Generate Scoped Selector
With this workflow complete, the LLM generates:
getByTestId("product-card")
.filter({ hasText: "iPhone 15 Pro" })
.getByRole("button", { name: "Add to Cart" })
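For context, here's how that generated selector might sit inside an ordinary Playwright test. The URL, the cart-count test ID, and the assertion are illustrative assumptions, not part of Verdex's output:
import { test, expect } from "@playwright/test";
test("adds the iPhone 15 Pro to the cart", async ({ page }) => {
  await page.goto("https://shop.example.com/products"); // hypothetical URL
  // The container-scoped selector produced by the exploration workflow above.
  const iphoneCard = page
    .getByTestId("product-card")
    .filter({ hasText: "iPhone 15 Pro" });
  await iphoneCard.getByRole("button", { name: "Add to Cart" }).click();
  // Example assertion; the real check depends on your application.
  await expect(page.getByTestId("cart-count")).toHaveText("1");
});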
Why progressive disclosure made this possible:
- Focused exploration: Each query targeted a specific structural question (containers? repeating? unique anchors?) instead of processing everything upfront
- Token efficiency: ~2k tokens of high-signal structural facts vs 50,000+ tokens of noisy HTML
- Compositional reasoning: The LLM built up understanding incrementally, making decisions about what to explore next based on what it learned
- Deterministic output: Each primitive returns raw structural facts, not interpretations, so the LLM can compose them contextually
This workflow works because each disclosure level revealed just enough information for the LLM to decide what question to ask next—not so much that it got lost in noise, but not so little that it had to guess.
3.6 Why Progressive Disclosure Matters for Agent Tooling
Progressive disclosure isn't just about token efficiency—it's about cognitive architecture for LLMs.
Information Overload Degrades Performance
When given massive context windows, LLMs exhibit known attention degradation effects—important details buried in noise get ignored or misinterpreted.
The latest research shows "performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts."
In testing Verdex, I found consistent patterns:
Full DOM dumps (50k+ tokens) created systematic problems: LLMs would hallucinate elements that didn't exist, selectors referenced unrelated page sections, the model would "anchor" on early DOM sections and miss better containers later, and accuracy degraded even with strong/expensive models.
Progressive exploration (0.5-3k tokens per step) eliminated most of these issues. LLMs maintained focus on relevant structural facts, selectors consistently used the most stable available anchors, the model's reasoning was more coherent—each step built logically on the previous one—and weaker models performed significantly better with progressive disclosure than they did with full dumps.
This created an interesting challenge: the tools provided the right structural data, but LLMs still struggled to compose them correctly. Models would call resolve_container but ignore the data-testid in the response, skip inspect_pattern and generate positional selectors, or call tools in the wrong order.
The problem wasn't the tools—it was compositional knowledge. LLMs needed to learn the 3-step workflow, how to interpret responses, and why certain selector patterns are stable. This led to exploring AI assistant configuration as Layer 2 of the progressive disclosure architecture—a topic covered in depth in the "Teaching LLMs to Compose" section below.
Composability Enables Generalization
By breaking DOM exploration into primitive operations, the LLM can adapt patterns to new scenarios: The same container → pattern → anchor sequence works across radically different UI structures. The LLM doesn't need to see examples of every possible DOM layout—it learns the exploration pattern itself.
This also enables incremental debugging. When a selector fails, the LLM can replay just the exploration steps that matter. Instead of "let me dump the entire DOM again," it can ask "let me check the ancestors of this element to see if the container changed."
Few-shot demonstrations of the exploration pattern transfer better than monolithic "here's everything" examples. I can show the LLM how to explore a product card, and it generalizes to user cards, settings panels, and data tables without additional examples.
3.7 Works Without Test IDs or Class Names
A common concern is: "What if my app doesn't have data-testids or semantic class names?" Verdex's primitives work on pure HTML. The system relies on the DOM's inherent structure (parent/child relationships, tag names, positions) and visible text. Test IDs and class names improve stability, but they're not required.
Example: classless markup
<section>
<div>
<h3>iPhone 15 Pro</h3>
<span>$999</span>
<button>Add to Cart</button>
</div>
<div>
<h3>MacBook Pro</h3>
<span>$1,999</span>
<button>Add to Cart</button>
</div>
</section>
Step 1 — resolve_container
resolve_container("e3") // e3 = the "Add to Cart" button
{
"target": { "ref": "e3", "tagName": "button", "text": "Add to Cart" },
"ancestors": [
{ "level": 1, "tagName": "div", "attributes": {}, "childElements": 3, "containsRefs": [] },
{ "level": 2, "tagName": "section", "attributes": {}, "childElements": 2, "containsRefs": [] }
]
}
Step 2 — inspect_pattern at the section level
inspect_pattern("e3", 2)
{
"ancestorLevel": 2,
"containerAt": { "tagName": "section", "attributes": {} },
"targetSiblingIndex": 1,
"siblings": [
{
"index": 0,
"tagName": "div",
"attributes": {},
"containsText": ["iPhone 15 Pro", "$999", "Add to Cart"],
"outline": [
{ "tag": "h3", "text": "iPhone 15 Pro" },
{ "role": "button", "text": "Add to Cart" }
]
},
{
"index": 1,
"tagName": "div",
"attributes": {},
"containsText": ["MacBook Pro", "$1,999", "Add to Cart"],
"outline": [
{ "tag": "h3", "text": "MacBook Pro" },
{ "role": "button", "text": "Add to Cart" }
]
}
]
}
Step 3 — generate a scoped selector
page.locator("section > div")
.filter({ hasText: "MacBook Pro" })
.getByRole("button", { name: "Add to Cart" })
Why this works:
- Hierarchy: ancestors reveal the unit container (here, section > div).
- Position: targetSiblingIndex shows which sibling holds the target.
- Content: outline/containsText provide unique anchors ("MacBook Pro").
- Repetition: the sibling listing confirms the repeating structure.
Trade-off: structure-only selectors are less resilient than selectors anchored on data-testids or class names. But even with zero testing infrastructure, Verdex still enables AI agents to write deterministic, container-scoped selectors.
Part 4: Teaching LLMs to Compose
4.1 The Missing Piece
Progressive disclosure solved the DOM data problem—but the tools alone weren't enough.
When I first released Verdex, I expected LLMs to naturally discover the 3-step workflow: resolve_container → find containers, inspect_pattern → check patterns, extract_anchors → mine unique content. After all, the tool descriptions explain what each does.
They didn't.
Strong models like Claude Sonnet would:
- Call resolve_container but ignore the data-testid in the response
- Skip inspect_pattern entirely and generate positional .nth() selectors
- Generate technically correct but brittle selectors
The tools provided the capability to understand structure. What was missing was the knowledge to compose them correctly.
4.2 The Compositional Knowledge Gap
LLMs needed to learn:
- When to use each primitive (the 3-step sequence)
- How to interpret the responses (what structural facts matter)
- Why certain patterns are stable (role-first, container-scoped)
- What to do with edge cases (no test IDs, flaky selectors, dynamic content)
This isn't something tool descriptions can teach. Tool descriptions say what a tool does and what it returns; they don't teach workflows, best practices, or compositional patterns.
4.3 The Cursor Rules Approach
My first solution was comprehensive cursor rules: a 400-line guide teaching the workflow, best practices, selector patterns, and debugging strategies.
This approach works well and many teams successfully use similar patterns today. It provides clear, consistent guidance and ensures the LLM has access to all necessary knowledge upfront.
However, I noticed some practical limitations in my workflow:
- Token overhead: 12k tokens loaded in every conversation, regardless of task complexity
- Lost in the middle: With large context windows, relevant instructions could be overlooked when buried in the middle of extensive documentation
- No progressive revelation: The entire guide loaded at once, even for simple queries that only needed a subset of the guidance
- Session-dependent: Context needed to be re-established across different tools and sessions
The cursor rules approach is solid and functional—it successfully taught LLMs the exploration workflow and delivered good results. However, I recognized an opportunity for improvement: I'd built progressive discovery for DOM exploration, but was using static, upfront knowledge delivery for teaching LLMs how to use it.
This led me to explore whether AI assistant configuration could achieve similar teaching outcomes while addressing the token efficiency and context window challenges I'd observed.
4.4 AI Assistant Configuration: Progressive Disclosure for Knowledge
Modern AI assistants support various configuration mechanisms—cursor rules for Cursor IDE, Claude skills for Claude-based tools, or similar approaches for other platforms. These can implement the same progressive disclosure pattern Verdex uses for DOM exploration.
Level 1: Metadata (~100 tokens, always loaded)
---
name: verdex-playwright-authoring
description: Write robust, container-scoped Playwright selectors using
progressive DOM exploration with Verdex MCP tools (resolve_container,
inspect_pattern, extract_anchors). Use when authoring Playwright tests,
creating selectors, exploring page structure, or debugging test failures.
---
The AI assistant knows the configuration exists and when to trigger it. Token cost: ~100 tokens (negligible).
Level 2: Instructions (~5k tokens, loaded when triggered)
The main configuration file contains:
- The 3-step exploration workflow (containers → patterns → anchors)
- Selector composition patterns (test IDs → roles → content filters → structure)
- Best practices (container-scoped, not positional)
- Complete example: product card exploration → stable selector generation
Only loaded when the user asks about selectors or Playwright tests.
Level 3: Resources (~15k tokens each, loaded as needed)
Additional reference files the AI can consult when specific guidance is needed:
- playwright-patterns.md - Test structure, assertions, authentication patterns (15k tokens)
- selector-writing.md - Container → Content → Role pattern, avoiding anti-patterns (15k tokens)
- workflow-discovery.md - Page exploration techniques, workflow mapping (18k tokens)
These are only loaded when the AI determines they're relevant to the specific query—"How do I debug a flaky selector?" triggers the debugging section, but a simple selector request doesn't.
4.5 The Parallel Architecture
This creates symmetry:
| Layer | Mechanism | Token Cost | What It Reveals |
|---|---|---|---|
| DOM Exploration | resolve_container → inspect_pattern → extract_anchors | 3k per exploration | Structural facts: containers, patterns, unique anchors |
| Knowledge Delivery | Metadata → Instructions → Resources | 100 → 5k → 15k+ | Workflow, composition patterns, debugging strategies |
Both use progressive disclosure to prevent information overload. Both reveal information only when relevant. Both enable token-efficient iteration.
The tools give LLMs the capability to understand DOM structure.
AI assistant configuration gives LLMs the knowledge to compose it into stable selectors.
4.6 The Workflow with Configuration
Before (without configuration):
User: "Help me click the Add to Cart button for iPhone 15 Pro"
AI: *has no context about Verdex workflow*
User: *manually explains 3-step process or pastes 400-line cursor rules*
AI: *attempts exploration but misinterprets responses*
User: *corrects approach, explains why container scoping matters*
AI: *generates selector on second or third attempt*
After (with configuration):
User: "Help me click the Add to Cart button for iPhone 15 Pro"
AI: *configuration triggers → loads relevant instructions (~5k tokens)*
AI: "Let me explore the structure progressively..."
→ resolve_container(ref) → finds data-testid="product-card"
→ inspect_pattern(ref, 2) → sees 12 similar cards
→ extract_anchors(ref, 1) → finds "iPhone 15 Pro" heading
AI: "Here's a stable, container-scoped selector:"
page.getByTestId("product-card")
.filter({ hasText: "iPhone 15 Pro" })
.getByRole("button", { name: "Add to Cart" })
No manual guidance needed. The configuration teaches the composition pattern automatically when relevant.
4.7 Token Efficiency Comparison
| Approach | Standing Cost | Per Query | Quality |
|---|---|---|---|
| No guidance | 0 tokens | 0 tokens | ❌ Generates .nth(8) brittle selectors |
| Monolithic rules | 12k tokens | 12k tokens | ⚠️ Works but always loaded |
| Progressive configuration | 100 tokens | 100-5k tokens | ✅ Efficient, automatic, progressive |
| Full DOM dump | 0 tokens | 50k+ tokens | ❌ Information overload |
Progressive configuration reduces standing token cost from 12k to 100 while dramatically improving selector quality.
4.8 Why This Matters: Verdex as a Complete Solution
Verdex is trying to be a complete solution:
- MCP Tools provide raw structural data efficiently (progressive DOM disclosure)
- AI assistant configuration provides compositional knowledge efficiently (progressive knowledge disclosure)
- Together they enable LLMs to generate stable, role-first, container-scoped Playwright selectors
Without configuration, you'd need to:
- Manually explain the workflow in every conversation
- Hope the LLM remembers best practices
- Debug why it generated positional selectors
- Re-teach patterns across sessions
With proper configuration, the knowledge is:
- Automatic: Triggered by relevant queries
- Progressive: Only loads what's needed
- Portable: Works across different AI assistants with appropriate setup
- Versioned: Update the configuration, everyone gets improvements
Part 5: Implementation Deep Dive
5.1 Technical Architecture Overview
The previous sections explained what Verdex does and why it works—the progressive disclosure architecture, the three exploration primitives, and the AI configuration teaching layer. This section examines how it's actually built.
5.2 Technical Architecture: CDP, Puppeteer, and Isolated Worlds
Verdex is built on three core architectural choices:
1. CDP-first (Chrome-only) via Puppeteer
Verdex is a development-time, CDP-first helper—not a test runner. Puppeteer provides direct access to Chrome DevTools Protocol primitives (e.g., Page.createIsolatedWorld, Runtime.callFunctionOn, frame lifecycle hooks). This keeps the implementation straightforward and the mental model simple—building on a CDP client (Puppeteer) rather than a multi-browser framework (Playwright) reduces unnecessary abstraction layers and keeps the codebase focused.
Key CDP Benefits:
Isolated JavaScript Execution — Bridge code runs in a separate execution context invisible to application scripts, preventing interference from the page's own JavaScript (e.g., overwritten globals, monkey-patched DOM APIs, or aggressive framework behavior). This isolation ensures DOM analysis remains reliable regardless of how the application manipulates its environment.
In-Browser Structural Analysis — Complex queries (ancestor chains, sibling patterns, descendant traversal) execute entirely within the browser's JavaScript engine in a single execution context. The bridge performs multi-step DOM analysis and returns only minimal structural facts, eliminating the protocol round-trips required when orchestrating similar analysis through multiple CDP commands. This architecture also ensures all analysis reflects a consistent DOM state, preventing race conditions from dynamic content updates.
Navigation Persistence — The bridge survives page navigations through CDP's frame lifecycle hooks (Page.frameNavigated), maintaining analysis capabilities across single-page app transitions without manual re-injection. This enables LLM workflows that incrementally explore structure across multiple pages—the agent can navigate, snapshot, explore ancestors, navigate again—without losing the analysis infrastructure.
Controlled Serialization Boundary — Only final structural facts (container attributes, sibling patterns, unique anchors) cross the CDP protocol boundary. Instead of retrieving entire DOM subtrees and filtering in Node.js, the bridge performs targeted in-browser analysis and returns only what's needed for selector construction.
Creating an isolated world:
// Create an isolated world
const { executionContextId } = await client.send('Page.createIsolatedWorld', {
frameId: mainFrameId,
worldName: 'verdex-bridge',
grantUniveralAccess: false // Note: CDP API uses this spelling
});
// Code in this world:
// - Can access the DOM
// - Cannot be seen by page scripts
// - Re-injected on navigation via frame lifecycle events
2. Role Isolation via Browser Contexts
Each role (e.g., admin/user/customer) runs in its own incognito browser context, separating cookies, localStorage, and sessionStorage to prevent auth/data leakage between roles:
verdex-mcp \
--role admin ./auth/admin.json https://admin.example.com \
--role user ./auth/user.json https://app.example.com
The LLM switches between roles explicitly:
select_role("admin")
browser_navigate("/promotions/new")
browser_click("e1") // Creates promotion
select_role("user")
browser_navigate("/products")
// User sees the new promotion
The server handles context switching, session persistence, and storage isolation automatically. Each role maintains its own cookies, localStorage, and sessionStorage without leakage.
Browser contexts provide storage/session separation, while isolated worlds keep analysis code invisible to app scripts. Together, they eliminate session leakage and enable deterministic multi-role E2E workflows.
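As a rough sketch of that per-role isolation using Puppeteer browser contexts (role names, URLs, and the surrounding plumbing are illustrative, not the Verdex source):
import puppeteer, { Browser, BrowserContext, Page } from "puppeteer";
// Each role gets its own incognito-style browser context, so cookies,
// localStorage, and sessionStorage never leak between roles.
async function createRolePages(browser: Browser, roles: string[]): Promise<Map<string, Page>> {
  const pages = new Map<string, Page>();
  for (const role of roles) {
    // createBrowserContext() in Puppeteer v22+; older versions used
    // createIncognitoBrowserContext().
    const context: BrowserContext = await browser.createBrowserContext();
    pages.set(role, await context.newPage());
  }
  return pages;
}
async function main() {
  const browser = await puppeteer.launch();
  const pages = await createRolePages(browser, ["admin", "user"]);
  // Hypothetical flow: the admin creates a promotion, the user then sees it.
  await pages.get("admin")!.goto("https://admin.example.com/promotions/new");
  await pages.get("user")!.goto("https://app.example.com/products");
  await browser.close();
}
main().catch(console.error);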
3. Custom Accessibility Snapshot with Structural Mapping
Instead of using CDP's built-in Accessibility.getFullAXTree API, Verdex injects a custom snapshot generator into the isolated execution context. This generator walks the DOM following W3C ARIA specifications and produces a semantic view similar to Playwright MCP—with stable reference IDs (e1, e2, ...) that map directly to DOM nodes.
Why custom instead of CDP's accessibility API? CDP's accessibility tree omits structural metadata that test authors need: data-testid attributes, class names, and id values aren't included unless they're ARIA-referenced, and structural queries would require multiple round-trip calls.
Verdex's custom generator captures both the semantic accessibility view and stores a Map<string, ElementInfo> that indexes accessibility references to DOM nodes with full attribute metadata:
// During snapshot generation
const ref = `e${++this.bridge.counter}`;
ariaNode.ref = ref;
this.bridge.elements.set(ref, {
element: element, // Live DOM node reference
tagName: element.tagName,
role: role,
name: name,
attributes: this.getAttributes(element)
});
// Later: resolve_container(ref) walks the stored element's parent chain
// No DOM search, no re-parsing—just direct traversal from the mapped node
The key difference from Playwright MCP:
- Playwright MCP: Uses the mapping internally for aria-ref selector resolution—converting refs to elements for interactions (click, type, evaluate)
- Verdex: Exposes the mapping through structural exploration tools—resolve_container walks up from stored DOM nodes, inspect_pattern examines container children, extract_anchors traverses subtrees
The Verdex design enables LLMs to reason about DOM hierarchy and write container-scoped selectors.
Architecture Diagram
5.3 Bridge Architecture
The injected bridge code provides specialized analysis modules that run entirely within the isolated world:
- DOMAnalyzer — Efficient ancestor/sibling/descendant traversal with configurable depth limits
- StructuralAnalyzer — Pattern detection and stable anchor identification
- SnapshotGenerator — Custom accessibility tree builder following W3C ARIA specs with ref-to-DOM mapping
- AriaUtils — Role computation and semantic element mapping
These modules perform DOM analysis synchronously without protocol round-trips. Only final results cross the CDP boundary:
// Server side - single CDP call returns complete analysis
const { result } = await cdpSession.send('Runtime.callFunctionOn', {
functionDeclaration: 'function(ref) { return this.resolve_container(ref); }',
objectId: bridgeObjectId,
arguments: [{ value: ref }],
returnByValue: true
});
// Bridge side (in isolated world) - performs full traversal in-browser
window.verdexBridge = {
resolve_container(ref) {
// Synchronous DOM traversal
// Filters for stable containers
// Returns minimal JSON
}
};
This architecture keeps computation close to the data (DOM), minimizes serialization costs, and gives complete control over the token/accuracy tradeoff for each analysis primitive. The bridge code is bundled at build time using esbuild and auto-injected via CDP's Page.addScriptToEvaluateOnNewDocument, persisting across navigations without re-injection overhead.
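A rough sketch of that bundling-and-injection path; the entry point, world name, and inline bundling are illustrative (in the real project the bundle is produced at build time, not per run):
import { build } from "esbuild";
import type { CDPSession } from "puppeteer";
// Bundle the bridge into a single IIFE string (done ahead of time in practice).
async function bundleBridge(): Promise<string> {
  const result = await build({
    entryPoints: ["src/bridge/index.ts"], // hypothetical entry point
    bundle: true,
    format: "iife",
    write: false, // keep the bundle in memory instead of writing to disk
  });
  return result.outputFiles[0].text;
}
// Register the bridge so Chrome re-injects it into the isolated world on every
// new document, surviving SPA navigations without manual re-injection.
async function installBridge(cdp: CDPSession): Promise<void> {
  const source = await bundleBridge();
  await cdp.send("Page.addScriptToEvaluateOnNewDocument", {
    source,
    worldName: "verdex-bridge", // matches the isolated world created earlier
  });
}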
The bridge architecture took several iterations to get right. If you're curious about the journey from toString() injection to the current pre-bundled approach, I wrote about it here: Why I Ditched toString() and Moved to a Pre-Bundled Bridge with Event-Driven CDP Injection
5.4 LLM-Facing API Design Principles
Early on, when I started experimenting, I tried to make these tools "helpful" to the LLM by including an interpretation layer:
// BAD: Tool tries to interpret
{
"type": "product-card", // Tool guesses this is a product card
"role": "list-item", // Tool guesses it's in a list
"confidence": 0.85 // Tool guesses confidence
}
This didn't generalize:
1. Heuristics are brittle and context-dependent, and once an interpretation is delivered, it's hard for the model to override
- Is a div.card a product card, a user card, or a settings panel card?
- The LLM can know by considering the user's query and surrounding context
2. Different tasks need different interpretations
- Authoring: "Find stable anchors for a selector"
- Debugging: "Why did this selector break?"
- Refactoring: "What other tests use similar patterns?"
The same DOM structure means different things for different tasks.
3. Tool behavior became unpredictable
- Changing the heuristics to fix one case broke others
- Adding confidence scores didn't help: it's just a guess with a number next to it
- Prompt engineering couldn't override bad tool decisions
The solution: return raw structural facts (tags/attrs/positions) and let the model compose them per task.
I stripped all interpretation from the tools. They now return only raw structural facts:
// GOOD: Just facts
{
"tag": "div",
"attrs": {"data-testid": "product-card", "class": "card border-gray"},
"depth": 2
}
Core principle: Keep capability in tools, strategy in configuration. Tools provide deterministic building blocks; the LLM (guided by configuration) composes them contextually. This separation means:
- Tool mistakes become model mistakes (which the model can self-correct)
- Different tasks can interpret the same facts differently
- Heuristics stay overridable rather than baked into outputs
The LLM decides what this means based on:
- The user's query
- The task context
- Surrounding DOM structure
- Application-specific knowledge
Example: Same tool output, different interpretations
Task: Authoring a selector
LLM reasoning: "data-testid='product-card' is a stable anchor.
I should use getByTestId() to scope my selector."
Task: Debugging why a test broke
LLM reasoning: "The test used data-testid='product-card' but now
there are 12 elements with that ID. The developer probably copied
the component without updating testids. I should suggest filtering
by unique content or updating the testid."
Trade-off: Raw structural facts require compositional knowledge—LLMs need to learn when to call each primitive and how to interpret the results. This is where AI assistant configuration (covered earlier in "Teaching LLMs to Compose") becomes essential. Configuration files progressively teach the 3-step workflow (containers → patterns → anchors), selector composition patterns, and best practices—loading instructions only when triggered (~5k tokens) instead of requiring 12k+ token monolithic rules upfront. Without proper configuration, even strong models can struggle with correct tool composition.
Part 6: Lessons and Open Questions
6.1 Lessons Learned
LLM-Facing APIs Have Different Constraints Than Human-Facing APIs
LLM-facing APIs benefit from predictable, low-level primitives over 'helpful' abstractions. Deterministic building blocks compose better across unfamiliar UIs and edge cases.
Consider a simple example: humans appreciate a method like page.selectDropdown("Country", "United States") because it handles all the fiddly details. But an LLM actually does better with the raw primitives: page.click('select[name="country"]') followed by page.click('option:has-text("United States")'). The LLM can adapt this low-level pattern to dropdowns it's never encountered before. The high-level abstraction only works for exactly the cases it was designed to handle.
Deterministic Primitives Beat Opinionated Heuristics
Every single time I added "helpful" interpretation to a tool, the same pattern emerged. It worked for the test cases I built it for. Then it broke in production on edge cases I hadn't anticipated. And critically, the LLM had no way to override the bad interpretation because the tool had already made the decision.
Raw structural facts, on the other hand, stayed useful across every scenario I tested them on. The LLM could interpret them contextually based on the user's actual query, the application domain, and the specific task at hand. A div with a data-testid is just a div with a data-testid. What that means depends entirely on context, and the LLM is far better equipped to understand that context than any heuristic I could build into the tools.
Token Efficiency Enables Iteration
Making tools cheap to call fundamentally changed how I used them. The difference isn't just about the direct cost, though that matters. It's about the psychological shift. When each query costs fifty cents and takes fifteen seconds, you plan carefully. You hesitate. You make fewer exploratory calls. You write more conservative prompts because iteration is expensive.
When queries cost a penny and return in three seconds, everything changes. You explore freely. You try things. You discover patterns you'd never have found through careful planning. This exploration led directly to better prompts, which led to better results, which led to discovering even better patterns. The feedback loop between cheap tools and rapid iteration is very real.
Unlocking Complex Multi-User E2E Test Flows
One area specifically that unlocked for me was writing complex multi-user e2e test flows.
Because the app I was working on was a marketplace, three specific roles needed to interact with each other: an admin (the application admin), a provider (the sell side of the marketplace), and a customer (who buys from the provider).
The CTO wanted certain flows kept under test. As one simple example: the provider adds a product with a discount, the customer loads the product, the provider then changes details about the product, those changes are reflected in the customer's session, and finally the customer checks out.
Browser Isolation Is Essential for Multi-Role Workflows
Multi-role test scenarios—where you need to verify interactions between different user types—are extremely difficult to author manually. Managing multiple authenticated sessions, preventing cookie leakage, and tracking which context you're in creates substantial cognitive overhead.
Incognito browser contexts isolate storage and auth per role, while isolated worlds keep analysis code invisible to app scripts. Together, they reduce session leakage and flakiness in complex flows. Each role gets a dedicated browser context with proper session isolation, and the LLM just references roles by name without managing any of the underlying complexity.
The impact is larger than the feature itself: multi-role e2e tests go from being tedious and error-prone to straightforward. Tests that would take an hour to write manually now take minutes with LLM assistance, and they're more reliable because the isolation is handled at the browser protocol level.
6.2 Limitations and Trade-offs
This isn't a test runner and deliberately doesn't try to be one. It's tooling for authoring tests during development. You still execute your tests normally in Playwright. Trying to replace Playwright would make zero sense given how much work has gone into making it robust across browsers and platforms. The tool stays focused on its specific job: helping coding agents generate better Playwright code.
Initially, the system required strong LLMs to work well—weaker models struggled to chain the tool calls correctly. The introduction of proper AI assistant configuration significantly improves this. Configuration files teach the 3-step workflow and composition patterns, which helps models use the tools more effectively. That said, very weak models may still struggle, and the approach works best with Claude Sonnet 4+ or similar capability levels. I'm curious whether future model improvements or better configuration design will further lower this bar.
Token efficiency matters most in high-iteration workflows. If you're writing tests once and never touching them again, the difference between 1,200 tokens and 50,000 tokens per query is less significant. The tool is optimized for iterative refinement during development—exploring structure, debugging selectors, refining patterns—where token costs compound quickly. Progressive configuration amplifies this benefit: the ~100 token standing cost means you can have dozens of conversations without the 12k token overhead of monolithic rules.
The system handles iframes through a lazy expansion approach: iframes appear as markers in the main frame snapshot, then are recursively expanded with frame-qualified refs (e.g., f1_e3 for elements in the first iframe). This enables interactions with embedded content like Stripe checkout forms while gracefully handling cross-origin restrictions.
Finally, the tool provides limited action primitives beyond DOM exploration. It's focused on helping generate selectors and understand page structure, not on replicating Playwright's full action API. Complex interactions like drag-and-drop or hover sequences could be built on top of what exists today.
6.3 Open Questions
I'm confident the core approach—incremental exploration with deterministic primitives—is sound. But several specific design decisions deserve more debate.
Primitive decomposition: Containers, patterns, and anchors emerged as the right primitives for DOM exploration through iterative testing, but I don't have strong theoretical justification for why these are optimal. They map cleanly to the structural questions LLMs need to answer ("what contains this?", "does this pattern repeat?", "what's unique here?"), and they compose well for progressive disclosure. But are these actually the fundamental primitives, or just the ones I discovered first?
Token budget allocation: The 3k token budget I converged on came from trial and error. Is this actually optimal? Would doubling the budget for descendants improve accuracy enough to justify the cost? Or would it just give the LLM more irrelevant context to sift through? I'd need to run hundreds of test cases with different budget allocations to answer this properly.
Generalization beyond browsers: The pattern of "give minimal structural facts, let the LLM compose them" should work for other domains, but I don't know if it actually does. Database schema exploration, filesystem navigation, API discovery—these all have similar structural properties. Has anyone built comparable tooling? Did the same design principles hold, or did domain-specific constraints force different trade-offs?
Teaching composition: Right now I rely on few-shot examples and hope the model generalizes. This works with Sonnet 4+, but it's not a robust training approach. Can we systematically improve how models learn to chain these primitives? Would synthetic data generation help—creating thousands of DOM structures with corresponding optimal selector paths? Or is this just a scaling problem that disappears as models get stronger?
The most valuable thing would be seeing this approach tested on significantly different applications—particularly legacy codebases without test IDs, heavily dynamic SPAs, or applications with hostile DOM structures. That's where the design decisions would really get stress-tested.
6.4 Feedback from Early Adopters
I'd love to hear about:
- Your DOM structure: Legacy/modern? Test IDs? Accessibility patterns?
- Success rate: What % of selectors work without manual refinement?
- Token costs: Typical exploration sequences and total token usage
- Breaking points: Where did the approach fail or require fallbacks?
- Comparison: How does this compare to your current workflow?
Code and Architecture
The full implementation is available on GitHub at https://github.com/verdexhq/verdex-mcp. The codebase is TypeScript throughout, using Puppeteer for CDP access and the MCP SDK for the tool protocol. The bridge architecture is modular and designed to be extensible if you want to add new analysis capabilities.
This is an experimental project. I built it to solve specific pain points in my workflow around AI-assisted test authoring. It might be useful to others with similar workflows, or it might just be an interesting data point in the broader design space of LLM tooling. Either way, I learned a tremendous amount building it, and I'm curious what others think about these architectural trade-offs and design decisions.