TL;DR: Verdex initially treated iframes as invisible—snapshots captured the <iframe> tag but none of the content inside. This made Stripe checkouts, PayPal buttons, and embedded widgets completely opaque to the system. I rebuilt the bridge injection layer to support per-frame isolated worlds with lazy snapshot expansion, using Playwright's event-driven patterns. The result: nested iframe content appears in snapshots with frame-qualified refs (f1_e3), interactions route to the correct frame automatically, and the entire MCP API stayed identical.
GitHub: https://github.com/verdexhq/verdex-mcp
The Problem: When Your Test Tool Can't See the Payment Form
The first time I tried to use Verdex on a real checkout flow, I got a snapshot like this:
- button "Add to Cart" [ref=e1]
- iframe [ref=e2]
- button "Continue Shopping" [ref=e3]
The Stripe payment form was inside ref=e2. The snapshot showed the iframe element itself, but the actual payment fields—card number, expiration, CVC—were invisible. The LLM could see there was an iframe, but had no way to interact with anything inside it.
I tried click(ref=e2). It clicked the iframe element, which did nothing. The actual payment button was nested three layers deep: main page → iframe → shadow DOM → button. Verdex's bridge lived in the main page's isolated world. It had no visibility into child frames, no way to snapshot their content, and no way to route interactions.
The problem wasn't just payment forms. Google's OAuth prompts, embedded maps, chat widgets, video players—anything using iframes was a black box.
What I Built Instead
I needed per-frame bridge instances. Each frame (main and all iframes) gets its own isolated world with its own bridge. Snapshots recursively expand iframe markers by snapshotting child frames and merging the output with frame-qualified refs. Interactions parse the ref, look up the target frame, and route the method call there.
The architecture is lazy: bridges aren't injected into iframes until the first snapshot that needs them. This avoids overhead on pages with many hidden iframes (analytics trackers, ad pixels) while still handling interactive content.
ManualPromise: Playwright's Event-Driven Pattern
The core challenge was knowing when a frame's execution context was ready. The naive approach would have been polling with retries:
// ❌ Polling approach (what I didn't do)
async ensureFrameReady(frameId: string) {
  for (let i = 0; i < 10; i++) {
    try {
      await cdp.send('Runtime.evaluate', {
        expression: '1 + 1',
        contextId: guessedContextId
      });
      return; // It worked!
    } catch {
      await sleep(50 * Math.pow(2, i)); // Exponential backoff
    }
  }
  throw new Error('Frame never became ready');
}
That's brittle. Navigation timing varies. Arbitrary delays cause false timeouts. It's not deterministic.
Instead, I adopted Playwright's ManualPromise pattern:
export class ManualPromise<T = void> extends Promise<T> {
  private _resolve!: (value: T) => void;
  private _reject!: (error: Error) => void;
  private _isDone = false;
  constructor() {
    let resolve: (value: T) => void;
    let reject: (error: Error) => void;
    super((f, r) => {
      resolve = f;
      reject = r;
    });
    this._resolve = resolve!;
    this._reject = reject!;
  }
  // Derived promises (.then(), await) must be plain Promises. Without this,
  // the runtime would call this zero-argument constructor with an executor.
  static override get [Symbol.species]() {
    return Promise;
  }
  resolve(value: T): void {
    if (this._isDone) return;
    this._isDone = true;
    this._resolve(value);
  }
  reject(error: Error): void {
    if (this._isDone) return;
    this._isDone = true;
    this._reject(error);
  }
  isDone(): boolean {
    return this._isDone;
  }
}
Each frame gets a contextReadyPromise that resolves when Runtime.executionContextCreated fires for its isolated world:
type FrameState = {
  frameId: string;
  contextId: number;
  bridgeObjectId: string;
  contextReadyPromise: ManualPromise<void>;
};
// When CDP tells us the context exists, resolve the promise
cdp.on('Runtime.executionContextCreated', (evt) => {
  const ctx = evt.context;
  const frameId = ctx.auxData?.frameId;
  // auxData is optional, so guard both accesses
  const matchesWorld = ctx.name === this.worldName ||
    ctx.auxData?.name === this.worldName;
  if (matchesWorld && frameId) {
    const frameState = this.getOrCreateFrameState(cdp, frameId);
    frameState.contextId = ctx.id;
    frameState.contextReadyPromise.resolve(); // ← Event-driven!
  }
});
// Anywhere that needs a ready frame just awaits
async ensureFrameState(cdp: CDPSession, frameId: string): Promise<FrameState> {
  let state = this.getFrameState(cdp, frameId);
  if (state?.contextReadyPromise.isDone()) {
    return state; // Already ready
  }
  if (state) {
    await state.contextReadyPromise; // Wait for event
    return state;
  }
  // Create new state and trigger isolated world creation
  state = this.getOrCreateFrameState(cdp, frameId);
  await cdp.send('Page.createIsolatedWorld', {
    frameId,
    worldName: this.worldName,
  });
  await state.contextReadyPromise; // Wait for executionContextCreated event
  // Inject bundle into the now-ready context
  await cdp.send('Runtime.evaluate', {
    expression: BRIDGE_BUNDLE,
    contextId: state.contextId,
  });
  return state;
}
No polling. No retries. No arbitrary delays. The browser tells us when the context exists, and we immediately proceed. If a frame detaches during injection, the frameDetached event rejects the promise—no timeouts needed.
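To make the event-driven flow concrete without a browser, here is a minimal sketch. `FrameTracker`, its method names, and the `ContextCreated` shape are hypothetical stand-ins for the real CDP wiring, not Verdex's actual classes. The `ManualPromise` is a trimmed variant of the class above that includes a `Symbol.species` getter, so `await` and `.then()` produce plain promises.

```typescript
// Trimmed ManualPromise: a Promise that is settled from the outside.
class ManualPromise<T = void> extends Promise<T> {
  private _resolve!: (value: T) => void;
  private _reject!: (error: Error) => void;
  private _isDone = false;

  constructor() {
    let resolve!: (value: T) => void;
    let reject!: (error: Error) => void;
    super((f, r) => {
      resolve = f;
      reject = r;
    });
    this._resolve = resolve;
    this._reject = reject;
  }

  // Promises derived via await / .then() should be plain Promises.
  static override get [Symbol.species]() {
    return Promise;
  }

  resolve(value: T): void {
    if (this._isDone) return;
    this._isDone = true;
    this._resolve(value);
  }

  reject(error: Error): void {
    if (this._isDone) return;
    this._isDone = true;
    this._reject(error);
  }

  isDone(): boolean {
    return this._isDone;
  }
}

type ContextCreated = { frameId: string; contextId: number };

// Hypothetical tracker: one ManualPromise per frame, settled only by events.
class FrameTracker {
  private ready = new Map<string, ManualPromise<number>>();

  private entry(frameId: string): ManualPromise<number> {
    let p = this.ready.get(frameId);
    if (!p) {
      p = new ManualPromise<number>();
      p.catch(() => {}); // a detach nobody awaited is not an unhandled rejection
      this.ready.set(frameId, p);
    }
    return p;
  }

  // Stand-in for Runtime.executionContextCreated
  onContextCreated(evt: ContextCreated): void {
    this.entry(evt.frameId).resolve(evt.contextId);
  }

  // Stand-in for Page.frameDetached
  onFrameDetached(frameId: string): void {
    this.entry(frameId).reject(new Error(`Frame detached: ${frameId}`));
    this.ready.delete(frameId);
  }

  whenReady(frameId: string): Promise<number> {
    return this.entry(frameId);
  }

  isSettled(frameId: string): boolean {
    const p = this.ready.get(frameId);
    return p ? p.isDone() : false;
  }
}

const tracker = new FrameTracker();
const pendingContext = tracker.whenReady("frame-A"); // no event yet, so this just waits
tracker.onContextCreated({ frameId: "frame-A", contextId: 42 });
pendingContext.then((id) => console.log("context ready:", id)); // logs "context ready: 42"
```

The waiter never sleeps or retries: it is released by whichever event arrives first, either context creation or detachment.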
Frame Lifecycle Tracking
I added listeners for the full frame lifecycle:
// Frame appears (lazy injection - don't inject until needed)
cdp.on('Page.frameAttached', (evt) => {
  this.getOrCreateFrameState(cdp, evt.frameId);
});
// Frame disappears (reject pending promises, clean up state)
cdp.on('Page.frameDetached', (evt) => {
  const state = sessionStates.get(evt.frameId);
  if (state && !state.contextReadyPromise.isDone()) {
    state.contextReadyPromise.reject(new FrameDetachedError(evt.frameId));
  }
  sessionStates.delete(evt.frameId);
});
// Same-document navigation (SPA routing)
cdp.on('Page.navigatedWithinDocument', (evt) => {
  const state = sessionStates?.get(evt.frameId);
  if (state) {
    state.bridgeObjectId = ''; // Invalidate bridge instance, keep context
  }
});
// Cross-document navigation (full page reload)
cdp.on('Page.frameNavigated', (evt) => {
  if (evt.frame && !evt.frame.parentId) {
    // Context will be destroyed and recreated - clear state
    sessionStates.delete(evt.frame.id);
  }
});
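The listeners above amount to a small state machine per frame. As a rough sketch, the transitions can be written as a pure function that is easy to unit test. The `FrameEntry` and `FrameEvent` shapes and the `applyFrameEvent` name are assumptions made for illustration. The sketch also simplifies: it uses `null` rather than `''` for an invalidated bridge handle and drops state on any `frameNavigated`, not only for the main frame.

```typescript
type FrameEntry = { contextId: number | null; bridgeObjectId: string | null };

type FrameEvent =
  | { type: "frameAttached"; frameId: string }
  | { type: "contextCreated"; frameId: string; contextId: number }
  | { type: "bridgeInjected"; frameId: string; bridgeObjectId: string }
  | { type: "navigatedWithinDocument"; frameId: string }
  | { type: "frameNavigated"; frameId: string }
  | { type: "frameDetached"; frameId: string };

function applyFrameEvent(table: Map<string, FrameEntry>, evt: FrameEvent): void {
  const existing = table.get(evt.frameId);
  switch (evt.type) {
    case "frameAttached":
      // Lazy: remember the frame, inject nothing yet
      if (!existing) table.set(evt.frameId, { contextId: null, bridgeObjectId: null });
      break;
    case "contextCreated":
      // A fresh context never has a bridge in it
      table.set(evt.frameId, { contextId: evt.contextId, bridgeObjectId: null });
      break;
    case "bridgeInjected":
      if (existing) existing.bridgeObjectId = evt.bridgeObjectId;
      break;
    case "navigatedWithinDocument":
      // SPA routing: the context survives, the bridge handle does not
      if (existing) existing.bridgeObjectId = null;
      break;
    case "frameNavigated":
    case "frameDetached":
      // Context is gone: drop the state, it is recreated on next use
      table.delete(evt.frameId);
      break;
  }
}

const frameTable = new Map<string, FrameEntry>();
const lifecycle: FrameEvent[] = [
  { type: "frameAttached", frameId: "child" },
  { type: "contextCreated", frameId: "child", contextId: 3 },
  { type: "bridgeInjected", frameId: "child", bridgeObjectId: "obj-1" },
  { type: "navigatedWithinDocument", frameId: "child" },
];
for (const evt of lifecycle) applyFrameEvent(frameTable, evt);
console.log(frameTable.get("child")); // contextId 3 is kept, bridgeObjectId is reset to null
```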
Error Tracking and Failure Logs
The implementation includes comprehensive error tracking to help debug multi-frame issues:
type FrameExpansionError = {
  ref: string; // Which iframe ref failed
  error: string; // Error message
  detached: boolean; // Was it a detachment (expected) or other error?
  timestamp?: number; // When did it happen?
};
// Snapshots include expansion errors
interface Snapshot {
  text: string;
  elementCount: number;
  pageContext: { url: string; title: string };
  expansionErrors?: Array<{ ref: string; error: string; detached: boolean }>;
  warnings?: {
    inaccessibleFrames?: number;
    partialContent?: boolean;
    details?: string[];
  };
}
When frame expansion fails, the error is tracked and included in the snapshot. This gives LLMs context about why certain iframes might be missing, and helps developers debug cross-origin or timing issues.
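How the `warnings` field gets filled is not shown here, so the following is only one plausible sketch. It derives a warnings object from the collected expansion errors. The `summarizeExpansionErrors` name and the exact wording of the detail strings are assumptions.

```typescript
type FrameExpansionError = {
  ref: string;
  error: string;
  detached: boolean;
  timestamp?: number;
};

type SnapshotWarnings = {
  inaccessibleFrames?: number;
  partialContent?: boolean;
  details?: string[];
};

// Hypothetical helper: fold per-iframe failures into one warnings object.
function summarizeExpansionErrors(
  errors: FrameExpansionError[]
): SnapshotWarnings | undefined {
  if (errors.length === 0) return undefined; // nothing to warn about
  return {
    inaccessibleFrames: errors.length,
    partialContent: true,
    details: errors.map((e) =>
      e.detached ? `${e.ref}: frame detached during snapshot` : `${e.ref}: ${e.error}`
    ),
  };
}

const exampleWarnings = summarizeExpansionErrors([
  { ref: "e2", error: "Frame detached", detached: true },
  { ref: "e7", error: "Cannot find context", detached: false },
]);
console.log(exampleWarnings);
```

Returning `undefined` for the no-error case keeps clean snapshots free of an empty `warnings` object.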
Lazy Snapshot Expansion
At snapshot time, I recursively expand iframe markers:
async snapshot(): Promise<Snapshot> {
  const context = await this.ensureCurrentRoleContext();
  // Get main frame snapshot (includes iframe markers)
  const mainSnapshot = await context.bridgeInjector.callBridgeMethod(
    context.cdpSession,
    "snapshot",
    [],
    context.mainFrameId
  );
  // Build refIndex: main frame refs go in first
  const refIndex = new Map<string, RefIndexEntry>();
  const mainFrameRefs = mainSnapshot.text.matchAll(/\[ref=([^\]]+)\]/g);
  for (const match of mainFrameRefs) {
    const ref = match[1];
    refIndex.set(ref, {
      frameId: context.mainFrameId,
      localRef: ref
    });
  }
  // Recursively expand iframes
  const expanded = await this.expandIframes(
    context,
    mainSnapshot.text,
    context.mainFrameId,
    0, // ordinal counter
    refIndex
  );
  // Store refIndex on context for interaction routing
  context.refIndex = refIndex;
  // Build snapshot with error tracking
  const snapshot: Snapshot = {
    text: expanded.text,
    elementCount: mainSnapshot.elementCount + expanded.elementCount,
    pageContext: {
      url: context.page.url(),
      title: await context.page.title(),
    },
  };
  // Add expansion errors to snapshot if any occurred
  if (expanded.errors.length > 0) {
    snapshot.expansionErrors = expanded.errors;
  }
  return snapshot;
}
The expansion logic walks the snapshot text, finds iframe markers, resolves each to a child frame ID using CDP's DOM.describeNode, snapshots that frame, and merges the output:
private async expandIframes(
  context: RoleContext,
  snapshotText: string,
  currentFrameId: string,
  ordinalCounter: number,
  refIndex: GlobalRefIndex
): Promise<{
  text: string;
  elementCount: number;
  nextOrdinal: number;
  errors: Array<{ ref: string; error: string; detached: boolean }>;
}> {
  const lines = snapshotText.split('\n');
  const result: string[] = [];
  let totalElements = 0;
  let nextOrdinal = ordinalCounter;
  const errors: Array<{ ref: string; error: string; detached: boolean }> = [];
  for (const line of lines) {
    // Match: "- iframe [ref=e5]"
    const match = line.match(/^(\s*)- iframe(?:\s+"[^"]*")?\s+\[ref=([^\]]+)\]/);
    if (!match) {
      result.push(line);
      continue;
    }
    const indentation = match[1];
    const iframeRef = match[2];
    result.push(line + ':'); // Add colon to show children
    try {
      // Resolve iframe ref to child frame ID
      const frameInfo = await this.resolveFrameFromRef(
        context,
        currentFrameId,
        iframeRef
      );
      if (!frameInfo) {
        result.push(indentation + ' [Frame content unavailable]');
        continue;
      }
      const frameOrdinal = ++nextOrdinal;
      // Snapshot child frame
      const childSnapshot = await context.bridgeInjector.callBridgeMethod(
        context.cdpSession,
        "snapshot",
        [],
        frameInfo.frameId
      );
      // Recursively expand child's iframes
      const expandedChild = await this.expandIframes(
        context,
        childSnapshot.text,
        frameInfo.frameId,
        nextOrdinal,
        refIndex
      );
      nextOrdinal = expandedChild.nextOrdinal;
      // Count the child's own elements plus everything nested beneath it
      totalElements += childSnapshot.elementCount + expandedChild.elementCount;
      // Rewrite child refs: e1 → f1_e1
      // Uses RefFormatter utility for consistent ref formatting
      const rewritten = expandedChild.text.replace(
        /\[ref=(e[^\]]+)\]/g,
        (_whole, localRef) => {
          // Only rewrite local refs, not already-qualified refs
          if (!RefFormatter.isLocal(localRef)) {
            return `[ref=${localRef}]`; // Already qualified from nested iframe
          }
          const globalRef = RefFormatter.toGlobal(frameOrdinal, localRef);
          refIndex.set(globalRef, {
            frameId: frameInfo.frameId,
            localRef
          });
          return `[ref=${globalRef}]`;
        }
      );
      // Indent and merge
      for (const childLine of rewritten.split('\n')) {
        if (childLine.trim()) {
          result.push(indentation + ' ' + childLine);
        }
      }
    } catch (error) {
      // Track errors for snapshot warnings
      const isDetached = this.isFrameDetachedError(error);
      const errorMsg = error instanceof Error ? error.message : String(error);
      errors.push({
        ref: iframeRef,
        error: errorMsg,
        detached: isDetached,
      });
      // Frame detachment is normal (logged at debug level)
      if (isDetached) {
        console.debug(`Frame ${iframeRef} detached during expansion`);
        result.push(indentation + ' [Frame detached]');
      } else {
        // Unexpected errors logged as warnings
        console.warn(`Frame expansion error for ${iframeRef}:`, errorMsg);
        result.push(indentation + ` [Error: ${errorMsg}]`);
      }
    }
  }
  return {
    text: result.join('\n'),
    elementCount: totalElements,
    nextOrdinal,
    errors // Returned for snapshot warnings
  };
}
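The ordinal numbering and ref rewriting are easier to follow on in-memory data. Below is a simplified, synchronous sketch of the same recursion with no CDP involved. `FakeFrame`, `expandFake`, and the sample frames are hypothetical, the two-space child indent is an assumption, and only child-frame refs are indexed here (the real `snapshot()` adds main-frame refs first).

```typescript
// iframe ref -> child frame id, standing in for resolveFrameFromRef
type FakeFrame = { text: string; children: Record<string, string> };
type RefEntry = { frameId: string; localRef: string };

const IFRAME_LINE = /^(\s*)- iframe(?:\s+"[^"]*")?\s+\[ref=([^\]]+)\]/;

function expandFake(
  all: Record<string, FakeFrame>,
  frameId: string,
  counter: { last: number },
  refIndex: Map<string, RefEntry>
): string {
  const out: string[] = [];
  for (const line of all[frameId].text.split("\n")) {
    const match = line.match(IFRAME_LINE);
    if (!match) {
      out.push(line);
      continue;
    }
    const indent = match[1];
    const childId = all[frameId].children[match[2]];
    out.push(line + ":");
    if (!childId || !all[childId]) {
      out.push(indent + "  [Frame content unavailable]");
      continue;
    }
    const ordinal = ++counter.last; // claim the ordinal before recursing
    const childText = expandFake(all, childId, counter, refIndex);
    // Only bare local refs (e1, e2, ...) match; nested refs like f2_e1 are left alone
    const rewritten = childText.replace(
      /\[ref=(e[^\]]+)\]/g,
      (_whole: string, localRef: string) => {
        const globalRef = `f${ordinal}_${localRef}`;
        refIndex.set(globalRef, { frameId: childId, localRef });
        return `[ref=${globalRef}]`;
      }
    );
    for (const childLine of rewritten.split("\n")) {
      if (childLine.trim()) out.push(indent + "  " + childLine);
    }
  }
  return out.join("\n");
}

const fakeFrames: Record<string, FakeFrame> = {
  main: {
    text: '- button "Add to Cart" [ref=e1]\n- iframe [ref=e2]\n- button "Continue Shopping" [ref=e3]',
    children: { e2: "pay" },
  },
  pay: {
    text: '- textbox "Card Number" [ref=e1]\n- iframe [ref=e2]',
    children: { e2: "inner" },
  },
  inner: { text: '- button "Pay Now" [ref=e1]', children: {} },
};

const fakeIndex = new Map<string, RefEntry>();
const expandedText = expandFake(fakeFrames, "main", { last: 0 }, fakeIndex);
console.log(expandedText);
```

Because the ordinal is claimed before the recursive call, the outer payment frame becomes `f1` and the frame nested inside it becomes `f2`, so `f2_e1` routes to the innermost frame.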
Frame Resolution via CDP
To map an iframe element to its frame ID, I use DOM.describeNode:
private async resolveFrameFromRef(
  context: RoleContext,
  parentFrameId: string,
  iframeRef: string
): Promise<{ frameId: string } | null> {
  // Get a handle to the parent frame's bridge object
  const bridgeObjectId = await context.bridgeInjector.getBridgeHandle(
    context.cdpSession,
    parentFrameId
  );
  // Get the iframe element from the parent frame's bridge
  // Access bridge's elements Map directly to get the DOM element
  const { result } = await context.cdpSession.send('Runtime.callFunctionOn', {
    objectId: bridgeObjectId,
    functionDeclaration: `function(ref) {
      // Get the ElementInfo which contains the actual DOM element
      const info = this.elements.get(ref);
      if (!info) return null;
      // Verify it's an iframe
      if (info.tagName.toUpperCase() !== 'IFRAME') return null;
      // Return the element itself (will have objectId)
      return info.element;
    }`,
    arguments: [{ value: iframeRef }],
    returnByValue: false, // CRITICAL: Get as remote object, not value
  });
  if (!result.objectId) {
    console.warn(`No objectId for iframe ref ${iframeRef}`);
    return null;
  }
  // Use CDP to get the child frameId
  // pierce: true enables traversal into iframe's content document
  const { node } = await context.cdpSession.send('DOM.describeNode', {
    objectId: result.objectId,
    pierce: true,
  });
  // CDP returns either node.frameId or node.contentDocument.frameId depending on version
  const childFrameId = node.frameId || node.contentDocument?.frameId;
  if (!childFrameId) {
    console.warn(
      `Element ${iframeRef} has no associated frame (might be empty or not yet loaded)`
    );
    return null;
  }
  return { frameId: childFrameId };
}
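The frame ID fallback is the one piece of this function that can be checked without a browser. A small sketch, assuming a pared-down `DescribedNode` type (the real CDP node carries many more fields) and a hypothetical `childFrameIdOf` helper:

```typescript
// Pared-down stand-in for the node returned by DOM.describeNode
type DescribedNode = {
  nodeName: string;
  frameId?: string;
  contentDocument?: { frameId?: string };
};

// Prefer the owner element's frameId, fall back to the content document's
function childFrameIdOf(node: DescribedNode): string | null {
  return node.frameId || node.contentDocument?.frameId || null;
}

const viaOwner = childFrameIdOf({ nodeName: "IFRAME", frameId: "F-1" });
const viaDocument = childFrameIdOf({
  nodeName: "IFRAME",
  contentDocument: { frameId: "F-2" },
});
const notLoaded = childFrameIdOf({ nodeName: "IFRAME" });
console.log(viaOwner, viaDocument, notLoaded); // F-1 F-2 null
```

Returning `null` instead of throwing lets the caller emit a placeholder line for frames that are empty or not yet loaded.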
Interaction Routing
All interaction methods parse the ref to determine which frame it belongs to:
private parseRef(ref: string, context: RoleContext): {
  frameId: string;
  localRef: string
} {
  // Check if refIndex exists (should be populated by snapshot())
  if (!context.refIndex) {
    throw new Error(
      "No refIndex found. Take a snapshot first before interacting with elements."
    );
  }
  // Lookup in refIndex (includes both main frame and child frame refs)
  const entry = context.refIndex.get(ref);
  if (entry) {
    return { frameId: entry.frameId, localRef: entry.localRef };
  }
  // If not found, ref is stale or invalid
  throw new UnknownRefError(ref);
}
async click(ref: string): Promise<void> {
  const context = await this.ensureCurrentRoleContext();
  const { frameId, localRef } = this.parseRef(ref, context);
  await context.bridgeInjector.callBridgeMethod(
    context.cdpSession,
    "click",
    [localRef],
    frameId // ← Routes to correct frame
  );
}
The same pattern applies to type(), resolve_container(), inspect_pattern(), and extract_anchors():
async type(ref: string, text: string): Promise<void> {
  const context = await this.ensureCurrentRoleContext();
  const { frameId, localRef } = this.parseRef(ref, context);
  await context.bridgeInjector.callBridgeMethod(
    context.cdpSession,
    "type",
    [localRef, text],
    frameId // Routes to correct frame automatically
  );
}
async resolve_container(ref: string): Promise<any> {
  const context = await this.ensureCurrentRoleContext();
  const { frameId, localRef } = this.parseRef(ref, context);
  return await context.bridgeInjector.callBridgeMethod(
    context.cdpSession,
    "resolve_container",
    [localRef],
    frameId
  );
}
// inspect_pattern and extract_anchors follow the same pattern
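Since every method shares the same lookup, the routing can be exercised in isolation. The sketch below swaps the real bridge for a recorder. `FakeRouter`, `BridgeCall`, and the error message text are hypothetical; only the lookup-then-dispatch shape comes from the code above.

```typescript
type RefEntry = { frameId: string; localRef: string };
type BridgeCall = { frameId: string; method: string; args: unknown[] };

class UnknownRefError extends Error {
  constructor(ref: string) {
    super(`Unknown or stale ref: ${ref}`);
    this.name = "UnknownRefError";
  }
}

// Hypothetical router that records calls instead of talking to a browser.
class FakeRouter {
  calls: BridgeCall[] = [];
  private refIndex: Map<string, RefEntry> | undefined;

  constructor(refIndex?: Map<string, RefEntry>) {
    this.refIndex = refIndex;
  }

  private route(method: string, ref: string, extra: unknown[] = []): void {
    if (!this.refIndex) {
      throw new Error("No refIndex found. Take a snapshot first.");
    }
    const entry = this.refIndex.get(ref); // single Map lookup, no scanning
    if (!entry) throw new UnknownRefError(ref);
    // The bridge in the target frame only ever sees its own local ref
    this.calls.push({
      frameId: entry.frameId,
      method,
      args: [entry.localRef, ...extra],
    });
  }

  click(ref: string): void {
    this.route("click", ref);
  }

  type(ref: string, text: string): void {
    this.route("type", ref, [text]);
  }
}

const routingIndex = new Map<string, RefEntry>([
  ["e1", { frameId: "main", localRef: "e1" }],
  ["f1_e2", { frameId: "stripe-frame", localRef: "e2" }],
]);
const router = new FakeRouter(routingIndex);
router.type("f1_e2", "4242 4242 4242 4242");
router.click("e1");
console.log(router.calls);
```

A ref from an older snapshot simply misses the Map and surfaces as `UnknownRefError`, which is a clearer signal than a click landing in the wrong frame.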
Before and After
Old snapshot (iframe is opaque):
- button "Add to Cart" [ref=e1]
- iframe [ref=e2]
- button "Continue Shopping" [ref=e3]
New snapshot (iframe content visible):
- button "Add to Cart" [ref=e1]
- iframe [ref=e2]:
  - heading "Payment Information" [ref=f1_e1]
  - textbox "Card Number" [ref=f1_e2]
  - textbox "Expiration" [ref=f1_e3]
  - textbox "CVC" [ref=f1_e4]
  - button "Pay Now" [ref=f1_e5]
- button "Continue Shopping" [ref=e3]
Old interaction attempt:
await browser.click('e2'); // Clicked the iframe container (did nothing)
New interaction:
await browser.click('f1_e5'); // Clicks "Pay Now" inside the iframe
What Changed
Stripe checkout flows work. The test case I couldn't complete before—filling out payment details in an embedded iframe—now works end-to-end.
Nested iframes work. I can snapshot and interact with iframes inside iframes (tested 3 levels deep).
The refIndex provides O(1) routing. Every ref in the snapshot maps to { frameId, localRef }. No scanning, no heuristics, just a Map lookup.
Error tracking improves debugging. The implementation tracks frame expansion failures separately from injection failures, distinguishing between expected issues (frame detached) and unexpected errors. Snapshots include expansionErrors when iframes couldn't be accessed, and warnings appear in the output.
Event-driven lifecycle eliminated timing bugs. The old single-frame implementation had occasional "Cannot find execution context" errors during navigation. Those don't happen anymore because frame state tracking is event-driven, not timing-dependent.
Test suite expanded by 7 files and ~500 lines. I added comprehensive multi-frame tests:
- frame-discovery.spec.ts: Frame enumeration after navigation
- frame-resolution.spec.ts: Mapping iframe refs to frame IDs
- iframe-snapshot-expansion.spec.ts: Snapshot recursion
- interaction-routing.spec.ts: Interaction routing to child frames
- iframe-edge-cases.spec.ts: Empty iframes, cross-origin, dynamic injection
- multi-frame-bridge.spec.ts: Per-frame bridge isolation
- iframe-visual-demo.spec.ts: Visual debugging test with nested iframes
All 188 tests still pass. The refactor touched every interaction method, but I kept the API surface identical.
The API Stayed Identical
Users calling MCP tools see no difference:
// Before multi-frame support
const snapshot = await browser.snapshot();
await browser.click('e5');
const ancestors = await browser.resolve_container('e5');
// After multi-frame support (same calls, now works with iframe refs)
const snapshot = await browser.snapshot();
await browser.click('f1_e5'); // Can target iframe content
const ancestors = await browser.resolve_container('f1_e5'); // Works inside iframes
The ref format is the only visible change: main frame refs stay as e1, e2, etc., while child frame refs become f1_e1, f2_e3, etc. That's progressive enhancement—main-frame-only pages have identical snapshots.
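The ref grammar is small enough to pin down in a few lines. This is a sketch of what a `RefFormatter` could look like, not the actual utility: the `toGlobal` and `isLocal` call shapes follow the usage shown earlier, while the return type of `parse()` and the assumption that local refs are always `e` followed by digits are guesses.

```typescript
const RefFormatter = {
  // e3 in frame ordinal 1 -> f1_e3
  toGlobal(frameOrdinal: number, localRef: string): string {
    return `f${frameOrdinal}_${localRef}`;
  },

  // Local refs are unqualified: e1, e2, ...
  isLocal(ref: string): boolean {
    return /^e\d+$/.test(ref);
  },

  // frameOrdinal is null for main-frame refs; null overall for malformed input
  parse(ref: string): { frameOrdinal: number | null; localRef: string } | null {
    const m = /^(?:f(\d+)_)?(e\d+)$/.exec(ref);
    if (!m) return null;
    return { frameOrdinal: m[1] ? Number(m[1]) : null, localRef: m[2] };
  },
};

console.log(RefFormatter.toGlobal(1, "e3")); // f1_e3
console.log(RefFormatter.isLocal("f1_e3")); // false
console.log(RefFormatter.parse("f2_e7")); // frameOrdinal 2, localRef "e7"
```

Routing itself never needs `parse()`, since the refIndex lookup is authoritative. A parser like this is mainly useful for validation and error messages.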
Frame Detachment Detection
A critical helper distinguishes between expected frame lifecycle events and real errors:
private isFrameDetachedError(error: any): boolean {
  if (!error?.message) return false;
  const msg = error.message.toLowerCase();
  return (
    msg.includes('frame detached') ||
    msg.includes('frame has been detached') ||
    msg.includes('execution context was destroyed') ||
    msg.includes('cannot find context')
  );
}
This pattern appears throughout the codebase—frame detachment during operations is expected behavior, not an error. The code logs these at debug level while logging unexpected errors as warnings.
Architecture Principles
Lazy injection. Bridges aren't created until the first snapshot that needs them. This keeps performance high on iframe-heavy pages where most iframes are hidden analytics trackers.
Automatic filtering. Only iframes in the accessibility tree get expanded. Hidden iframes (display: none, aria-hidden="true") are excluded automatically because they don't appear in the snapshot.
Graceful degradation. If one iframe fails to expand (cross-origin restrictions, frame detached during snapshot), the rest of the snapshot still works. The failure appears inline as a placeholder such as [Frame content unavailable], [Frame detached], or [Error: ...].
Event-driven everything. Frame lifecycle, execution context creation, navigation—all driven by CDP events. No polling, no retries, no arbitrary delays.
What I Learned
Playwright's patterns are production-proven. I adopted ManualPromise, event-driven context tracking, and lazy injection, all ideas I took from Playwright. These patterns handle edge cases I wouldn't have thought of.
Iframes need role assignments to be interactive. Adding case "IFRAME": return "iframe" to AriaUtils.getImplicitRole() wasn't enough; I also had to add "IFRAME" to the INTERACTIVE_ELEMENTS array so SnapshotGenerator would assign refs. Without refs, frame resolution can't work.
Frame detachment is normal, not an error. Frames can detach during snapshot expansion (lazy-loaded iframes, dynamic removal). The code treats this as expected behavior, not a retry case.
CDP's DOM.describeNode has an ambiguity. The docs say node.frameId contains "Frame ID for frame owner elements," but the response can also carry the ID at node.contentDocument.frameId. I implemented a fallback: node.frameId || node.contentDocument?.frameId. Empirically, both work.
Same-document and cross-document navigation need different handling. SPA routing (navigatedWithinDocument) keeps the execution context alive but invalidates the bridge instance. Full navigation (frameNavigated) destroys the context entirely.
The refIndex must include main frame refs. Originally I only added child frame refs to the index, which meant parseRef() needed a fallback for main frame refs. Adding all refs to the index simplified the code and made routing uniform.
RefFormatter utility keeps ref handling consistent. Extracting ref formatting logic (toGlobal(), isLocal(), parse()) into a utility class prevented inline string manipulation bugs and made the ref format explicit: main frame uses e1, child frames use f1_e1.
Error categorization matters. The code distinguishes between frame detachment (expected, logged at debug level) and unexpected errors (logged as warnings). This prevents log noise while still surfacing real issues.
Trade-Offs
Snapshot time increases with iframe depth. A page with 3 nested iframes takes ~3x longer to snapshot than main-frame-only. The expansion is serial (depth-first), not parallel. For typical pages (0-2 iframes), the overhead is negligible.
Cross-origin iframes are non-injectable. Verdex's bridge cannot reach cross-origin iframe content through the page's CDP session: Chrome isolates such frames from the embedding page, and with site isolation they typically run as out-of-process iframes that CDP exposes as separate targets. I consider this correct behavior, since Verdex shouldn't bypass browser security. Cross-origin iframes appear as [Frame content unavailable] in snapshots.
The ref format changed. Frame-qualified refs (f1_e5) are longer than unqualified refs (e5). This adds ~5-10 tokens per iframe element to the snapshot. For typical usage (1-2 iframes with 5-10 elements each), that's ~50-100 tokens—acceptable for the functionality gained.
Memory overhead per frame. Each frame stores a FrameState with a ManualPromise, context ID, and bridge object ID. For pages with 10+ iframes, this adds ~1-2KB of memory per frame. Negligible for typical usage.
What's Next
Parallel frame expansion. Currently, iframe snapshot expansion is depth-first and serial. Switching to Promise.allSettled() for sibling iframes would reduce snapshot time on iframe-heavy pages.
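One way the sibling case might look, as a sketch rather than a plan: ordinals are fixed in document order before any work starts, so the output stays deterministic however the promises settle. `expandSiblings`, `toOutcome`, and the `snapshotChild` callback are hypothetical names, and nested iframes would still need their own numbering scheme, which this does not address.

```typescript
type Planned = { ref: string; ordinal: number };
type ExpansionOutcome = Planned & { text?: string; error?: string };

// Convert one settled result into either expanded text or a recorded error.
function toOutcome(
  plan: Planned,
  result: PromiseSettledResult<string>
): ExpansionOutcome {
  if (result.status === "fulfilled") {
    return { ref: plan.ref, ordinal: plan.ordinal, text: result.value };
  }
  const reason: unknown = result.reason;
  return {
    ref: plan.ref,
    ordinal: plan.ordinal,
    error: reason instanceof Error ? reason.message : String(reason),
  };
}

async function expandSiblings(
  iframeRefs: string[],
  lastOrdinal: number,
  snapshotChild: (ref: string, ordinal: number) => Promise<string>
): Promise<ExpansionOutcome[]> {
  // Assign ordinals up front, in document order
  const planned = iframeRefs.map((ref, i) => ({ ref, ordinal: lastOrdinal + i + 1 }));
  // allSettled never rejects, so one detached frame cannot sink its siblings
  const settled = await Promise.allSettled(
    planned.map((p) => snapshotChild(p.ref, p.ordinal))
  );
  return settled.map((result, i) => toOutcome(planned[i], result));
}

const siblingRun = expandSiblings(["e2", "e4"], 0, async (ref, ordinal) => {
  if (ref === "e4") throw new Error("Frame detached");
  return `- button "Pay Now" [ref=f${ordinal}_e1]`;
});
siblingRun.then((outcomes) => console.log(outcomes));
```

Results come back in the same order as `iframeRefs`, so merging them into the parent snapshot can stay a simple in-order walk.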
Frame metadata in snapshots. Adding iframe name, title, or id attributes to the snapshot output would give LLMs better context about what each frame represents: [Frame: name="stripe-checkout", title="Payment Form"].
Performance budgets. For pages with excessive iframes (20+ ad trackers), snapshot time could become problematic. Adding configurable depth limits (maxIframeDepth: 2) would cap recursion.
Cross-origin detection. Currently, cross-origin frames fail silently with [Frame content unavailable]. Detecting cross-origin explicitly and showing [Cross-origin frame - content not accessible] would give clearer feedback.
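A rough sketch of how that check could work, assuming the parent and child frame URLs are already known (for example from the frame information CDP reports). `isCrossOrigin` and `unavailableLabel` are hypothetical helpers, and treating `about:` URLs as same-origin is a simplification that ignores sandboxed frames.

```typescript
// Compare origins using the standard URL parser.
function isCrossOrigin(parentUrl: string, childUrl: string): boolean {
  try {
    const parentLoc = new URL(parentUrl);
    const childLoc = new URL(childUrl, parentUrl); // resolves relative src values
    // about:blank and about:srcdoc normally inherit the embedder's origin
    if (childLoc.protocol === "about:") return false;
    return parentLoc.origin !== childLoc.origin;
  } catch {
    return true; // unparseable URL: be conservative
  }
}

function unavailableLabel(parentUrl: string, childUrl: string): string {
  return isCrossOrigin(parentUrl, childUrl)
    ? "[Cross-origin frame - content not accessible]"
    : "[Frame content unavailable]";
}

const shopUrl = "https://shop.example/cart";
console.log(unavailableLabel(shopUrl, "https://js.stripe.com/v3/elements"));
console.log(unavailableLabel(shopUrl, "/widgets/chat"));
```

With a label like this, an LLM reading the snapshot can tell a frame it will never see into apart from one that simply failed to load.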
Conclusion
Multi-frame support turned Verdex from "works on simple pages" to "works on real e-commerce sites." Stripe checkout flows, OAuth popups, embedded widgets—all now visible and interactive. The architecture is event-driven, lazy, and graceful under failure.
The refactor touched 8 files and added ~730 lines of implementation code. The test suite grew by 500 lines. All existing tests pass. The MCP API surface didn't change. Users get iframe support without migration.
The critical insight: iframe support isn't about iframes—it's about making the bridge lifecycle event-driven and multi-context-aware. Once that foundation exists, frame expansion and interaction routing are straightforward.
The debugging is cleaner. The behavior is deterministic. The architecture scales to nested iframes. And users can finally fill out Stripe forms.