Johnny
How Verdex Sees Inside Iframes: Event-Driven Multi-Frame Support

TL;DR: Verdex initially treated iframes as invisible—snapshots captured the <iframe> tag but none of the content inside. This made Stripe checkouts, PayPal buttons, and embedded widgets completely opaque to the system. I rebuilt the bridge injection layer to support per-frame isolated worlds with lazy snapshot expansion, using Playwright's event-driven patterns. The result: nested iframe content appears in snapshots with frame-qualified refs (f1_e3), interactions route to the correct frame automatically, and the entire MCP API stayed identical.

GitHub: https://github.com/verdexhq/verdex-mcp

The Problem: When Your Test Tool Can't See the Payment Form

The first time I tried to use Verdex on a real checkout flow, I got a snapshot like this:

- button "Add to Cart" [ref=e1]
- iframe [ref=e2]
- button "Continue Shopping" [ref=e3]

The Stripe payment form was inside ref=e2. The snapshot showed the iframe element itself, but the actual payment fields—card number, expiration, CVC—were invisible. The LLM could see there was an iframe, but had no way to interact with anything inside it.

I tried click(ref=e2). It clicked the iframe element, which did nothing. The actual payment button was nested three layers deep: main page → iframe → shadow DOM → button. Verdex's bridge lived in the main page's isolated world. It had no visibility into child frames, no way to snapshot their content, and no way to route interactions.

The problem wasn't just payment forms. Google's OAuth prompts, embedded maps, chat widgets, video players—anything using iframes was a black box.

What I Built Instead

I needed per-frame bridge instances. Each frame (main and all iframes) gets its own isolated world with its own bridge. Snapshots recursively expand iframe markers by snapshotting child frames and merging the output with frame-qualified refs. Interactions parse the ref, look up the target frame, and route the method call there.

The architecture is lazy: bridges aren't injected into iframes until the first snapshot that needs them. This avoids overhead on pages with many hidden iframes (analytics trackers, ad pixels) while still handling interactive content.
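
To make the expand-and-route idea concrete, here is a minimal, self-contained sketch that runs against in-memory data instead of a browser. Everything in it (FakeFrame, RefEntry, expandFrame, route) is a hypothetical stand-in and not Verdex's actual code; it only assumes that local refs look like e1, e2 and that child frames are numbered in the order they are visited.

```typescript
// Hypothetical in-memory stand-ins for real frames: each frame has snapshot text
// and a map from iframe ref to the ID of the child frame it hosts.
type FakeFrame = { text: string; children: Record<string, string> };
type RefEntry = { frameId: string; localRef: string };

function expandFrame(
  frames: Record<string, FakeFrame>,
  frameId: string,
  counter: { next: number }, // shared so frame ordinals stay unique across the tree
  refIndex: Map<string, RefEntry>,
  prefix = "" // "" for the main frame, "f1_", "f2_", ... for child frames
): string {
  const frame = frames[frameId];
  const out: string[] = [];

  for (const line of frame.text.split("\n")) {
    // Qualify this frame's refs and record which frame each one lives in.
    const qualified = line.replace(/\[ref=(e\d+)\]/g, (_whole, localRef: string) => {
      const globalRef = prefix + localRef;
      refIndex.set(globalRef, { frameId, localRef });
      return `[ref=${globalRef}]`;
    });

    const iframe = line.match(/^(\s*)- iframe(?:\s+"[^"]*")?\s+\[ref=(e\d+)\]/);
    const childId = iframe ? frame.children[iframe[2]] : undefined;
    if (!iframe || !childId) {
      out.push(qualified);
      continue;
    }

    // Snapshot the child frame, then merge its lines under the iframe marker.
    const ordinal = ++counter.next;
    const childText = expandFrame(frames, childId, counter, refIndex, `f${ordinal}_`);
    out.push(qualified + ":");
    for (const childLine of childText.split("\n")) {
      out.push(iframe[1] + "  " + childLine);
    }
  }
  return out.join("\n");
}

// Routing an interaction is then a single Map lookup.
function route(refIndex: Map<string, RefEntry>, ref: string): RefEntry {
  const entry = refIndex.get(ref);
  if (!entry) throw new Error(`Unknown or stale ref: ${ref}`);
  return entry;
}
```

Calling expandFrame on the main frame with an empty Map fills the index as a side effect, so a later route(refIndex, "f1_e1") returns the child frame's ID together with the local ref that frame's bridge understands.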

ManualPromise: Playwright's Event-Driven Pattern

The core challenge was knowing when a frame's execution context was ready. The naive approach would have been to poll with retries:

// ❌ Polling approach (what I didn't do)
async ensureFrameReady(frameId: string) {
  for (let i = 0; i < 10; i++) {
    try {
      await cdp.send('Runtime.evaluate', {
        expression: '1 + 1',
        contextId: guessedContextId
      });
      return; // It worked!
    } catch {
      await sleep(50 * Math.pow(2, i)); // Exponential backoff
    }
  }
  throw new Error('Frame never became ready');
}

That's brittle. Navigation timing varies. Arbitrary delays cause false timeouts. It's not deterministic.

Instead, I adopted Playwright's ManualPromise pattern:

export class ManualPromise<T = void> extends Promise<T> {
  private _resolve!: (value: T) => void;
  private _reject!: (error: Error) => void;
  private _isDone = false;

  constructor() {
    let resolve: (value: T) => void;
    let reject: (error: Error) => void;
    super((f, r) => {
      resolve = f;
      reject = r;
    });
    this._resolve = resolve!;
    this._reject = reject!;
  }

  resolve(value: T): void {
    if (this._isDone) return;
    this._isDone = true;
    this._resolve(value);
  }

  reject(error: Error): void {
    if (this._isDone) return;
    this._isDone = true;
    this._reject(error);
  }

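  isDone(): boolean {
    return this._isDone;
  }

  // Derived promises (from then/catch/finally, and therefore from await) must be
  // plain Promises. Without this, the runtime calls this zero-argument constructor
  // with an executor it ignores, and then()/await fail with a TypeError.
  static override get [Symbol.species]() {
    return Promise;
  }
}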

Each frame gets a contextReadyPromise that resolves when Runtime.executionContextCreated fires for its isolated world:

type FrameState = {
  frameId: string;
  contextId: number;
  bridgeObjectId: string;
  contextReadyPromise: ManualPromise<void>;
};

// When CDP tells us the context exists, resolve the promise
cdp.on('Runtime.executionContextCreated', (evt) => {
  const ctx = evt.context;
  const frameId = ctx.auxData?.frameId;
  const matchesWorld = ctx.name === this.worldName ||
                       ctx.auxData?.name === this.worldName;

  if (matchesWorld && frameId) {
    const frameState = this.getOrCreateFrameState(cdp, frameId);
    frameState.contextId = ctx.id;
    frameState.contextReadyPromise.resolve(); // ← Event-driven!
  }
});

// Anywhere that needs a ready frame just awaits
async ensureFrameState(cdp: CDPSession, frameId: string): Promise<FrameState> {
  let state = this.getFrameState(cdp, frameId);

  if (state?.contextReadyPromise.isDone()) {
    return state; // Already ready
  }

  if (state) {
    await state.contextReadyPromise; // Wait for event
    return state;
  }

  // Create new state and trigger isolated world creation
  state = this.getOrCreateFrameState(cdp, frameId);

  await cdp.send('Page.createIsolatedWorld', {
    frameId,
    worldName: this.worldName,
  });

  await state.contextReadyPromise; // Wait for executionContextCreated event

  // Inject bundle into the now-ready context
  await cdp.send('Runtime.evaluate', {
    expression: BRIDGE_BUNDLE,
    contextId: state.contextId,
  });

  return state;
}

No polling. No retries. No arbitrary delays. The browser tells us when the context exists, and we immediately proceed. If a frame detaches during injection, the frameDetached event rejects the promise—no timeouts needed.
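
The wait-for-event flow can be tried without a browser. The sketch below is a simplified illustration, not Verdex's code: Deferred is a small stand-in for ManualPromise, ContextTracker is a hypothetical name, and onContextCreated and onFrameDetached are called by hand where the real system would react to CDP events.

```typescript
// Stand-in for ManualPromise: a promise whose resolve/reject are callable from outside.
class Deferred<T> {
  readonly promise: Promise<T>;
  resolve!: (value: T) => void;
  reject!: (error: Error) => void;

  constructor() {
    this.promise = new Promise<T>((res, rej) => {
      this.resolve = res;
      this.reject = rej;
    });
  }
}

// Hypothetical tracker: one deferred per frame, settled by lifecycle callbacks.
class ContextTracker {
  private pending = new Map<string, Deferred<number>>();

  private entry(frameId: string): Deferred<number> {
    let deferred = this.pending.get(frameId);
    if (!deferred) {
      deferred = new Deferred<number>();
      // Avoid unhandled-rejection noise if the frame detaches while nobody is waiting.
      deferred.promise.catch(() => {});
      this.pending.set(frameId, deferred);
    }
    return deferred;
  }

  // Settles with the context ID once the context-created callback fires.
  waitForContext(frameId: string): Promise<number> {
    return this.entry(frameId).promise;
  }

  // In the real system this would run on Runtime.executionContextCreated.
  onContextCreated(frameId: string, contextId: number): void {
    this.entry(frameId).resolve(contextId);
  }

  // In the real system this would run on Page.frameDetached.
  onFrameDetached(frameId: string): void {
    const deferred = this.pending.get(frameId);
    if (deferred) {
      deferred.reject(new Error(`Frame ${frameId} detached`));
      this.pending.delete(frameId);
    }
  }
}
```

A caller simply awaits tracker.waitForContext(frameId). It continues as soon as the context callback runs, or gets a rejection if the frame goes away first, so no timer is involved in either path.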

Frame Lifecycle Tracking

I added listeners for the full frame lifecycle:

// Frame appears (lazy injection - don't inject until needed)
cdp.on('Page.frameAttached', (evt) => {
  this.getOrCreateFrameState(cdp, evt.frameId);
});

// Frame disappears (reject pending promises, clean up state)
cdp.on('Page.frameDetached', (evt) => {
  const state = sessionStates.get(evt.frameId);
  if (state && !state.contextReadyPromise.isDone()) {
    state.contextReadyPromise.reject(new FrameDetachedError(evt.frameId));
  }
  sessionStates.delete(evt.frameId);
});

// Same-document navigation (SPA routing)
cdp.on('Page.navigatedWithinDocument', (evt) => {
  const state = sessionStates?.get(evt.frameId);
  if (state) {
    state.bridgeObjectId = ''; // Invalidate bridge instance, keep context
  }
});

// Cross-document navigation (full page reload)
cdp.on('Page.frameNavigated', (evt) => {
  if (evt.frame && !evt.frame.parentId) {
    // Context will be destroyed and recreated - clear state
    sessionStates.delete(evt.frame.id);
  }
});

Error Tracking and Failure Logs

The implementation includes comprehensive error tracking to help debug multi-frame issues:

type FrameExpansionError = {
  ref: string;        // Which iframe ref failed
  error: string;      // Error message
  detached: boolean;  // Was it a detachment (expected) or other error?
  timestamp?: number; // When did it happen?
};

// Snapshots include expansion errors
interface Snapshot {
  text: string;
  elementCount: number;
  pageContext: { url: string; title: string };
  expansionErrors?: FrameExpansionError[]; // same shape as above; timestamp is optional
  warnings?: {
    inaccessibleFrames?: number;
    partialContent?: boolean;
    details?: string[];
  };
}

When frame expansion fails, the error is tracked and included in the snapshot. This gives LLMs context about why certain iframes might be missing, and helps developers debug cross-origin or timing issues.
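
The article does not show how the warnings field gets populated, so the following is only one plausible way to derive it from the collected expansion errors. The function name summarizeExpansionErrors and the wording of the detail strings are assumptions made for illustration.

```typescript
type FrameExpansionError = {
  ref: string;
  error: string;
  detached: boolean;
  timestamp?: number;
};

type SnapshotWarnings = {
  inaccessibleFrames?: number;
  partialContent?: boolean;
  details?: string[];
};

// Hypothetical helper: turn raw expansion errors into the warnings block of a snapshot.
function summarizeExpansionErrors(
  errors: FrameExpansionError[]
): SnapshotWarnings | undefined {
  if (errors.length === 0) return undefined; // nothing to warn about

  return {
    inaccessibleFrames: errors.length,
    partialContent: true,
    details: errors.map((e) =>
      e.detached
        ? `${e.ref}: frame detached during expansion`
        : `${e.ref}: ${e.error}`
    ),
  };
}
```

Keeping the detached flag in the summary lets a consumer treat "the frame went away" differently from "something unexpected broke" without parsing error strings.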

Lazy Snapshot Expansion

At snapshot time, I recursively expand iframe markers:

async snapshot(): Promise<Snapshot> {
  const context = await this.ensureCurrentRoleContext();

  // Get main frame snapshot (includes iframe markers)
  const mainSnapshot = await context.bridgeInjector.callBridgeMethod(
    context.cdpSession,
    "snapshot",
    [],
    context.mainFrameId
  );

  // Build refIndex: main frame refs go in first
  const refIndex = new Map<string, RefIndexEntry>();
  const mainFrameRefs = mainSnapshot.text.matchAll(/\[ref=([^\]]+)\]/g);
  for (const match of mainFrameRefs) {
    const ref = match[1];
    refIndex.set(ref, { 
      frameId: context.mainFrameId, 
      localRef: ref 
    });
  }

  // Recursively expand iframes
  const expanded = await this.expandIframes(
    context,
    mainSnapshot.text,
    context.mainFrameId,
    0, // ordinal counter
    refIndex
  );

  // Store refIndex on context for interaction routing
  context.refIndex = refIndex;

  // Build snapshot with error tracking
  const snapshot: Snapshot = {
    text: expanded.text,
    elementCount: mainSnapshot.elementCount + expanded.elementCount,
    pageContext: {
      url: context.page.url(),
      title: await context.page.title(),
    },
  };

  // Add expansion errors to snapshot if any occurred
  if (expanded.errors.length > 0) {
    snapshot.expansionErrors = expanded.errors;
  }

  return snapshot;
}

The expansion logic walks the snapshot text, finds iframe markers, resolves each to a child frame ID using CDP's DOM.describeNode, snapshots that frame, and merges the output:

private async expandIframes(
  context: RoleContext,
  snapshotText: string,
  currentFrameId: string,
  ordinalCounter: number,
  refIndex: GlobalRefIndex
): Promise<{ 
  text: string; 
  elementCount: number; 
  nextOrdinal: number;
  errors: Array<{ ref: string; error: string; detached: boolean }>;
}> {
  const lines = snapshotText.split('\n');
  const result: string[] = [];
  let totalElements = 0;
  let nextOrdinal = ordinalCounter;
  const errors: Array<{ ref: string; error: string; detached: boolean }> = [];

  for (const line of lines) {
    // Match: "- iframe [ref=e5]"
    const match = line.match(/^(\s*)- iframe(?:\s+"[^"]*")?\s+\[ref=([^\]]+)\]/);

    if (!match) {
      result.push(line);
      continue;
    }

    const indentation = match[1];
    const iframeRef = match[2];

    result.push(line + ':'); // Add colon to show children

    try {
      // Resolve iframe ref to child frame ID
      const frameInfo = await this.resolveFrameFromRef(
        context, 
        currentFrameId, 
        iframeRef
      );

      if (!frameInfo) {
        result.push(indentation + '  [Frame content unavailable]');
        continue;
      }

      const frameOrdinal = ++nextOrdinal;

      // Snapshot child frame
      const childSnapshot = await context.bridgeInjector.callBridgeMethod(
        context.cdpSession,
        "snapshot",
        [],
        frameInfo.frameId
      );

      // Recursively expand child's iframes
      const expandedChild = await this.expandIframes(
        context,
        childSnapshot.text,
        frameInfo.frameId,
        nextOrdinal,
        refIndex
      );

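      nextOrdinal = expandedChild.nextOrdinal;

      // Count the child's own elements plus those of any frames nested inside it
      totalElements += childSnapshot.elementCount + expandedChild.elementCount;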

      // Rewrite child refs: e1 → f1_e1
      // Uses RefFormatter utility for consistent ref formatting
      const rewritten = expandedChild.text.replace(
        /\[ref=(e[^\]]+)\]/g,
        (_whole, localRef) => {
          // Only rewrite local refs, not already-qualified refs
          if (!RefFormatter.isLocal(localRef)) {
            return `[ref=${localRef}]`; // Already qualified from nested iframe
          }

          const globalRef = RefFormatter.toGlobal(frameOrdinal, localRef);
          refIndex.set(globalRef, { 
            frameId: frameInfo.frameId, 
            localRef 
          });
          return `[ref=${globalRef}]`;
        }
      );

      // Indent and merge
      for (const childLine of rewritten.split('\n')) {
        if (childLine.trim()) {
          result.push(indentation + '  ' + childLine);
        }
      }
    } catch (error) {
      // Track errors for snapshot warnings
      const isDetached = this.isFrameDetachedError(error);
      const errorMsg = error instanceof Error ? error.message : String(error);

      errors.push({
        ref: iframeRef,
        error: errorMsg,
        detached: isDetached,
      });

      // Frame detachment is normal (logged at debug level)
      if (isDetached) {
        console.debug(`Frame ${iframeRef} detached during expansion`);
        result.push(indentation + '  [Frame detached]');
      } else {
        // Unexpected errors logged as warnings
        console.warn(`Frame expansion error for ${iframeRef}:`, errorMsg);
        result.push(indentation + `  [Error: ${errorMsg}]`);
      }
    }
  }

  return { 
    text: result.join('\n'), 
    elementCount: totalElements, 
    nextOrdinal,
    errors // Returned for snapshot warnings
  };
}

Frame Resolution via CDP

To map an iframe element to its frame ID, I use DOM.describeNode:

private async resolveFrameFromRef(
  context: RoleContext,
  parentFrameId: string,
  iframeRef: string
): Promise<{ frameId: string } | null> {
  // Get a handle to the bridge object living in the parent frame
  const bridgeObjectId = await context.bridgeInjector.getBridgeHandle(
    context.cdpSession,
    parentFrameId
  );

  // Get the iframe element from the parent frame's bridge
  // Access bridge's elements Map directly to get the DOM element
  const { result } = await context.cdpSession.send('Runtime.callFunctionOn', {
    objectId: bridgeObjectId,
    functionDeclaration: `function(ref) { 
      // Get the ElementInfo which contains the actual DOM element
      const info = this.elements.get(ref);
      if (!info) return null;

      // Verify it's an iframe
      if (info.tagName.toUpperCase() !== 'IFRAME') return null;

      // Return the element itself (will have objectId)
      return info.element;
    }`,
    arguments: [{ value: iframeRef }],
    returnByValue: false, // CRITICAL: Get as remote object, not value
  });

  if (!result.objectId) {
    console.warn(`No objectId for iframe ref ${iframeRef}`);
    return null;
  }

  // Use CDP to get the child frameId
  // pierce: true enables traversal into iframe's content document
  const { node } = await context.cdpSession.send('DOM.describeNode', {
    objectId: result.objectId,
    pierce: true,
  });

  // CDP returns either node.frameId or node.contentDocument.frameId depending on version
  const childFrameId = node.frameId || node.contentDocument?.frameId;

  if (!childFrameId) {
    console.warn(
      `Element ${iframeRef} has no associated frame (might be empty or not yet loaded)`
    );
    return null;
  }

  return { frameId: childFrameId };
}

Interaction Routing

All interaction methods parse the ref to determine which frame it belongs to:

private parseRef(ref: string, context: RoleContext): { 
  frameId: string; 
  localRef: string 
} {
  // Check if refIndex exists (should be populated by snapshot())
  if (!context.refIndex) {
    throw new Error(
      "No refIndex found. Take a snapshot first before interacting with elements."
    );
  }

  // Lookup in refIndex (includes both main frame and child frame refs)
  const entry = context.refIndex.get(ref);
  if (entry) {
    return { frameId: entry.frameId, localRef: entry.localRef };
  }

  // If not found, ref is stale or invalid
  throw new UnknownRefError(ref);
}

async click(ref: string): Promise<void> {
  const context = await this.ensureCurrentRoleContext();
  const { frameId, localRef } = this.parseRef(ref, context);

  await context.bridgeInjector.callBridgeMethod(
    context.cdpSession,
    "click",
    [localRef],
    frameId // ← Routes to correct frame
  );
}

The same pattern applies to type(), resolve_container(), inspect_pattern(), and extract_anchors():

async type(ref: string, text: string): Promise<void> {
  const context = await this.ensureCurrentRoleContext();
  const { frameId, localRef } = this.parseRef(ref, context);

  await context.bridgeInjector.callBridgeMethod(
    context.cdpSession,
    "type",
    [localRef, text],
    frameId // Routes to correct frame automatically
  );
}

async resolve_container(ref: string): Promise<any> {
  const context = await this.ensureCurrentRoleContext();
  const { frameId, localRef } = this.parseRef(ref, context);

  return await context.bridgeInjector.callBridgeMethod(
    context.cdpSession,
    "resolve_container",
    [localRef],
    frameId
  );
}

// inspect_pattern and extract_anchors follow the same pattern

Before and After

Old snapshot (iframe is opaque):

- button "Add to Cart" [ref=e1]
- iframe [ref=e2]
- button "Continue Shopping" [ref=e3]

New snapshot (iframe content visible):

- button "Add to Cart" [ref=e1]
- iframe [ref=e2]:
    - heading "Payment Information" [ref=f1_e1]
    - textbox "Card Number" [ref=f1_e2]
    - textbox "Expiration" [ref=f1_e3]
    - textbox "CVC" [ref=f1_e4]
    - button "Pay Now" [ref=f1_e5]
- button "Continue Shopping" [ref=e3]

Old interaction attempt:

await browser.click('e2'); // Clicked the iframe container (did nothing)

New interaction:

await browser.click('f1_e5'); // Clicks "Pay Now" inside the iframe

What Changed

Stripe checkout flows work. The test case I couldn't complete before—filling out payment details in an embedded iframe—now works end-to-end.

Nested iframes work. I can snapshot and interact with iframes inside iframes (tested 3 levels deep).

The refIndex provides O(1) routing. Every ref in the snapshot maps to { frameId, localRef }. No scanning, no heuristics, just a Map lookup.

Error tracking improves debugging. The implementation tracks frame expansion failures separately from injection failures, distinguishing between expected issues (frame detached) and unexpected errors. Snapshots include expansionErrors when iframes couldn't be accessed, and warnings appear in the output.

Event-driven lifecycle eliminated timing bugs. The old single-frame implementation had occasional "Cannot find execution context" errors during navigation. Those don't happen anymore because frame state tracking is event-driven, not timing-dependent.

Test suite expanded by 7 files and ~500 lines. I added comprehensive multi-frame tests:

  • frame-discovery.spec.ts - Frame enumeration after navigation
  • frame-resolution.spec.ts - Mapping iframe refs to frame IDs
  • iframe-snapshot-expansion.spec.ts - Snapshot recursion
  • interaction-routing.spec.ts - Interaction routing to child frames
  • iframe-edge-cases.spec.ts - Empty iframes, cross-origin, dynamic injection
  • multi-frame-bridge.spec.ts - Per-frame bridge isolation
  • iframe-visual-demo.spec.ts - Visual debugging test with nested iframes

All 188 tests still pass. The refactor touched every interaction method, but I kept the API surface identical.

The API Stayed Identical

Users calling MCP tools see no difference:

// Before multi-frame support
const snapshot = await browser.snapshot();
await browser.click('e5');
const ancestors = await browser.resolve_container('e5');

// After multi-frame support (same calls, now works with iframe refs)
const snapshot = await browser.snapshot();
await browser.click('f1_e5'); // Can target iframe content
const ancestors = await browser.resolve_container('f1_e5'); // Works inside iframes

The ref format is the only visible change: main frame refs stay as e1, e2, etc., while child frame refs become f1_e1, f2_e3, etc. That's progressive enhancement—main-frame-only pages have identical snapshots.
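
The ref format is small enough to pin down in a few lines. The sketch below is a guess at what such a utility could look like, not the real RefFormatter: the name RefFormatterSketch, the use of ordinal 0 for the main frame, and the return shape of parse are all assumptions, as is the rule that local refs are the letter e followed by digits.

```typescript
// Hypothetical sketch of a ref-formatting utility (not the project's actual class).
const RefFormatterSketch = {
  // A local ref is unqualified: "e1", "e27", ...
  isLocal(ref: string): boolean {
    return /^e\d+$/.test(ref);
  },

  // Ordinal 0 is assumed to mean the main frame, which keeps refs unqualified.
  toGlobal(frameOrdinal: number, localRef: string): string {
    return frameOrdinal === 0 ? localRef : `f${frameOrdinal}_${localRef}`;
  },

  // Splits "f2_e3" into its parts; returns null for anything that is not a ref.
  parse(ref: string): { frameOrdinal: number; localRef: string } | null {
    const m = /^(?:f(\d+)_)?(e\d+)$/.exec(ref);
    if (!m) return null;
    return { frameOrdinal: m[1] ? Number(m[1]) : 0, localRef: m[2] };
  },
};
```

Because main-frame refs pass through toGlobal unchanged under this assumption, a page without iframes produces exactly the refs it did before.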

Frame Detachment Detection

A critical helper distinguishes between expected frame lifecycle events and real errors:

private isFrameDetachedError(error: any): boolean {
  if (!error?.message) return false;
  const msg = error.message.toLowerCase();
  return (
    msg.includes('frame detached') ||
    msg.includes('frame has been detached') ||
    msg.includes('execution context was destroyed') ||
    msg.includes('cannot find context')
  );
}

This pattern appears throughout the codebase—frame detachment during operations is expected behavior, not an error. The code logs these at debug level while logging unexpected errors as warnings.

Architecture Principles

Lazy injection. Bridges aren't created until the first snapshot that needs them. This keeps performance high on iframe-heavy pages where most iframes are hidden analytics trackers.

Automatic filtering. Only iframes in the accessibility tree get expanded. Hidden iframes (display: none, aria-hidden="true") are excluded automatically because they don't appear in the snapshot.

Graceful degradation. If one iframe fails to expand (cross-origin restrictions, frame detached during snapshot), the rest of the snapshot still works. The failure appears inline as a placeholder such as [Frame content unavailable], [Frame detached], or [Error: ...], depending on the cause.

Event-driven everything. Frame lifecycle, execution context creation, navigation—all driven by CDP events. No polling, no retries, no arbitrary delays.

What I Learned

  1. Playwright's patterns are production-proven. I adopted ManualPromise, event-driven context tracking, and lazy injection, all ideas I took from Playwright. These patterns handle edge cases I wouldn't have thought of.

  2. Iframes need role assignments to be interactive. Adding case "IFRAME": return "iframe" to AriaUtils.getImplicitRole() wasn't enough—I also had to add "IFRAME" to the INTERACTIVE_ELEMENTS array so SnapshotGenerator would assign refs. Without refs, frame resolution can't work.

  3. Frame detachment is normal, not an error. Frames can detach during snapshot expansion (lazy-loaded iframes, dynamic removal). The code treats this as expected behavior, not a retry case.

  4. CDP's DOM.describeNode has an ambiguity. The docs say node.frameId contains "Frame ID for frame owner elements," but the response can also carry the ID on node.contentDocument.frameId. I implemented a fallback: node.frameId || node.contentDocument?.frameId. Empirically, both work.

  5. Same-document navigation vs cross-document navigation need different handling. SPA routing (navigatedWithinDocument) keeps the execution context alive but invalidates the bridge instance. Full navigation (frameNavigated) destroys the context entirely.

  6. The refIndex must include main frame refs. Originally I only added child frame refs to the index, which meant parseRef() needed a fallback for main frame refs. Adding all refs to the index simplified the code and made routing uniform.

  7. RefFormatter utility keeps ref handling consistent. Extracting ref formatting logic (toGlobal(), isLocal(), parse()) into a utility class prevented inline string manipulation bugs and made the ref format explicit: main frame uses e1, child frames use f1_e1.

  8. Error categorization matters. The code distinguishes between frame detachment (expected, logged at debug level) and unexpected errors (logged as warnings). This prevents log noise while still surfacing real issues.

Trade-Offs

Snapshot time increases with iframe depth. A page with 3 nested iframes takes ~3x longer to snapshot than main-frame-only. The expansion is serial (depth-first), not parallel. For typical pages (0-2 iframes), the overhead is negligible.

Cross-origin iframes are non-injectable. In this design the bridge cannot be injected into cross-origin iframe content. Chrome usually isolates such frames in a separate process with its own CDP target, so the limit comes from process isolation rather than from CORS headers. Verdex does not try to work around that boundary, and cross-origin iframes appear as [Frame content unavailable] in snapshots.

The ref format changed. Frame-qualified refs (f1_e5) are longer than unqualified refs (e5). This adds ~5-10 tokens per iframe element to the snapshot. For typical usage (1-2 iframes with 5-10 elements each), that's ~50-100 tokens—acceptable for the functionality gained.

Memory overhead per frame. Each frame stores a FrameState with a ManualPromise, context ID, and bridge object ID. For pages with 10+ iframes, this adds ~1-2KB of memory per frame. Negligible for typical usage.

What's Next

Parallel frame expansion. Currently, iframe snapshot expansion is depth-first and serial. Switching to Promise.allSettled() for sibling iframes would reduce snapshot time on iframe-heavy pages.
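
As a rough sketch of what that could look like, the function below snapshots sibling iframes concurrently and sorts the outcomes into successes and errors. It is not part of Verdex: expandSiblings and the snapshotChild callback are hypothetical, and the callback stands in for whatever resolves a ref to a frame and calls the bridge.

```typescript
type ChildResult = { ref: string; text: string };
type SiblingError = { ref: string; error: string };

// Hypothetical helper: expand all sibling iframes at once instead of one by one.
async function expandSiblings(
  iframeRefs: string[],
  snapshotChild: (ref: string) => Promise<string>
): Promise<{ children: ChildResult[]; errors: SiblingError[] }> {
  // allSettled never rejects, so one failing frame cannot sink its siblings.
  const settled = await Promise.allSettled(
    iframeRefs.map((ref) => snapshotChild(ref))
  );

  const children: ChildResult[] = [];
  const errors: SiblingError[] = [];

  settled.forEach((outcome, i) => {
    const ref = iframeRefs[i];
    if (outcome.status === "fulfilled") {
      children.push({ ref, text: outcome.value });
    } else {
      const reason: unknown = outcome.reason;
      errors.push({
        ref,
        error: reason instanceof Error ? reason.message : String(reason),
      });
    }
  });

  return { children, errors };
}
```

Results come back in the same order as iframeRefs, which matters for merging. One design point to settle before going parallel is how frame ordinals are handed out, since the serial version numbers frames in visit order and concurrent expansion would otherwise make f1, f2 assignments depend on timing.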

Frame metadata in snapshots. Adding iframe name, title, or id attributes to the snapshot output would give LLMs better context about what each frame represents: [Frame: name="stripe-checkout", title="Payment Form"].
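
A formatter for that label could be as small as the sketch below. The function name formatFrameLabel and the bare [Frame] fallback are assumptions, and attribute values are inserted as-is without escaping embedded quotes.

```typescript
// Hypothetical formatter for the proposed frame label.
function formatFrameLabel(attrs: {
  name?: string;
  title?: string;
  id?: string;
}): string {
  const parts = (["name", "title", "id"] as const)
    .filter((key) => attrs[key]) // skip missing or empty attributes
    .map((key) => `${key}="${attrs[key]}"`);

  return parts.length > 0 ? `[Frame: ${parts.join(", ")}]` : "[Frame]";
}
```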

Performance budgets. For pages with excessive iframes (20+ ad trackers), snapshot time could become problematic. Adding configurable depth limits (maxIframeDepth: 2) would cap recursion.
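
A depth cap is easy to reason about on a plain tree. The sketch below is illustrative only: FrameNode and renderFrames are hypothetical, the main frame counts as depth 0, and the placeholder text for skipped frames is made up.

```typescript
// Hypothetical frame tree: a label plus the frames nested directly inside it.
type FrameNode = { label: string; children: FrameNode[] };

// Renders the tree, refusing to descend past maxIframeDepth (main frame = depth 0).
function renderFrames(
  node: FrameNode,
  maxIframeDepth: number,
  depth = 0
): string[] {
  const indent = "  ".repeat(depth);
  const lines = [`${indent}- ${node.label}`];

  for (const child of node.children) {
    if (depth + 1 > maxIframeDepth) {
      // Leave a visible marker so the reader knows content was cut off here.
      lines.push(`${indent}  [Frame skipped: depth limit ${maxIframeDepth} reached]`);
      continue;
    }
    lines.push(...renderFrames(child, maxIframeDepth, depth + 1));
  }
  return lines;
}
```

Emitting a marker instead of silently dropping the subtree keeps the snapshot honest: an LLM reading it can tell that deeper content exists but was not expanded.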

Cross-origin detection. Currently, cross-origin frames fail silently with [Frame content unavailable]. Detecting cross-origin explicitly and showing [Cross-origin frame - content not accessible] would give clearer feedback.

Conclusion

Multi-frame support turned Verdex from "works on simple pages" to "works on real e-commerce sites." Stripe checkout flows, OAuth prompts, and embedded widgets are all now visible and interactive. The architecture is event-driven, lazy, and graceful under failure.

The refactor touched 8 files and added ~730 lines of implementation code. The test suite grew by ~500 lines across 7 new files. All existing tests pass. The MCP API surface didn't change. Users get iframe support without migration.

The critical insight: iframe support isn't about iframes—it's about making the bridge lifecycle event-driven and multi-context-aware. Once that foundation exists, frame expansion and interaction routing are straightforward.

The debugging is cleaner. The behavior is deterministic. The architecture scales to nested iframes. And users can finally fill out Stripe forms.
