DEV Community

郭立

Originally published at blog.leeguoo.com

Letting an AI Agent Click Into Cross-Origin Iframes (How chrome-use Solves It)

💡 Originally published on my blog, blog.leeguoo.com: field notes on reverse engineering, AI agents, and building things that ship.

Connecting an AI agent to a browser starts out smoothly: open a page, read the content, fill in a search box. Where you really get stuck is forms hidden inside cross-origin iframes: Google Payments payout profiles, checkout components, KYC widgets. The agent can read the text inside them and fill in values, but it cannot click the “Save” button. It can see the task, but it can’t finish it.

An agent tries to reach into a “window inside a window,” only to click nothing

This is a write-up of how we got past that hurdle. The protagonist is chrome-use—a Rust-based browser automation CLI for agents that directly drives the Chrome browser where you are actually logged in, without Playwright and without headless mode.

Why cross-origin iframes are so hard

Regular pages are easy: capture the accessibility tree, get element references, and click. But cross-origin iframes—for example, an adsense.google.com page embedding a payments.google.com iframe—hit three problems at once:

  1. Selectors can’t get in. Under the same-origin policy, CSS selectors and eval running in the outer document can’t touch the DOM inside the iframe. document.querySelector is blind here.
  2. Scrolling misses the target. You think you’re scrolling the page, but the thing that can actually scroll is the scroll container inside the iframe. Wheel events go to the outer document while the inside stays still, so the target row stays “off screen” forever and never becomes visible.
  3. You’re left blindly clicking coordinates. The first two problems force you back to “screenshot + guess pixel coordinates,” which is the least precise approach and the easiest way to click a neighboring field by mistake. On a form that edits global payment profile information, one wrong click can be costly.

The foundation of chrome-use: agents get “references,” not HTML

Before explaining the fix, it’s worth covering the basic design—because this is also the fundamental difference between chrome-use and the camp that feeds raw HTML to models.

Replacing a scary blob of HTML with clean references like @e1 @e2 @e3

chrome-use does not hand page source to the agent. Instead, it captures an accessibility tree snapshot, assigning each interactive element a compact reference:

- textbox "Email" [ref=e2]
- listbox "Country/region" [ref=e60]
- button "Save" [ref=e41]

The agent acts directly on those references: fill @e2 "...", click @e41. A page costs roughly 200–400 tokens instead of a whole screen of DOM noise. This reference mechanism is also what makes working through iframes possible later: as long as the snapshot can “see” nodes inside the iframe, it can assign references to them.
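To make the reference idea concrete, here is a minimal sketch of how an agent-side harness might index snapshot lines by reference. The line pattern is inferred from the excerpt above rather than from a published grammar, and parse_snapshot and find_ref are hypothetical helper names, not part of chrome-use.

```python
import re
from dataclasses import dataclass

# Matches lines shaped like:  - button "Save" [ref=e41]
# Assumption: this shape is inferred from the excerpt in this post only.
LINE_RE = re.compile(
    r'^\s*-\s+(?P<role>\w+)\s+"(?P<name>[^"]*)"\s+\[ref=(?P<ref>e\d+)\]\s*$'
)


@dataclass(frozen=True)
class Node:
    role: str
    name: str
    ref: str


def parse_snapshot(text):
    """Index snapshot lines by ref; lines that do not match are skipped."""
    nodes = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            nodes[match["ref"]] = Node(match["role"], match["name"], match["ref"])
    return nodes


def find_ref(nodes, role, name):
    """Return '@eN' for the first node with this role and name, else None."""
    for node in nodes.values():
        if node.role == role and node.name == name:
            return "@" + node.ref
    return None


SNAPSHOT = """- textbox "Email" [ref=e2]
- listbox "Country/region" [ref=e60]
- button "Save" [ref=e41]"""

nodes = parse_snapshot(SNAPSHOT)
print(find_ref(nodes, "button", "Save"))  # prints @e41
```

Looking elements up by role and accessible name keeps the agent's plan independent of where the node lives in the DOM, which is the property that matters once iframes are involved.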

Three hurdles, one at a time

First hurdle: make the snapshot see what’s inside the iframe.
The accessibility tree needs to pass through cross-origin iframes and include their nodes with references. After fixing that, snapshot can list them directly:

- textbox "Phone number (optional)" [ref=e59]
- listbox "Country/region code: Japan (+81)" [ref=e60]

Where selectors can’t enter, references can.

Second hurdle: make scrolling affect the iframe’s scroll container.
Instead of sending every wheel event to the outer document, scroll the container that actually needs to scroll. The lower form rows can finally move into view, and their references become available.
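One way to picture “scroll the container that actually needs to scroll” is as a hit test: walk down the nested boxes under the pointer and keep the deepest one that can scroll. The sketch below is a toy geometry model under that assumption. Box and scroll_target are hypothetical names, and the post does not say chrome-use selects its target this way.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """A rectangle in viewport coordinates (toy stand-in for a layout box)."""
    name: str
    x: float
    y: float
    w: float
    h: float
    scrollable: bool = False
    children: list = field(default_factory=list)

    def contains(self, px, py):
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h


def scroll_target(root, px, py):
    """Return the deepest scrollable box under (px, py), or None."""
    best = None
    node = root
    while node is not None and node.contains(px, py):
        if node.scrollable:
            best = node
        # Descend into the first child that also contains the point.
        node = next((c for c in node.children if c.contains(px, py)), None)
    return best


# Outer page -> cross-origin iframe -> scrollable form list inside the iframe.
rows = Box("iframe-form-rows", 220, 150, 560, 400, scrollable=True)
iframe = Box("payments-iframe", 200, 100, 600, 500, children=[rows])
page = Box("outer-page", 0, 0, 1280, 800, scrollable=True, children=[iframe])

print(scroll_target(page, 300, 300).name)  # prints iframe-form-rows
print(scroll_target(page, 50, 50).name)    # prints outer-page
```

In this model, always scrolling the root reproduces the bug described above: the outer page moves (or does nothing) while the form rows inside the iframe stay put.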

Third hurdle, the hardest one: the enabled submit button inside the cross-origin iframe does nothing when clicked.
This stage is the most maddening because everything looks right:

  • The number is entered with real keystrokes, and get value confirms it is there;
  • The “Save” button becomes enabled when it should: it is disabled before a valid value is entered, then becomes clickable after filling;
  • Then click @e41—and the form does nothing. find text "Save"? Cross-origin access blocks it. Focus and press Enter or Space? Still nothing.

It looks correct, yet everything is wrong. The root cause is twofold: Material/framework buttons inside cross-origin iframes do not accept synthetic clicks, and fill only changed the input value without dispatching the input/change events the framework expects. The form still thinks “nothing changed,” so the Save button either stays disabled or does nothing when clicked.
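The “value is there, but the form thinks nothing changed” half of this can be shown with a toy model. The sketch below assumes a framework that updates its internal state only from input events; ToyForm and its methods are hypothetical and do not correspond to any real framework API.

```python
class ToyForm:
    """Stand-in for a framework-managed form: it trusts events, not the DOM."""

    def __init__(self):
        self.dom_value = ""    # what reading the field back would return
        self.model_value = ""  # what the framework believes the field holds

    def set_value_silently(self, text):
        # Like assigning the input's value directly: no events are fired.
        self.dom_value = text

    def type_text(self, text):
        # Like real keystrokes: every character fires an input event.
        for ch in text:
            self.dom_value += ch
            self._on_input()

    def _on_input(self):
        self.model_value = self.dom_value

    def save(self):
        # The framework submits only what its own model has seen.
        if not self.model_value:
            return "no-op"
        return "saved " + self.model_value


silent = ToyForm()
silent.set_value_silently("0312345678")
print(silent.dom_value, "->", silent.save())  # prints 0312345678 -> no-op

typed = ToyForm()
typed.type_text("0312345678")
print(typed.dom_value, "->", typed.save())    # prints 0312345678 -> saved 0312345678
```

Both forms read back the same value, which is why a “get value” check passes in either case and only an actual save attempt exposes the difference.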

The fix has two parts. Value entry switches to real keystrokes, so every character triggers real events that the framework recognizes. Clicking dispatches a full sequence of real mouse/keyboard activations against the content node inside the iframe, rather than calling click() on it.
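As a rough illustration of what “real” input can look like at the protocol level, the sketch below builds Chrome DevTools Protocol messages for a move/press/release click and for per-character key events. Using CDP here is an assumption: the post does not say which transport or calls chrome-use relies on. click_messages and type_messages are hypothetical helpers that only construct the payloads; sending them over a DevTools connection, and mapping an element inside the iframe to viewport coordinates, are left out.

```python
import json


def click_messages(x, y, start_id=1):
    """Build Input.dispatchMouseEvent messages for one left click at (x, y).

    Assumption: (x, y) is already expressed in main-viewport CSS pixels.
    """
    steps = [
        ("mouseMoved", "none", 0),
        ("mousePressed", "left", 1),
        ("mouseReleased", "left", 1),
    ]
    return [
        {
            "id": start_id + i,
            "method": "Input.dispatchMouseEvent",
            "params": {
                "type": kind, "x": x, "y": y,
                "button": button, "clickCount": count,
            },
        }
        for i, (kind, button, count) in enumerate(steps)
    ]


def type_messages(text, start_id=1):
    """Build keyDown/keyUp Input.dispatchKeyEvent pairs, one per character."""
    messages = []
    next_id = start_id
    for ch in text:
        for kind, params in (
            ("keyDown", {"type": "keyDown", "key": ch, "text": ch}),
            ("keyUp", {"type": "keyUp", "key": ch}),
        ):
            messages.append(
                {"id": next_id, "method": "Input.dispatchKeyEvent", "params": params}
            )
            next_id += 1
    return messages


for message in type_messages("81") + click_messages(120, 340, start_id=5):
    print(json.dumps(message))
```

The point of the sequence is that the page receives the same ordered events a person would produce, instead of a single scripted click() call or a direct value assignment.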

The finish: click in, save successfully

An agent wearing a party hat reaches into an iframe, successfully presses SAVE, and gets a green saved checkmark

Once all three hurdles are cleared, the whole chain works: open → scroll to the target row → capture references from the snapshot → fill with real keystrokes → press Save. The deadlock of “can read it, can’t complete it” ends there.

A few hard-earned lessons for others building agent browser automation

  • Prefer accessibility references; don’t default to clicking screenshot coordinates. Once the snapshot can see the iframe, references are always more stable than guessing pixels. Save screenshots for cases that truly have no structure, like canvas or WebGL.
  • Cross-origin iframes are a clear boundary. Selectors and eval stop there. Either your tool can penetrate the a11y tree, or you are left blindly clicking.
  • Test whether you can submit, not just whether you can fill. A value being entered does not mean the framework received it. Bugs like fill not dispatching events only show up when you actually try to save.
  • If you can use a real logged-in browser, don’t use headless. Login state, cookies, and extensions are all already there, and there is no automation fingerprint. That is also why chrome-use takes the path of “driving your own Chrome.”

Try it

curl -fsSL https://raw.githubusercontent.com/leeguooooo/chrome-use/main/install.sh | sh

The repository is at github.com/leeguooooo/chrome-use. I keep building tools like this—tools that use your own subscriptions and connect agents to real browsers/devices—and I post updates on X @leeguooooo.


Links

  • 🔧 The tool: chrome-use on GitHub, for driving your real, logged-in Chrome from any AI agent.
  • 📝 More writing: blog.leeguoo.com. I'm Guo Li (leeguoo), a full-stack dev building small AI-agent tools and CLIs.
  • 💬 Found it useful? A ⭐ on the repo or a follow here means a lot.
