Omid Seyfan

Originally published at smoketest.sh

How We Made Our AI Browser Agent Stop Clicking the Wrong Button

At Smoketest.sh, you describe a flow in a sentence ("log in, add a paid seat, confirm the invoice updates") and an AI agent runs it in a real browser. The agent reads the page, decides what to do, and drives Playwright to do it.

The first version worked great in the demo and fell apart on the second run. This is the story of why, and the fix that made element targeting reliable: never let the model invent a selector. Hand it stable IDs from the accessibility tree and make it point at those.

TL;DR

  • Letting an LLM target page elements by natural-language description is flaky. The description is regenerated every run and rarely resolves to exactly one element.
  • page.ariaSnapshot({ mode: 'ai' }) returns the page as a role-based tree and stamps every interactive element with a stable [ref=eN] ID.
  • Playwright resolves aria-ref=eN as a first-class locator, so the model can act on the exact element it just saw.
  • Make the model cite refs from the tree and keep description as a fallback only. The wrong-element problem mostly disappears.

Here is how each piece works.

Why "click the Sign in button" is flaky

The naive design is the obvious one. Give the model a click tool that takes a description, and let it figure out the rest:

// tempting, and wrong
click({ description: "the Sign in button" })

Under the hood you turn that string into a locator. On a clean login page, getByRole('button', { name: 'Sign in' }) finds exactly one element and it works. Ship it, watch the demo pass, feel good.

Then it meets a real app:

  • Three things match "Sign in": a nav link, a footer link, and the actual button. The locator resolves to all three, so Playwright's strict mode throws on the click. Work around that with .first() and you click whichever match comes first in the DOM, which navigates somewhere you did not expect.
  • The button text is "Sign In" this week and "Log in" after a copy change. The description the model wrote last run no longer matches.
  • The model rewords its own description between runs. "the Sign in button" becomes "the blue login button at the top right," and now your role-and-name lookup misses entirely.

None of these are bugs in the model. They are the consequence of using a regenerated English phrase as a selector. The phrase is fuzzy by construction, and fuzzy selectors on a busy page do not resolve to one element.
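
To make those failure modes concrete, here is a toy model. It is not Playwright: a page is just a flat list of elements, and the lookup is a case-insensitive substring match on the name, loosely imitating a non-exact text lookup. The PageElement type and the matchByName helper are invented for illustration.

```typescript
// Toy model of a page: NOT Playwright, just enough to show the ambiguity.
type PageElement = { role: 'button' | 'link'; name: string; region: string };

const loginPage: PageElement[] = [
  { role: 'link', name: 'Sign in', region: 'nav' },
  { role: 'button', name: 'Sign in', region: 'login form' },
  { role: 'link', name: 'Sign in', region: 'footer' },
];

// Hypothetical helper: case-insensitive substring match on the element name.
function matchByName(elements: PageElement[], phrase: string): PageElement[] {
  const needle = phrase.trim().toLowerCase();
  return elements.filter((el) => el.name.toLowerCase().includes(needle));
}

// Same wording as the page copy: three candidates, only one is the button.
const sameWording = matchByName(loginPage, 'Sign in');

// The model rewords its own description: nothing matches at all.
const reworded = matchByName(loginPage, 'blue login button at the top right');

console.log(sameWording.length, reworded.length); // 3 0
```

Neither result is "exactly one element," which is the only result a selector is allowed to have.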

The accessibility tree is the agent's source of truth

The fix starts by changing what the model looks at. Instead of letting it guess from a screenshot or raw HTML, we hand it Playwright's accessibility snapshot in AI mode, a compact view of the page's accessibility tree. That is one tool:

{
  name: 'getAccessibilityTree',
  description:
    'Return a structured representation of page content as an accessibility tree to understand the page.',
  parameters: { type: 'object', properties: {} },
  execute: async () => {
    const tree = await page.ariaSnapshot({ mode: 'ai' });
    return { tree };
  },
}

page.ariaSnapshot({ mode: 'ai' }) returns the page as a compact, role-based tree. The important part of AI mode: every interactive element gets a [ref=eN] tag. A login page comes back looking roughly like this:

- heading "Welcome back" [level=1]
- textbox "Email" [ref=e4]
- textbox "Password" [ref=e5]
- button "Sign in" [ref=e6]
- link "Forgot password?" [ref=e7]

The model no longer has to describe the button. It can refer to e6. That ref is the contract between "what the model saw" and "what Playwright clicks," and it is the whole game.
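
If you want to check that a ref the model cites really appeared in the last snapshot, the tree is easy to index. The sketch below assumes each element line has the shape shown above (a dash, a role, an optional quoted name, then a [ref=...] tag); real snapshots can carry more attributes and nesting, so treat the pattern as a starting point. extractRefs and RefEntry are hypothetical names, not part of Playwright.

```typescript
type RefEntry = { ref: string; role: string; name: string | null };

// Assumed line shape: - role "optional name" [other attrs] [ref=eN]
const REF_LINE = /^\s*-\s+([\w-]+)(?:\s+"([^"]*)")?.*?\[ref=([^\]\s]+)\]/;

// Hypothetical helper: list every element in the snapshot that carries a ref.
function extractRefs(tree: string): RefEntry[] {
  const entries: RefEntry[] = [];
  for (const line of tree.split('\n')) {
    const match = REF_LINE.exec(line);
    if (!match) continue; // no ref on this line, e.g. the heading
    entries.push({ role: match[1], name: match[2] ?? null, ref: match[3] });
  }
  return entries;
}

const sampleTree = [
  '- heading "Welcome back" [level=1]',
  '- textbox "Email" [ref=e4]',
  '- textbox "Password" [ref=e5]',
  '- button "Sign in" [ref=e6]',
  '- link "Forgot password?" [ref=e7]',
].join('\n');

const refs = extractRefs(sampleTree);
console.log(refs.map((entry) => entry.ref)); // [ 'e4', 'e5', 'e6', 'e7' ]
```

A tool can then reject a ref that is not in that list before it ever reaches the browser.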

This is the same structured-snapshot approach Microsoft's Playwright MCP server takes: let the model act on accessibility refs, not on pixels or guesses.

aria-ref is a first-class Playwright locator

The reason refs work is that Playwright resolves them directly. aria-ref=e6 is a real locator engine, not something we built. So the click tool prefers the ref and only falls back to a description when it has none:

execute: async ({ ref, description }) => {
  const refStr = ref?.trim() || null;
  const text = description?.trim() || null;

  if (!refStr && !text) {
    throw new Error('click requires either ref or description');
  }

  const locator = refStr
    ? page.locator(`aria-ref=${refStr}`)        // stable: resolves against the snapshot
    : await resolveLocator(page, text!);         // fallback: fuzzy, best-effort

  await locator.click();
  // ...
}

The ref path is stable because it is resolved against the exact snapshot the model just read, not re-derived from a phrase. Same idea for fill, select, and getText. Every interaction tool takes ref first and description second.
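
Since every tool applies the same ordering, the ref-or-description decision can live in one place. This is a minimal sketch of such a shared helper. The pickTarget name and the Target type are assumptions for illustration; the only Playwright detail it relies on is the aria-ref= selector prefix used above.

```typescript
type TargetInput = { ref?: string | null; description?: string | null };

type Target =
  | { kind: 'ref'; selector: string }      // pass to page.locator(selector)
  | { kind: 'description'; text: string }; // pass to the fuzzy fallback

// Hypothetical helper shared by click, fill, select, and getText.
function pickTarget(toolName: string, input: TargetInput): Target {
  const ref = input.ref?.trim() || null;
  const text = input.description?.trim() || null;

  // Ref wins whenever it is present, even if a description came with it.
  if (ref) return { kind: 'ref', selector: `aria-ref=${ref}` };
  if (text) return { kind: 'description', text };
  throw new Error(`${toolName} requires either ref or description`);
}

const target = pickTarget('click', { ref: ' e6 ', description: 'the Sign in button' });
console.log(target); // { kind: 'ref', selector: 'aria-ref=e6' }
```

Returning a tagged union rather than a locator keeps the helper free of browser state, so the ordering rule can be unit-tested without launching a page.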

The model needs rules, not just tools

Tools that accept a ref are not enough. The model will still reach for a description if you let it, because describing things in English is what language models love to do. So the rules you give it have to make the ordering non-negotiable:

  • Read the accessibility tree before touching a page it has not seen.
  • On every action, prefer the ref over a description.
  • After a failed action, read the tree again for fresh refs instead of rewording the description.

That last rule is the one that earns its place. The instinct of a language model after a failed action is to try a more elaborate description. That is exactly the wrong move, because the description was never the reliable path. Re-reading the tree gives it fresh refs that match the current DOM, which is what actually changed.
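
The rules above live in the prompt. If you also want the tool layer to back them up, a small piece of state is enough. The following is a sketch under that assumption, not something this article's implementation is known to do; SnapshotGate and its method names are hypothetical.

```typescript
// Hypothetical guard: tracks whether the model has a current view of the page.
class SnapshotGate {
  private fresh = false;

  markSnapshot(): void { this.fresh = true; }    // getAccessibilityTree just ran
  markNavigation(): void { this.fresh = false; } // new page: old refs are void
  markFailure(): void { this.fresh = false; }    // failed action: re-read first

  // Returns a message for the model, or null when the action may proceed.
  check(hasRef: boolean): string | null {
    if (this.fresh) return null;
    return hasRef
      ? 'Refs may be stale. Call getAccessibilityTree and use a fresh ref.'
      : 'Call getAccessibilityTree before acting on this page.';
  }
}

const gate = new SnapshotGate();
const beforeSnapshot = gate.check(false); // blocked: page not read yet
gate.markSnapshot();
const afterSnapshot = gate.check(true);   // allowed
gate.markFailure();
const afterFailure = gate.check(true);    // blocked until the tree is re-read
```

Returning the message as a tool result, rather than throwing, gives the model a concrete next step instead of an opaque failure.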

When there is no ref, fall back deliberately

Refs are not always available. The model might be acting on something it inferred rather than something in the last snapshot. So resolveLocator is a deliberate ladder, not a single guess. For each candidate phrase it tries role, then label, then placeholder, then text, and takes the first one that is actually visible:

for (const phrase of phrases) {
  if (roleHint) {
    const roleLocator = page.getByRole(roleHint, { name: phrase, exact: false });
    if (await isVisible(roleLocator)) return roleLocator;
  }

  const labelLocator = page.getByLabel(phrase, { exact: false });
  if (await isVisible(labelLocator)) return labelLocator;

  const placeholderLocator = page.getByPlaceholder(phrase, { exact: false });
  if (await isVisible(placeholderLocator)) return placeholderLocator;

  const textLocator = page.getByText(phrase, { exact: false });
  if (await isVisible(textLocator)) return textLocator;
}

throw new Error(`Could not find a visible element for description: ${description}`);

isVisible is a 5-second waitFor({ state: 'visible' }) wrapped in a try/catch, so a candidate that exists but is hidden does not win. The phrase extraction pulls quoted substrings out of the description first (click the button labeled "Place order" yields Place order), so the model's verbosity does not poison the match.
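
Both helpers are short. Here is a sketch of each, with two assumptions worth flagging: extractPhrases falls back to the whole description after the quoted substrings (the text above only says quoted substrings come first), and WaitableLocator is a structural stand-in for the one Playwright Locator method the helper calls.

```typescript
// Quoted substrings first, then (assumption) the whole description as a last resort.
function extractPhrases(description: string): string[] {
  const phrases: string[] = [];
  const quotedPattern = /"([^"]+)"/g;
  let match: RegExpExecArray | null;
  while ((match = quotedPattern.exec(description)) !== null) {
    const phrase = match[1].trim();
    if (phrase && !phrases.includes(phrase)) phrases.push(phrase);
  }
  const whole = description.trim();
  if (whole && !phrases.includes(whole)) phrases.push(whole);
  return phrases;
}

// Structural stand-in for the part of a Playwright Locator used here.
type WaitableLocator = {
  waitFor(options: { state: 'visible'; timeout: number }): Promise<void>;
};

// True only if the candidate becomes visible within the timeout.
async function isVisible(locator: WaitableLocator, timeoutMs = 5000): Promise<boolean> {
  try {
    await locator.waitFor({ state: 'visible', timeout: timeoutMs });
    return true;
  } catch {
    return false; // hidden, missing, or otherwise unusable: this candidate loses
  }
}
```

For the description click the button labeled "Place order", extractPhrases returns Place order first and the full sentence second, so the ladder tries the precise phrase before the verbose one.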

This is the fuzzy path, and we treat it as such. It is good enough to recover, and it is exactly why we want the model on refs whenever it can be.

Don't fail with "element not found"

When even the fallback misses, the worst thing you can return is a bare "element not found." The model has nothing to act on and will flail. So a failed click collects diagnostics about what the page actually contains and returns them with the error:

const diagnostics = await collectClickDiagnostics(page, text!);
throw new Error(`${getErrorMessage(error)}. Diagnostics: ${JSON.stringify(diagnostics)}`);

collectClickDiagnostics counts how many elements matched by role, by label, and by text, and includes a sample of the page's links:

return {
  description,
  roleHint: roleMatch?.role ?? null,
  roleCount,    // e.g. 0 buttons matched
  labelCount,   // e.g. 0 labels matched
  textCount,    // e.g. 3 text nodes matched
  sampleLinks: linkSamples,
  currentUrl: page.url(),
};

Now the failure is legible. textCount: 3, roleCount: 0 tells the model (and us, in the trace) that the thing it called a button is really three pieces of text, so it should re-read the tree and target a real interactive element. The recovery loop closes because the error carries enough to act on.
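
The counting itself can be a few calls to count(). The sketch below shows one plausible way to gather those numbers; the body of collectClickDiagnostics is not shown in this article, so this is a guess at the shape, not the real thing. It takes the role hint as a parameter, leaves out link sampling, and uses structural stand-ins (PageLike, Countable) for the Playwright methods it touches.

```typescript
// Structural stand-ins for the few Playwright methods this sketch calls.
type Countable = { count(): Promise<number> };
type PageLike = {
  url(): string;
  getByRole(role: string, options: { name: string; exact: boolean }): Countable;
  getByLabel(text: string, options: { exact: boolean }): Countable;
  getByText(text: string, options: { exact: boolean }): Countable;
};

type ClickDiagnostics = {
  description: string;
  roleHint: string | null;
  roleCount: number;
  labelCount: number;
  textCount: number;
  currentUrl: string;
};

// Hypothetical sketch: how the role hint is derived is assumed to happen elsewhere.
async function sketchClickDiagnostics(
  page: PageLike,
  description: string,
  roleHint: string | null,
): Promise<ClickDiagnostics> {
  const roleCount = roleHint
    ? await page.getByRole(roleHint, { name: description, exact: false }).count()
    : 0;
  const labelCount = await page.getByLabel(description, { exact: false }).count();
  const textCount = await page.getByText(description, { exact: false }).count();
  return { description, roleHint, roleCount, labelCount, textCount, currentUrl: page.url() };
}
```

Counting never throws on multiple matches, which makes it a safe probe to run after a click has already failed.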

There is also a small specialization for links: if the model meant to click a link and the locator missed, we look up the href by matching link text or aria-label and navigate directly, which sidesteps a whole class of overlay-and-intercept clicks.
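
A sketch of that lookup, assuming the links have already been read off the page into plain objects. LinkInfo and findLinkHref are hypothetical names, and the restriction to http and https targets is an added safety assumption rather than something described above.

```typescript
type LinkInfo = { text: string; ariaLabel: string | null; href: string | null };

// Hypothetical helper: find a navigable URL for the link the model meant.
function findLinkHref(links: LinkInfo[], phrase: string, currentUrl: string): string | null {
  const needle = phrase.trim().toLowerCase();
  if (!needle) return null;

  for (const link of links) {
    if (!link.href) continue;
    const text = link.text.trim().toLowerCase();
    const label = (link.ariaLabel ?? '').trim().toLowerCase();
    if (!text.includes(needle) && !label.includes(needle)) continue;

    let url: URL;
    try {
      url = new URL(link.href, currentUrl); // resolves relative hrefs
    } catch {
      continue; // unparseable href: try the next link
    }
    // Assumption: only follow ordinary web links, never javascript: or mailto:.
    if (url.protocol === 'http:' || url.protocol === 'https:') return url.toString();
  }
  return null;
}

const sampleLinks: LinkInfo[] = [
  { text: 'Pricing', ariaLabel: null, href: '/pricing' },
  { text: '', ariaLabel: 'Open docs', href: 'https://docs.example.com/' },
];
const pricingUrl = findLinkHref(sampleLinks, 'pricing', 'https://example.com/login');
console.log(pricingUrl); // https://example.com/pricing
```

The caller can hand the result to page.goto, which never has to fight an overlay for the click.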

The trade-offs, honestly

This is reliable element targeting, not a deterministic agent. Two limits worth stating plainly:

  • Refs are only valid for the snapshot you took. After a navigation or a DOM change, e6 may point at nothing or at the wrong node. That is why the prompt forces a fresh getAccessibilityTree after failures and on new pages. Treat refs as per-snapshot, not durable.
  • Snapshots cost tokens. An accessibility tree of a content-heavy page can be large, and feeding one to the model after every navigation adds up fast. We wrote about that cost in detail in "What Works (and What Breaks) Running Playwright MCP in Claude Code," where a single snapshot can run to tens of thousands of tokens. The reliability is worth it for us, but it is a real line on the bill, which is why we do not re-snapshot on every step, only when the page is new or something failed.

And the model still decides what to do. Refs make sure that when it decides to click the Sign in button, it clicks that button and not the footer link with the same name. They do not stop it from deciding to click the wrong thing in the first place. That is a different problem, solved with a separate evaluation pass.

If you are building an LLM browser agent

The one idea to take away: do not let the model emit selectors. A selector invented from an English phrase is regenerated every run and rarely resolves to one element. Instead:

  1. Give the model a structured snapshot of the page with stable IDs (page.ariaSnapshot({ mode: 'ai' })).
  2. Make every action tool take an ID first and a description only as a fallback (page.locator('aria-ref=eN')).
  3. Enforce snapshot-then-act in the system prompt, and re-snapshot on failure instead of rewording.
  4. Return rich diagnostics on a miss so the model can recover instead of guessing.

That sequence is what moved our agent from "passes the demo" to "passes on the second run, and the hundredth."

Try it on your own app

We run this in production at Smoketest. You describe the flows that matter (login, checkout, onboarding, billing), and we run them in a real browser after every deploy and tell you what broke. No Playwright suite for you to own or maintain. Take a look at smoketest.sh.
