🧩 Runtime Snapshots #16 — The Three Architectures of Browser Agents

Every team building an AI agent for the browser is making one architectural choice — whether they realize it or not. They're choosing how the LLM perceives the page. That choice cascades into everything else: cost per action, reliability on real-world apps, what gets banned by anti-bot systems, what kinds of tasks are even feasible.

The choice currently breaks down into three approaches. They're often discussed as if they're competing technologies. They aren't. They operate at different abstraction layers, with different tradeoffs, suitable for different jobs. Mixing them up is the source of a lot of recent disappointment with "browser agents that don't work."

This post lays out the three cleanly: what each one actually sees, where each shines, where each collapses.

Approach 1: Vision-Based Perception

The agent receives a screenshot of the browser viewport. A vision-capable LLM identifies elements visually and responds with coordinates. A controller moves the mouse and clicks, types into focused fields, scrolls.

Anthropic's Computer Use is the most prominent implementation of this approach. OpenAI's Operator works similarly. Both are general — they don't know they're operating a browser specifically; they're operating a screen.

What the LLM sees:

{
  "image": "<base64 screenshot, 1280x720>",
  "task": "Find the cancel subscription button"
}
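
To make the action side concrete, here's a minimal sketch of the controller loop in TypeScript, assuming a Playwright-driven browser and a hypothetical askVisionModel call that returns one coordinate-level action per screenshot — the names and the action shape are illustrative, not any vendor's actual API:

import { chromium } from "playwright";

// Hypothetical action format returned by a vision-capable model.
type VisionAction =
  | { kind: "click"; x: number; y: number }
  | { kind: "type"; text: string }
  | { kind: "scroll"; deltaY: number }
  | { kind: "done" };

// Placeholder: sends the screenshot plus the task, gets one action back.
declare function askVisionModel(pngBase64: string, task: string): Promise<VisionAction>;

async function runVisionAgent(task: string): Promise<void> {
  const browser = await chromium.launch();
  // Fixed viewport: coordinate actions only make sense against consistent rendering.
  const page = await browser.newPage({ viewport: { width: 1280, height: 720 } });
  await page.goto("https://example.com");

  for (let step = 0; step < 30; step++) {
    // Perception: one screenshot per step — this is where the vision-token cost accrues.
    const shot = (await page.screenshot({ type: "png" })).toString("base64");
    const action = await askVisionModel(shot, task);

    if (action.kind === "done") break;
    if (action.kind === "click") await page.mouse.click(action.x, action.y);
    if (action.kind === "type") await page.keyboard.type(action.text);
    if (action.kind === "scroll") await page.mouse.wheel(0, action.deltaY);
  }
  await browser.close();
}

The fixed viewport is load-bearing, which is exactly the resolution sensitivity listed below.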

Where it works well:

Anywhere. Literal universal coverage — if a human can see it, the LLM can attempt it. Works without any installation on the user's side; the agent runs the browser. Robust to weird custom UI: canvas-rendered apps, Electron apps, video games, Flash relics, image-heavy interfaces. Doesn't care about the framework or the markup. Pixels are pixels.

Where it breaks down:

  • Cost. Vision tokens are expensive. A single screenshot can be 1000+ tokens, and agents rarely solve tasks in one screenshot. In our benchmarking, multi-step browsing tasks easily run into tens of thousands of vision tokens per session — an order of magnitude more than structured-representation alternatives. At scale this is a real economic constraint.
  • Resolution sensitivity. A button at 1280×720 isn't the same button at 1920×1080. Coordinate-based action requires consistent rendering, which means controlling the entire browser environment.
  • State opacity. Vision sees the rendered surface, not the underlying state. Forms with hidden fields, dynamic content, ARIA-only elements — the model infers everything from pixels.
  • No authenticated context. The agent runs a fresh browser. It doesn't have your sessions, saved passwords, or trusted-device cookies. Every task starts from a logged-out state, which is a significant constraint for anything involving personal accounts.
  • Detection profile. A driven browser is detectable. Most consumer sites use behavioral fingerprinting services (Cloudflare, DataDome, PerimeterX) that flag automation regardless of visual fidelity. The agent looks human visually, but the underlying browser doesn't move like a human.

Vision-based agents are best suited for arbitrary computer tasks where breadth matters more than economy — where the user can dedicate compute, and where authentication isn't required (or is set up manually at session start).

Approach 2: Accessibility-Tree Perception

The agent reads the browser's accessibility tree — the same data structure screen readers use for blind users. ARIA roles, accessible names, focusable elements. The LLM receives a structured representation of the UI, not a screenshot.

This is the path most current "browser MCP" projects have converged on. Browser Use, MultiOn, and the wave of MCP servers shipped through 2025 mostly read the AXTree (Chrome's accessibility tree, exposed via the DevTools Protocol).

What the LLM sees:

[1] button "Submit"
[2] textbox "Email", required
[3] link "Forgot password?", url="/reset"
[4] heading "Sign in"
[5] image "Logo"

Cleaner than a screenshot, much cheaper in tokens, structured. The LLM reasons about elements by index; the controller dispatches actions to those indices.
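
To make the indexing step concrete, here's a minimal sketch using Puppeteer's accessibility snapshot (one of several ways to read Chrome's AXTree); the flattening and the [n] numbering scheme are illustrative:

import puppeteer from "puppeteer";
import type { SerializedAXNode } from "puppeteer";

// Flatten the AXTree into the indexed element list the LLM reasons over.
function flatten(node: SerializedAXNode, out: string[] = []): string[] {
  const interesting = ["button", "textbox", "link", "heading", "image", "checkbox", "combobox"];
  if (interesting.includes(node.role)) {
    const required = node.required ? ", required" : "";
    out.push(`[${out.length + 1}] ${node.role} "${node.name ?? ""}"${required}`);
  }
  for (const child of node.children ?? []) flatten(child, out);
  return out;
}

async function perceive(url: string): Promise<string> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Chrome's accessibility tree — the same structure exposed to screen readers.
  const tree = await page.accessibility.snapshot();
  const lines = tree ? flatten(tree) : [];
  await browser.close();
  return lines.join("\n"); // hundreds of tokens where a screenshot would cost thousands
}

The controller keeps a parallel map from index to underlying node, so "click element [3]" resolves to a concrete target.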

Where it works well:

  • Cheap. A page that costs 5,000 vision tokens might be 500 accessibility-tree tokens. Order-of-magnitude difference.
  • Simple agent loop. "Click element [3]" is much easier to reason about than "click at coordinate (847, 432)".
  • Excellent for accessible sites. Government portals built to accessibility standards (US federal sites, EU public services done right), modern React apps with proper ARIA, content sites — these work cleanly.
  • Resolution-independent. Same tree at any viewport.

Where it breaks down:

  • The accessibility tree was not designed for AI agents. It was designed in the 1990s for screen readers operating on much simpler web pages. Modern web apps frequently have terrible A11Y compliance — interactive elements without roles, <div>s that should be buttons, custom controls that don't expose state. The tree under-represents the page badly on a large fraction of real-world sites.
  • Legacy enterprise UI is invisible. ASP.NET WebForms with postbacks, jQuery-era admin panels, banking interfaces, CRM systems, government portals not built to modern standards — these often render fine visually but have a near-empty accessibility tree. The LLM sees nothing where the user sees everything.
  • State leakage. Accessibility trees are snapshots of declared metadata, not runtime state. A multi-tab app, a modal that just opened, a value that was just typed — the tree may or may not reflect this depending on framework. Race conditions are common.
  • Detection. Accessibility-tree agents typically run via Puppeteer, Playwright, or a Chromium-driven instance. Anti-bot systems detect this profile reliably. Major consumer sites (Google products, social platforms, e-commerce) increasingly block these sessions outright.
  • Authentication. Same problem as vision-based: usually a fresh browser, no real session.

Accessibility-tree agents are best suited for modern, A11Y-compliant sites where the user doesn't need to be authenticated as themselves — which is a smaller slice of useful work than the marketing implies.

Approach 3: Runtime Structural Perception

The agent reads the live DOM directly — the actual rendered structure as it exists at the moment of perception, including computed styles, event handlers, form state, and semantics inferred from the markup itself rather than from declared accessibility metadata.

This approach typically runs as a browser extension in the user's normal authenticated session. The DOM is captured in-process, structurally compressed, and forwarded to the LLM as a representation that's denser than AXTree but vastly cheaper than vision.

What the LLM sees:

form#login (action=/auth, method=post)
  input[email] "user@example.com"
  input[password] required
  button[submit] "Sign in"
  a "Forgot password?" → /reset
  div.error.hidden "Invalid credentials"

A snapshot of the rendered runtime, with semantic structure inferred from the markup itself rather than from what someone remembered to annotate. The hidden error div is visible to the agent because it exists in the DOM, even though it's not displayed to the user yet.
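
Here's a minimal sketch of that capture step, written as an extension content script; the serialization mirrors the snippet above and is an assumption about format, not a standard — a production implementation does far more compression and normalization:

// content-script.ts — runs inside the user's own authenticated page, not a separate browser.

const INTERACTIVE = new Set(["FORM", "INPUT", "BUTTON", "SELECT", "TEXTAREA", "A"]);

function describe(el: Element): string {
  if (el instanceof HTMLInputElement) {
    // Runtime state: the current value, not just the declared attribute.
    const required = el.required ? " required" : "";
    return `input[${el.type}]${required} "${el.value}"`;
  }
  if (el instanceof HTMLAnchorElement) {
    return `a "${el.textContent?.trim() ?? ""}" → ${el.getAttribute("href") ?? ""}`;
  }
  if (el instanceof HTMLFormElement) {
    return `form#${el.id} (action=${el.getAttribute("action") ?? ""}, method=${el.method})`;
  }
  return `${el.tagName.toLowerCase()} "${el.textContent?.trim().slice(0, 40) ?? ""}"`;
}

function snapshot(root: Element, depth = 0, out: string[] = []): string[] {
  for (const el of Array.from(root.children)) {
    const hidden = getComputedStyle(el).display === "none";
    if (INTERACTIVE.has(el.tagName) || hidden) {
      // Hidden nodes are kept: they exist in the DOM even if not yet displayed to the user.
      out.push("  ".repeat(depth) + describe(el) + (hidden ? " [hidden]" : ""));
    }
    snapshot(el, INTERACTIVE.has(el.tagName) ? depth + 1 : depth, out);
  }
  return out;
}

// Forward the structural snapshot to the agent from inside the user's session.
chrome.runtime.sendMessage({ kind: "dom-snapshot", url: location.href, lines: snapshot(document.body) });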

Where it works well:

  • Real apps. ASP.NET postbacks, jQuery admin panels, government portals, banking interfaces, custom enterprise UI — all render their structure to the DOM regardless of accessibility compliance. Runtime perception sees what's actually there.
  • Token-efficient. Comparable to or cheaper than accessibility trees, because the captured DOM can be aggressively compressed — stripping presentational classnames, collapsing identical sibling subtrees, indexing nodes by interactive role rather than by markup hierarchy. The result preserves structural truth while dropping rendering noise (a sketch of one such pass follows this list).
  • Authentication-native. The extension runs in the user's session. Cookies, OAuth tokens, MFA-completed states, saved logins — all available, because the agent isn't a separate browser. It's a peripheral to the user's own browser.
  • Detection profile. No external automation. The browser fingerprint, the network traffic, the timing patterns — all match a real user, because there is a real user. The agent assists; it doesn't impersonate.
  • State accuracy. What the DOM says is what the page is, including hidden states, form values, dynamically-added elements, and content behind dropdowns. This is the runtime ground truth.
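
To ground the compression point from the list above, here's one illustrative pass — collapsing runs of structurally identical siblings — assuming captured nodes have already been reduced to simple tag-plus-children records (the SnapNode type is hypothetical):

type SnapNode = { tag: string; text?: string; children: SnapNode[] };

// Structural fingerprint: ignores text content, so fifty identical table rows
// or product cards all hash to the same shape.
function shape(n: SnapNode): string {
  return `${n.tag}(${n.children.map(shape).join(",")})`;
}

// Collapse each run of identically shaped siblings into one representative plus a count.
function collapseSiblings(n: SnapNode): SnapNode {
  const children: SnapNode[] = [];
  let i = 0;
  while (i < n.children.length) {
    let run = 1;
    while (i + run < n.children.length && shape(n.children[i + run]) === shape(n.children[i])) run++;
    const representative = collapseSiblings(n.children[i]);
    children.push(
      run > 1
        ? { ...representative, text: `${representative.text ?? ""} [×${run} similar]`.trim() }
        : representative
    );
    i += run;
  }
  return { ...n, children };
}

The other passes work the same way: presentational classnames are dropped, and interactive nodes get stable indices so the LLM can address them directly.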

Where it breaks down:

  • Requires user installation. An extension is a friction step. Universal coverage at the level of "anyone can use it without setup" isn't possible — the user has to install something.
  • Browser-only. Doesn't help with desktop apps, mobile native apps, or anything outside the browser. Vision-based approaches remain the only path for those.
  • Canvas and exotic rendering. A page that renders to <canvas> (some games, some data viz, some custom editors) is opaque to DOM perception. Same problem as accessibility trees, sometimes worse — vision is the only fallback.
  • Schema design matters. Capturing the DOM is easy; capturing it in a representation that's both compact and faithful is non-trivial. Bad implementations look like raw HTML dumps. Good ones look like a structural index of the page. The quality of this representation is the engineering work, and there's a wide quality spread between projects claiming to do this.

Runtime structural perception is best suited for browser-bound tasks where the user is the user — handling their own email, their own bank, their own work tools — and where economy and reliability matter more than universal coverage.

How to Pick

The three approaches aren't substitutes. They serve different jobs:

Vision-based — when the agent operates the entire computer, no extension is feasible, and the user accepts vision-token costs for breadth. Best for: research agents, autonomous task execution on shared/remote machines, general "Computer Use" scenarios.

Accessibility-based — when target sites are modern and A11Y-compliant, no authenticated session is needed, and per-action cost must be minimized. Best for: research and scraping at scale, public-information lookups, content-site automation.

Runtime structural — when the user wants AI assistance with their own browser, on real-world apps, with their own credentials, at low marginal cost per action. Best for: personal-context productivity, enterprise internal tooling, government and banking interfaces.

A serious browser-agent product probably uses more than one. The interesting engineering question is how to pick which perception layer for which task — and how to compose them when one doesn't suffice.
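
One way to frame that composition in code — purely illustrative, assuming each perception layer exposes a common interface and a cheap availability probe:

type Perception = { source: "dom" | "axtree" | "vision"; payload: string };

interface Perceiver {
  // e.g. extension installed, page not canvas-only, target inside a browser at all
  available(): Promise<boolean>;
  perceive(): Promise<Perception>;
}

// Layers ordered by cost and fidelity for the task at hand; vision is the universal
// fallback when structure is absent (canvas rendering, non-browser surfaces).
async function compose(layers: Perceiver[]): Promise<Perception> {
  for (const layer of layers) {
    if (await layer.available()) return layer.perceive();
  }
  throw new Error("no perception layer available for this surface");
}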

Why This Matters

Most public discourse about browser agents currently bundles all three together under labels like "AI browser automation" or "browser MCP." This is a category error that's costing the field clarity.

A team building runtime structural perception inherits the reputation of A11Y-based tools that get banned and break on real apps. An A11Y-based tool inherits the cost concerns of vision-based agents. The architectural choices are getting evaluated on the wrong axes.

The taxonomy above is an attempt to name what's already happening underneath the marketing. The three approaches exist. They have measurably different cost curves, different reliability profiles on different site types, and different relationships to authentication and anti-bot infrastructure. Confusing them produces bad architectural decisions and worse user experiences.

Whether you're building, picking, or evaluating a browser agent — start by being explicit about which of the three you're committed to, and why. The rest of the design follows from that choice more than from any other.


e2llm.com

This is part 16 of the Runtime Snapshots series — exploring how structured browser data changes the way we build, test, and ship software. We've spent the last ~18 months building in the third category, and the engineering details behind it — schema design, snapshot semantics, why most implementations don't actually work — live in the earlier entries of this series.

If you've moved between approaches — what triggered the migration?
