There's a growing class of AI agents that don't browse the web autonomously — they live inside web applications. Figma AI lives inside the design tool. Notion AI lives inside the document editor. GitHub Copilot lives inside the IDE. And increasingly, enterprise SaaS platforms are embedding AI assistants directly into their dashboards.
These in-app agents face a problem that standalone chatbots don't: the user is staring at a complex UI — data tables, status badges, filters, modals — and expects the agent to see it too. Not a screenshot. Not a URL. The actual semantic content of what's on screen.
We hit this building an AI assistant for a large enterprise admin dashboard with dozens of page types. The assistant knew which page the user was on (via the URL), but had zero visibility into what was displayed on that page. The user asks "which items have errors?" while looking at a table of 25 items — and the agent has no idea.
The breakthrough came from an unlikely — but in hindsight, obvious — source: accessibility trees.
The Representation Problem
Before diving into the solution, let's frame the problem. You need to give an LLM a representation of what's currently on a web page. Your options:
Raw HTML — Feeding the DOM directly to an LLM is like handing someone the source code of a novel and asking them to summarize the plot. A typical admin dashboard page is ~150 KB of HTML. That's ~37,500 tokens — most of it CSS classes, data attributes, wrapper divs, and structural noise the model has to wade through.
Screenshots — Vision models have gotten remarkably good at reading pages, but they still struggle with dense data tables (25+ rows), and they can't see what's in the DOM: ARIA attributes like haspopup and expanded, disabled states that aren't visually distinct, or which element has focus. They also cost ~1,500-3,000 tokens per image and add capture latency. For data-heavy enterprise UIs, text representations are more reliable and cheaper.
Markdown conversion — Tools like Turndown can convert HTML to Markdown. It's readable, but you lose all interactive state (which button is disabled? which tab is selected? which checkbox is checked?). And it still runs 3-5x more tokens than what we ended up with.
DOM-to-JSON — Serializing the DOM tree to JSON preserves structure but is absurdly verbose. Even pruned, a typical page produces ~20-50 KB of JSON. The LLM has to navigate nested objects full of <div> wrappers that carry zero semantic meaning.
The winner? None of the above. The answer was hiding in plain sight — in the same tree that screen readers have used for decades.
Why Accessibility Trees
An accessibility tree is a parallel representation of the DOM that browsers maintain for screen readers. Its entire purpose is to answer the question: how do you describe this page to someone who can't see it? It strips away visual styling and structural noise, keeping only what matters: roles, names, states, and values.
That's exactly the question we're asking on behalf of an LLM. Screen readers and language models need the same thing — a compact, semantic, text-based description of what's on screen. The a11y tree has been solving this problem for decades. We just pointed it at a different consumer.
Here's what our a11y tree output looks like for an items catalog page:
[heading level=1] Catalog
[tab selected] All
[tab] Unpublished (39746)
[table]
[row] [columnheader] Item Name | [columnheader] SKU | [columnheader] Status | [columnheader] Price
[row] [cell] Laptop Pro 15 | [cell] sku-0082 | [cell] Unpublished | [cell] $299.00
[row] [cell] Widget B | [cell] sku-1234 | [cell] Published | [cell] $49.99
[status] Showing 1-25 of 114,827 items
That's ~750-1,250 tokens for a page that would be ~37,500 tokens as raw HTML — a 30-50x reduction that loses nothing the LLM actually needs.
This isn't a novel idea. Playwright MCP uses accessibility snapshots for its browser_snapshot tool. Claude's Chrome extension uses a DOM walker for its read_page tool. AgentOccam showed that plain a11y trees match or beat vision-augmented approaches on web agent benchmarks. The a11y tree is emerging as the standard representation for giving LLMs page comprehension — and for good reason.
But most implementations stop at "just use the a11y tree." What follows isn't a collection of novel inventions — role classification is how browsers already compute the tree, name resolution is the W3C spec, and table deduplication is common sense once you see the token waste. Individually, none of these are surprising. But nobody writes down the full list of things you actually need to handle before an a11y tree works reliably in production. We learned each of these the hard way, so you don't have to.
Trick 1: Role Classification Eliminates Div Soup
Modern web apps are drowning in wrapper divs. A single button might be nested 10-20 levels deep:
<div class="wrapper">
  <div class="inner">
    <div class="container">
      <div class="btn-group">
        <button>Edit Item</button>
        <button>Delete Item</button>
      </div>
    </div>
  </div>
</div>
If you naively walk the DOM and emit every element, your output is mostly indentation and empty wrapper lines. The fix is a three-way role classification system:
Leaf roles (terminal nodes — extract name, stop recursing): button, link, textbox, checkbox, radio, img, switch, slider, menuitem
Container roles (structural — recurse into children): table, dialog, navigation, form, grid, tablist, menu, region
Transparent containers (invisible wrappers — skip element, promote children): any <div>, <span>, or element with no semantic role
With transparent container promotion, the deeply nested example above collapses to:
[button] Edit Item
[button] Delete Item
The leaf/container distinction is equally important. When the walker hits a <button>, it extracts the button's accessible name and stops. It doesn't descend into the button's inner <span> + <svg> icon structure to produce noise. The accessible name already captures what matters.
This three-way classification is what lets us fit a complex admin page into ~500 nodes and ~1,000 tokens. Without it, you'd blow through any reasonable token budget on structural noise alone.
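As a sketch, the whole classification walk fits in a few dozen lines. Here, `UINode` and `IMPLICIT_ROLES` are illustrative stand-ins (a real walker reads live DOM elements and the full HTML-to-ARIA implicit role mapping); the three-way branch is the point:

```typescript
// Simplified node shape standing in for a DOM element (illustrative only).
interface UINode {
  tag: string;
  role?: string;        // explicit ARIA role, if any
  name?: string;        // accessible name
  children?: UINode[];
}

const LEAF_ROLES = new Set([
  'button', 'link', 'textbox', 'checkbox', 'radio',
  'img', 'switch', 'slider', 'menuitem',
]);
const CONTAINER_ROLES = new Set([
  'table', 'dialog', 'navigation', 'form', 'grid', 'tablist', 'menu', 'region',
]);
// Tiny subset of implicit HTML-to-ARIA roles, for illustration only.
const IMPLICIT_ROLES: Record<string, string> = {
  button: 'button', a: 'link', nav: 'navigation', table: 'table', img: 'img',
};

function roleOf(node: UINode): string | undefined {
  return node.role ?? IMPLICIT_ROLES[node.tag];
}

function walk(node: UINode, depth = 0, out: string[] = []): string[] {
  const role = roleOf(node);
  if (role && LEAF_ROLES.has(role)) {
    // Leaf: emit the accessible name and stop recursing (no inner span/svg noise).
    out.push(`${'  '.repeat(depth)}[${role}] ${node.name ?? ''}`.trimEnd());
  } else if (role && CONTAINER_ROLES.has(role)) {
    // Container: emit the role, then recurse one level deeper.
    out.push(`${'  '.repeat(depth)}[${role}]`);
    for (const c of node.children ?? []) walk(c, depth + 1, out);
  } else {
    // Transparent wrapper: skip it and promote children at the same depth.
    for (const c of node.children ?? []) walk(c, depth, out);
  }
  return out;
}
```

Running this over the nested wrapper example produces just the two `[button]` lines: every wrapper div falls into the transparent branch and contributes nothing to the output.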
Trick 2: Name Resolution — Six Fallbacks Before Giving Up
Determining what to call each element is harder than it sounds. We use a priority chain that mirrors the W3C accessible name computation:
| Priority | Source | Example |
|---|---|---|
| 1 | aria-label | <button aria-label="Close dialog">X</button> -> "Close dialog" |
| 2 | aria-labelledby | References another element's text by ID |
| 3 | alt attribute | <img alt="Product photo"> -> "Product photo" |
| 4 | <label for="..."> | Associated form label |
| 5 | placeholder | Input placeholder text |
| 6 | title attribute | Tooltip fallback |
| 7 | Text content | <button>Save Changes</button> -> "Save Changes" |
The good news: most well-built apps work fine at priority 7. Buttons have visible text, headings have visible text, links have visible text. ARIA attributes are enhancements, not requirements. If your app uses a component library with semantic HTML (<button>, <table>, <h2>, <a href>), the walker assigns correct roles automatically — no ARIA needed.
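The chain itself is just ordered nullish fallbacks. A minimal sketch, where each field holds the already-resolved text a real walker would read from the DOM (e.g. `labelledByText` is the text of the element referenced by aria-labelledby):

```typescript
// Illustrative input shape; field names are ours, not a DOM API.
interface NameSources {
  ariaLabel?: string;
  labelledByText?: string;  // resolved aria-labelledby target text
  alt?: string;
  labelText?: string;       // associated <label for="..."> text
  placeholder?: string;
  title?: string;
  textContent?: string;
}

// Priority chain mirroring the table above (W3C accessible name computation).
function accessibleName(src: NameSources): string | null {
  return (
    src.ariaLabel ??
    src.labelledByText ??
    src.alt ??
    src.labelText ??
    src.placeholder ??
    src.title ??
    (src.textContent?.trim() || null)  // last resort: visible text
  );
}
```

For example, `accessibleName({ ariaLabel: 'Close dialog', textContent: 'X' })` yields "Close dialog": the ARIA label wins over the visible "X".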
Trick 3: Visual Cue Annotations
Here's a gap that surprised us. The a11y tree captures semantic structure perfectly — but it doesn't capture visual presentation. During testing, a user asked "what is this blue alert?" about an info banner. The LLM couldn't identify it because the a11y tree rendered it as plain text with no color or severity metadata.
The same problem hits status badges ("Published" is green, "Error" is red), highlighted rows, and icon-only indicators. The user sees color-coded meaning; the LLM sees flat text.
The solution: a static map of CSS selectors to semantic annotations, checked via Element.matches() during the tree walk.
const VISUAL_CUE_MAP: Array<[string, string]> = [
  ['.alert-info, .banner--info', 'visual=blue-info-banner'],
  ['.alert-warning, .banner--warning', 'visual=yellow-warning-banner'],
  ['.alert-error, .banner--error', 'visual=red-error-banner'],
  ['.badge--success, .status--active', 'visual=green-badge'],
  ['.badge--error, .status--error', 'visual=red-badge'],
];
function getVisualCue(el: Element): string | null {
  for (const [selector, annotation] of VISUAL_CUE_MAP) {
    if (el.matches(selector)) return annotation;
  }
  return null;
}
The enriched output:
[region visual=blue-info-banner] This item requires attention. Review the listing.
[cell visual=green-badge] Published
[cell visual=red-badge] Error
Zero runtime cost (CSS selector matching is near-instant), fully deterministic, and the LLM prompt can explain what each annotation means. The trade-off is maintaining the selector map as the design system evolves — but that's a small price for giving the LLM color awareness.
Token impact is negligible: ~2-5 extra tokens per annotated element, ~20-100 total per page.
Trick 4: Hidden Controls Discovery via ARIA Hints
Here's a problem unique to rich web applications: many critical controls are hidden. Dropdown menus, popup editors, modals, side drawers — they only exist in the DOM after a trigger element is clicked. The a11y tree captures the trigger but not what it opens.
On a single catalog page, we found 9 distinct hidden control types: item detail popups, inline price editors, action menus (edit/retire/delete), shipping configuration panels, lifecycle filters, price range filters, fulfillment type filters, sort drawers, and a full filter panel with 19 expandable sections.
The a11y tree sees: [button collapsed] $ 100.33. It doesn't know that clicking it opens a pricing editor with competitive pricing data, a base price input, and Apply/Close buttons.
The partial solution comes from ARIA attributes that are already in the DOM:
[button collapsed haspopup=menu] $ 100.33
[button collapsed haspopup=dialog] --
[button collapsed haspopup=listbox] Lifecycle
aria-haspopup tells you something is behind the button. aria-controls can reference the target element by ID. The LLM now knows enough to say "click the price value — it opens a pricing menu" instead of giving generic instructions.
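Formatting these hints is a few attribute reads during the walk. A sketch, where `AttrReader` is a minimal stand-in so the logic works on any object with a `getAttribute` method, not just a live `Element`:

```typescript
// Minimal structural type: a real Element satisfies this, and so does a stub.
type AttrReader = { getAttribute(name: string): string | null };

// Collect popup hints that are already present in the DOM for a trigger element.
function popupHints(el: AttrReader): string {
  const parts: string[] = [];
  const expanded = el.getAttribute('aria-expanded');
  if (expanded === 'false') parts.push('collapsed');
  if (expanded === 'true') parts.push('expanded');
  const popup = el.getAttribute('aria-haspopup');
  // Per the ARIA spec, aria-haspopup="true" is equivalent to "menu".
  if (popup && popup !== 'false') {
    parts.push(`haspopup=${popup === 'true' ? 'menu' : popup}`);
  }
  const controls = el.getAttribute('aria-controls');
  if (controls) parts.push(`controls=${controls}`); // ID of the element it opens
  return parts.join(' ');
}
```

A trigger with aria-expanded="false" and aria-haspopup="menu" comes out as `collapsed haspopup=menu`, which is exactly the annotation shown in the examples above.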
For high-value pages, we layer a static action catalog on top — a JSON registry mapping trigger types to available actions:
const KNOWN_ACTIONS = {
  "price-editor": {
    trigger: "Click price cell in table",
    contains: ["Base price (editable)", "Competitive price",
               "Buy Box price", "Active pricing programs"],
    actions: ["Update base price", "View competitive pricing"]
  },
  "actions-menu": {
    trigger: "Click three-dot icon in row",
    contains: ["Edit item", "Retire item", "Delete item"],
    actions: ["Edit", "Retire from marketplace", "Delete permanently"]
  }
};
The ARIA enrichment is automatic (works on every page). The action catalog is manual but provides specifics for the pages that matter most. Together, they bridge the gap between "I see a clickable element" and "I know what it does."
Trick 5: The Stale Snapshot Problem
This one bit us hard. The a11y tree is captured at a point in time — but if you capture at page load, you get the loading state. Skeleton screens. Spinner text. "Loading..." placeholders.
Here's the timeline of the bug:
T=0ms User navigates to /catalog
T=200ms Page shell renders (skeleton UI)
T=300ms Data fetch fires (GET /api/items)
T=1000ms A11y tree captured -> gets "Loading..." skeleton
T=1500ms API response arrives -> React renders actual table
T=2000ms User asks "What items do I have?"
T=2000ms Message sent with stale "Loading..." snapshot
Our initial approach was a 1000ms debounce after navigation plus a MutationObserver that re-extracted on significant DOM changes (5+ added/removed nodes). But the MutationObserver had its own 1500ms debounce, and by the time it fired, the context had already been sent.
The fix was conceptually simple: re-extract at the moment the user sends a message, not at page load. When the user hits Send, the frontend captures a fresh a11y tree snapshot synchronously (~5-10ms on 500 nodes) and attaches it to the message payload. The snapshot is always current because it reflects exactly what the user sees when they ask their question.
We kept the background extraction as a pre-cache for proactive features, but the on-send extraction always wins for message context. The MutationObserver still monitors for table row additions (a good heuristic for "data just loaded") to keep the background cache fresh, but it's no longer the critical path.
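The send path reduces to a small rule: always re-extract, never attach the cache. A sketch of the payload builder, with `extract` standing in for whatever produces the a11y tree text (the fresh synchronous walk, ~5-10ms for ~500 nodes); the function and field names here are illustrative, not our production API:

```typescript
interface PageContext {
  snapshot: string;    // a11y tree text captured at send time
  capturedAt: number;  // epoch ms, for debugging staleness reports
}

// The background cache is deliberately NOT a parameter: message context
// must never come from it, only from a fresh extraction at send time.
function buildMessagePayload(
  userText: string,
  extract: () => string,
  now: () => number = Date.now,
): { text: string; pageContext: PageContext } {
  return {
    text: userText,
    pageContext: { snapshot: extract(), capturedAt: now() },
  };
}
```

Injecting `extract` and `now` as parameters keeps the timing logic trivially testable, which matters for a bug class that is all about timing.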
Trick 6: Structured Data Extraction and Table Deduplication
The a11y tree handles layout and UI state well, but for data tables it represents values positionally — the LLM has to count cells to figure out which column a value belongs to. Ask "what's the price of Laptop Pro 15?" and the model needs to count across: Item Name, SKU, Status, Price. For a table with 25 rows and 11 columns, this is error-prone.
The fix: for pages with data tables, extract a parallel structured data representation — read thead for headers, map each tbody row by position, and output clean JSON with explicit header-to-value mapping:
{
  "headers": ["Item Name", "SKU", "Status", "Price"],
  "data": [
    {"Item Name": "Laptop Pro 15", "SKU": "sku-0082", "Status": "Unpublished", "Price": "$299.00"},
    {"Item Name": "Widget B", "SKU": "sku-1234", "Status": "Published", "Price": "$49.99"}
  ]
}
Now the LLM doesn't count — it reads "Price": "$299.00" directly.
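The header-to-value mapping is the whole trick, and it is tiny. A sketch with the DOM reads abstracted away: `headers` and `rows` stand in for the texts a real extractor pulls from thead th and tbody td cells in document order:

```typescript
// Zip header texts with each row's cell texts into explicit key-value records.
function toStructuredTable(headers: string[], rows: string[][]) {
  const data = rows.map(cells =>
    Object.fromEntries(
      headers.map((h, i) => [h, cells[i] ?? ''] as [string, string]),
    ),
  );
  return { headers, data };
}
```

Feeding it the catalog example's headers and rows reproduces the JSON above; a missing trailing cell simply maps to an empty string rather than shifting every later column.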
But this creates a duplication problem. The table data now appears in both the a11y tree and the structured JSON. On a catalog page with 25 rows, that wastes ~400-700 tokens — 35-40% of the combined payload.
The fix is conditional exclusion: when the structured data extractor succeeds, skip table-related roles during the a11y tree walk.
const TABLE_ROLES = new Set([
  'table', 'rowgroup', 'row', 'columnheader',
  'cell', 'rowheader', 'gridcell'
]);

if (structuredData) {
  a11yResult = buildA11yTreeSnapshot(root, { excludeRoles: TABLE_ROLES });
} else {
  a11yResult = buildA11yTreeSnapshot(root); // full tree as fallback
}
The a11y tree shrinks from 7,618 chars to 505 chars (93% reduction). Total prompt length drops 35%. The table data lives exclusively in the structured JSON where the LLM has explicit header-to-value mapping — no positional counting needed.
The key is making it conditional. Pages without a registered extractor still get the full a11y tree as their only table representation. The skip only activates when structured data provides a superior alternative.
Putting It All Together: The Prompt Template
Every trick above feeds into one thing: the system prompt the LLM actually sees. Here's what the assembled prompt looks like for a data table page (sanitized from our production template):
You are an AI assistant for [platform name].
The user is on a page in the platform. Below is a description of what's
currently visible on their screen.
--- Structured Data (machine-readable table data) ---
Use this for data questions (counting, comparing values, filtering):
<structured_data>
{
  "headers": ["Item Name", "SKU", "Status", "Price"],
  "data": [
    {"Item Name": "Laptop Pro 15", "SKU": "sku-0082", "Status": "Unpublished", "Price": "$299.00"},
    {"Item Name": "Widget B", "SKU": "sku-1234", "Status": "Published", "Price": "$49.99"}
  ]
}
</structured_data>
--- Page Content (layout and UI elements) ---
Use this for layout/navigation questions (where is X, what buttons exist):
<page_content>
[heading level=1] Catalog
[tab selected] All
[tab] Unpublished (39746)
[button collapsed haspopup=menu] Filters
[button collapsed haspopup=menu] Sort
[searchbox] Search items
[status] Showing 1-25 of 114,827 items
</page_content>
--- Hidden Controls (not visible in page content above) ---
These controls exist but appear only after clicking a trigger element.
<hidden_controls>
- Click any item name to open a detail popup showing full item info,
images, and status history
- Click a price cell to open a pricing editor with base price,
competitive pricing, and Buy Box data
- Click the three-dot icon on any row for actions: Edit, Retire, Delete
- Click "Filters" to expand 19 filter sections: Lifecycle, Price Range,
Fulfillment Type, etc.
- Click "Sort" to choose: Item Name, Price, Status, Date Created
- Type in the search box to filter items by name, SKU, or GTIN
</hidden_controls>
Page Name: Catalog
A few things to notice:
Two representations, not one. Structured data (JSON) handles precise data questions — "what's the cheapest item?" requires comparing values, which JSON makes trivial. The a11y tree (text) handles spatial questions — "what tabs are available?" or "is there a search box?" The LLM is told which to use for what.
Section headers are instructions. "Use this for data questions" and "Use this for layout questions" aren't decorative — they steer the model's attention. Without them, the LLM sometimes ignores the structured data and tries to answer data questions from the a11y tree text, which requires positional counting and fails on large tables.
The a11y tree is deduplicated. Notice the page content section has no table rows — those live exclusively in the structured JSON. This is Trick 6 in action, saving ~35% of the combined token budget.
Hidden controls fill the gap. The a11y tree shows [button collapsed haspopup=menu] Filters but can't describe what's inside. The hidden controls section tells the LLM there are 19 filter sections — so it can say "click Filters to access Lifecycle, Price Range..." instead of "there's a Filters button."
Total cost: ~1,500-2,500 tokens for a page that would be ~37,500 tokens as raw HTML. That's the 30-50x reduction with full context preserved — structured data for precision, a11y tree for layout, hidden controls for discoverability.
This is the extraction side of the problem. How you use this context — query classification, routing, RAG enrichment — is up to your agent architecture.
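For concreteness, here's roughly what the assembly step can look like. This is a sketch, not our production template builder; the section headers mirror the template above, and all inputs are the plain strings produced by the earlier tricks:

```typescript
// Assemble the page-context portion of the system prompt from up to three
// pieces: structured table JSON, deduplicated a11y tree text, and the static
// hidden-controls catalog. Optional pieces are omitted cleanly when absent.
function buildContextPrompt(opts: {
  structuredData?: string;  // JSON string from the table extractor, if any
  pageContent: string;      // deduplicated a11y tree text
  hiddenControls?: string;  // action-catalog text, if registered for this page
  pageName: string;
}): string {
  const parts: string[] = [];
  if (opts.structuredData) {
    parts.push('--- Structured Data (machine-readable table data) ---');
    parts.push('Use this for data questions (counting, comparing values, filtering):');
    parts.push(`<structured_data>\n${opts.structuredData}\n</structured_data>`);
  }
  parts.push('--- Page Content (layout and UI elements) ---');
  parts.push('Use this for layout/navigation questions (where is X, what buttons exist):');
  parts.push(`<page_content>\n${opts.pageContent}\n</page_content>`);
  if (opts.hiddenControls) {
    parts.push('--- Hidden Controls (not visible in page content above) ---');
    parts.push('These controls exist but appear only after clicking a trigger element.');
    parts.push(`<hidden_controls>\n${opts.hiddenControls}\n</hidden_controls>`);
  }
  parts.push(`Page Name: ${opts.pageName}`);
  return parts.join('\n');
}
```

Making each section conditional matters: a page without a registered extractor gets no empty structured-data section, so the "use this for data questions" instruction never points the LLM at a blank block.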
What I'd Do Differently
Start with on-send extraction, not page-load extraction. We spent cycles debugging stale snapshot timing issues that would have been avoided entirely by capturing at message-send time from the start.
Build the structured data extractor as generic, not page-specific. Our first extractor was custom for one page type. But the logic — read headers from thead th, map row cells by position — works on any standard HTML table. A generic table auto-detector would have covered 80% of pages with zero per-page work.
Don't skip SVGs entirely. We initially skipped all <svg> elements as "visual-only." But many convey meaning — checkmarks, warning triangles, info circles. Checking for aria-label, parent labels, and <title> elements recovers semantic meaning from icons that would otherwise produce zero output.
The Numbers
For a typical data table page with 25 rows:
| Representation | Size | Tokens |
|---|---|---|
| Raw HTML | ~150 KB | ~37,500 |
| DOM JSON (pruned) | ~20-50 KB | ~5,000-12,500 |
| Markdown | ~10-20 KB | ~2,500-5,000 |
| A11y tree | ~3-5 KB | ~750-1,250 |
| A11y tree + structured JSON | ~5-8 KB | ~1,250-2,000 |
| A11y tree (deduplicated) + JSON | ~3-5 KB | ~800-1,250 |
The a11y tree approach gives us a 30-50x reduction over raw HTML while preserving everything the LLM needs: semantic roles, interactive states, element names, and current values. The deduplication trick shaves another 35% when structured extractors are available.
Extraction takes <10ms on the main thread for 500 nodes. No external dependencies. No vision models. No API calls. Just a recursive DOM walk using the same principles screen readers have relied on for years.
Further Reading
- Playwright MCP — Uses accessibility snapshots as its primary page representation
- How Accessibility Tree Formatting Affects Token Cost in Browser MCPs — Token cost analysis across serialization formats
- Building Browser Agents: Architecture, Security, and Practical Solutions — Academic survey on a11y trees as the dominant page representation
- AgentOccam — Research showing plain a11y trees match or beat vision-augmented approaches
- Reader-LM — Jina's alternative approach via HTML-to-Markdown conversion
- W3C Accessible Name and Description Computation — The spec behind Trick 2's name resolution
- WAI-ARIA Roles Model — Role taxonomy used in Trick 1
The accessibility tree was designed to make the web understandable for people who can't see it. Turns out, it's also the best way to make the web understandable for AI that can't see it either. Same tree, different consumer, same principle: semantic structure beats raw presentation every time.
Next in this series: giving the agent temporal awareness — tracking what users were doing, not just what's on the page.