DEV Community: Gaurav Chodwadia

Giving AI Agents Eyes (Part 2): From Page Snapshots to Interaction Traces

Gaurav Chodwadia — Mon, 18 May 2026 17:50:19 +0000

In Part 1, we solved the representation problem: how to give an LLM a compact, semantic view of a web page using accessibility trees. That gave our AI agent the ability to answer "what is on this page?"

But users don't ask that. They ask "what did I just click?" and "what was in that popup I closed?" Here's the gap, concretely:

T=0s    User clicks "Blue Widget" in a data table
T=1s    Popup appears with item details (price, status, GTIN)
T=3s    User closes popup
T=5s    User opens AI assistant, asks "what's wrong with this item?"
T=5s    Assistant captures page state: sees 25 items in table, no popup
T=5s    Assistant has zero context about which item or what was in the popup

A page snapshot is a photograph. The user is asking about a video. The agent needs two things it doesn't have: (1) a log of recent interactions (the user clicked "Blue Widget"), and (2) snapshots of ephemeral UI that no longer exists (the popup that showed price $24.99 and status Suppressed).

Session replay tools (rrweb, FullStory, LogRocket) solve a related problem, but they produce DOM serialization data designed for visual playback — not semantic descriptions for LLM consumption. You need ~200 tokens of natural language, not 200 KB of mutation records.

This post covers how we built a user-activity tracker from inside a Module Federation remote that doesn't own the host page, three bugs that compounded into a "one behind" symptom, and how we serialize 20 interaction events into a few hundred tokens of LLM-friendly text.

Strategy Pattern for Extensible Event Capture

Click events are one source — but probably not the only one we'll ever want. The host app might later emit structured events via a PubSub system; a future host could fire shell-level navigation events that we'd want to record too. We didn't want to rewrite the tracker the day either of those landed.

                     ActivitySource (interface)
                    /         |          \
                   /          |           \
        DomActivitySource  HostEventSource  HybridSource
        (capture-phase       (PubSub from    (both, with
         click listener)      host app)       priority rules)
                   \          |          /
                    \         |         /
                     useActivityTracker (hook)
                             |
                     Feature flag selects source at runtime:
                     "dom" | "host" | "hybrid" | "off"

The interface is minimal:

interface ActivitySource {
  start(): void;
  stop(): void;
  getTrace(): UserActivityTrace;
  clear(): void;
  onEvent?: () => void;  // callback when a new event is captured
}

Sources are plain classes; the React layer is one thin hook (useActivityTracker) that owns construction and cleanup. Hooks would have worked too — this isn't a religious choice. We went with classes for three pragmatic reasons:

Composability without React. HybridSource creates a DomActivitySource and HostEventSource internally and merges their outputs. Doable with hooks, but composition requires wrapper components or hook-forwarding patterns; instantiating two class instances and merging is simpler.
Testability without rendering. You can instantiate a DomActivitySource in a unit test with jsdom, call start(), simulate clicks, and assert on getTrace() — no renderHook, no act(), no React tree.
Future-proofing for non-React consumers. If the tracker ever needs to run in a worker, a vanilla shell, or a non-React MFE, a plain class moves over unchanged.
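
To make the composability point concrete, here is a minimal sketch of how a HybridSource could merge two child sources without any React involved. The ActivityEvent and UserActivityTrace shapes, the merge-by-timestamp rule, and the 20-event cap are assumptions for illustration; the post does not spell out the real priority rules.

```typescript
// Hypothetical shapes: the post does not show these types.
interface ActivityEvent { timestamp: number; role: string; name: string; }
interface UserActivityTrace { events: ActivityEvent[]; }

interface ActivitySource {
  start(): void;
  stop(): void;
  getTrace(): UserActivityTrace;
  clear(): void;
  onEvent?: () => void;
}

class HybridSource implements ActivitySource {
  onEvent?: () => void; // declared explicitly on the class
  private readonly sources: ActivitySource[];
  private readonly limit: number;

  constructor(sources: ActivitySource[], limit = 20) {
    this.sources = sources;
    this.limit = limit;
    // Forward child notifications to whoever subscribed to the hybrid.
    for (const s of sources) {
      s.onEvent = () => this.onEvent?.();
    }
  }

  start(): void { this.sources.forEach((s) => s.start()); }
  stop(): void { this.sources.forEach((s) => s.stop()); }
  clear(): void { this.sources.forEach((s) => s.clear()); }

  // Assumed merge rule: interleave by timestamp, newest first, then cap.
  getTrace(): UserActivityTrace {
    const merged: ActivityEvent[] = [];
    for (const s of this.sources) merged.push(...s.getTrace().events);
    merged.sort((a, b) => b.timestamp - a.timestamp);
    return { events: merged.slice(0, this.limit) };
  }
}
```

Because the children are plain objects behind the same interface, a unit test can pass in in-memory fakes instead of real DOM or PubSub sources.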

Capture-Phase Click Listeners

Our AI assistant is loaded into a large SaaS dashboard as a Module Federation remote. It's not an iframe — it shares the same document as the host application. This is the architectural fact that makes activity tracking possible.

Because we share the DOM, a single listener captures every click on the host page:

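this.boundHandleClick = this.handleClick.bind(this);  // keep the reference so stop() can remove the listener
document.addEventListener("click", this.boundHandleClick, { capture: true });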

The capture phase ({ capture: true }) is important. It fires before the target element's own click handlers, before any bubble-phase stopPropagation() calls in the host app can prevent us from seeing the event. (Caveat: a host-app listener installed earlier on document in the same phase that calls stopImmediatePropagation() will still hide events from us. We've never seen this in the wild — most apps only stopPropagation() from inside component handlers, which is bubble-phase — but it's the one way the host can blind us if they want to.)
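
One detail worth sketching is listener cleanup: removeEventListener only works when it receives the same function reference and the same capture flag that were used to register. The wrapper below is an illustrative sketch, not the tracker's actual code; ClickTargetLike and CaptureClickListener are hypothetical names, and the structural type exists only so the sketch does not depend on DOM typings (in a browser the intended target is document).

```typescript
// Hypothetical structural type standing in for `document`.
interface ClickTargetLike {
  addEventListener(
    type: string,
    listener: (e: unknown) => void,
    options?: { capture?: boolean }
  ): void;
  removeEventListener(
    type: string,
    listener: (e: unknown) => void,
    options?: { capture?: boolean }
  ): void;
}

class CaptureClickListener {
  private readonly target: ClickTargetLike;
  private readonly bound: (e: unknown) => void;
  private attached = false;

  constructor(target: ClickTargetLike, handler: (e: unknown) => void) {
    this.target = target;
    // Create the wrapper once so attach() and detach() share one reference.
    this.bound = (e) => handler(e);
  }

  attach(): void {
    if (this.attached) return; // avoid double registration
    this.target.addEventListener("click", this.bound, { capture: true });
    this.attached = true;
  }

  detach(): void {
    if (!this.attached) return;
    // Same reference and same capture flag as in attach().
    this.target.removeEventListener("click", this.bound, { capture: true });
    this.attached = false;
  }
}
```

A source's start() and stop() could delegate to attach() and detach(), which keeps the hook's cleanup path from leaking a listener on the host document.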

But we don't want to log every click. A click on a wrapper div, a scroll container, or a decorative icon is noise. The extractClickDescriptor step relies on a walk-up helper (findNearestInteractive below) that climbs from the event target to the nearest actionable element:

const ACTIONABLE_ROLES = new Set([
  "button", "link", "menuitem", "tab", "checkbox",
  "radio", "combobox", "textbox", "searchbox", "switch", "row",
]);

function findNearestInteractive(target: Element): Element | null {
  let el: Element | null = target;
  while (el) {
    // Skip our own UI
    if (el.matches('[data-assistant-container="true"]')) return null;
    // Respect opt-out attribute
    if (el.matches("[data-no-track]")) return null;

    const role = getRole(el);
    if (role && ACTIONABLE_ROLES.has(role)) return el;
    el = el.parentElement;
  }
  return null;
}

The walk-up pattern is essential because the actual click target is usually a child of the interactive element — a <span> inside a <button>, an <svg> inside a link. The allowlist focuses on ARIA roles that represent user-initiated actions, which eliminates the vast majority of noise.
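
The snippet above calls getRole without showing it. A rough sketch of what such a helper might look like follows: an explicit role attribute wins, otherwise a small table of implicit roles for semantic HTML applies. The tables here are deliberately partial and are an assumption, not the post's implementation; they ignore refinements such as the list attribute on inputs, and RoleElementLike is a hypothetical structural type so the sketch runs without DOM typings.

```typescript
// Hypothetical structural type; a DOM Element has both members.
interface RoleElementLike {
  tagName: string;
  getAttribute(attr: string): string | null;
}

// Partial implicit-role tables (assumed, not exhaustive).
const IMPLICIT_TAG_ROLES = new Map<string, string>([
  ["BUTTON", "button"],
  ["TR", "row"],
  ["TEXTAREA", "textbox"],
]);

const INPUT_TYPE_ROLES = new Map<string, string>([
  ["checkbox", "checkbox"],
  ["radio", "radio"],
  ["search", "searchbox"],
  ["text", "textbox"],
  ["email", "textbox"],
  ["button", "button"],
  ["submit", "button"],
]);

function getRole(el: RoleElementLike): string | null {
  // 1. An explicit ARIA role wins; use the first token if several are listed.
  const explicit = el.getAttribute("role");
  if (explicit) {
    const first = explicit.trim().split(/\s+/)[0];
    if (first) return first.toLowerCase();
  }

  const tag = el.tagName.toUpperCase();

  // 2. <a> is only a link when it has an href.
  if (tag === "A") return el.getAttribute("href") !== null ? "link" : null;

  // 3. <input> depends on its type; password inputs map to no role.
  if (tag === "INPUT") {
    const type = (el.getAttribute("type") || "text").toLowerCase();
    return INPUT_TYPE_ROLES.get(type) ?? null;
  }

  // 4. Everything else: table lookup, or null for wrappers like <div>.
  return IMPLICIT_TAG_ROLES.get(tag) ?? null;
}
```

Returning null for role-less wrappers is what lets the walk-up loop keep climbing until it reaches something actionable.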

Additional noise filters:

Debounce — Same element clicked within 300ms is suppressed (double-click)
Name required — Elements with no accessible name and no text content are ignored
Container guard — Clicks inside the AI panel itself are excluded

The result on a typical data-table page: roughly one actionable event per real click (occasionally two when a row + nested button both qualify), out of dozens of raw click events the page absorbs. No firehose.
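
The debounce and name-required filters can be sketched as a small stateful check that runs after the walk-up has found an actionable element. This is an illustrative sketch: ClickCandidate and elementKey are hypothetical (real DOM code would more likely compare Element references), and timestamps are passed in explicitly so the logic can be tested without timers.

```typescript
// Hypothetical input shape for the filter.
interface ClickCandidate {
  elementKey: string; // stand-in for element identity
  name: string;       // accessible name or text content
  at: number;         // click time in milliseconds
}

class ClickNoiseFilter {
  private lastKey: string | null = null;
  private lastAt = 0;
  private readonly windowMs: number;

  constructor(windowMs = 300) {
    this.windowMs = windowMs;
  }

  // Returns true when the click should be recorded.
  accept(c: ClickCandidate): boolean {
    // Name required: nothing useful to tell the LLM otherwise.
    if (c.name.trim() === "") return false;

    // Debounce: same element again inside the window is a double-click.
    const isRepeat =
      c.elementKey === this.lastKey && c.at - this.lastAt < this.windowMs;

    this.lastKey = c.elementKey;
    this.lastAt = c.at;
    return !isRepeat;
  }
}
```

The container guard is not repeated here because the walk-up helper already returns null for clicks inside the assistant's own panel.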

Three Bugs That Compounded

The symptom looks like one bug. It's three.

Your activity trace is always one interaction behind. The user clicks Item A, asks the assistant, and the assistant either sees nothing or sees the previous click. That's the single symptom we chased.

What we actually had was three failure modes of the same underlying mismatch — synchronous DOM events meeting asynchronous React state. Each gave the bug somewhere to hide. Fixing any one of them in isolation didn't move the needle.

The setup: The remote MFE captures DOM events synchronously. The captured data flows into React state, which flows through React Context to a consumer component in a different MFE (the AI chat UI), which reads the context when the user sends a message.

Bug 1: Conditional trace inclusion. Background page-content extraction fires under three triggers (initial page load, significant DOM mutations, and route navigation), in addition to the on-send extraction described in Part 1. We initially included the activity trace only when the trigger was "message-send". The three background triggers produced snapshots without traces. React deduplication kept the traceless version in state, so the chat UI read an empty trace at send time.

Fix: Always include the trace in every extraction, regardless of trigger.

Bug 2: No extraction on click events. Click events trigger none of the three extraction conditions above. The snapshot already in state has the trace from the last extraction — which captured clicks before the latest one.

Fix: Add an onEvent callback to the source interface. When a click is captured, fire the callback. In the page-content hook, the callback patches React state with the latest trace:

// In the page-content hook
const handleActivityEvent = useCallback(() => {
  const trace = getTraceRef.current();
  setSnapshot((prev) => {
    if (!prev) return prev;
    return { ...prev, userActivity: trace };
  });
}, []);

// In the DOM source's click handler
this.events.unshift(descriptor);
this.onEvent?.();  // patch React state immediately

Bug 3: Undeclared class property. The ActivitySource interface declares onEvent?: () => void, and the hook assigns it after construction:

source.onEvent = () => onEventRef.current?.();
source.start();

But the class never declared onEvent as a field:

// BROKEN — onEvent is not declared on the class
class DomActivitySource implements ActivitySource {
  private events: ActivityEvent[] = [];
  // ...
  handleClick(event: MouseEvent): void {
    // ...
    this.onEvent?.();  // always undefined
  }
}

In plain JS, source.onEvent = fn on an instance creates an own property that any later method call would see through this. We didn't get that — somewhere in our toolchain (TypeScript with useDefineForClassFields, plus a class-fields transform in the build pipeline), the assignment landed somewhere the class methods didn't read from. The exact mechanism was build-config specific, and chasing it stopped being interesting once we found the one-line fix:

// FIXED — explicit declaration
class DomActivitySource implements ActivitySource {
  onEvent?: () => void;    // this line fixes the bug
  private events: ActivityEvent[] = [];
  // ...
}

Declaring the field made initialization deterministic and the bug went away across every build target. What made it hard to find in the first place: handleClick works perfectly when you call it directly in a unit test. It only failed in production, where the method was bound and invoked by document.addEventListener.

General lesson: When you implement an interface with optional callback properties, declare them explicitly on the class. Don't rely on whatever your transpiler does with externally-assigned-but-undeclared properties — the semantics are subtle enough that an explicit field is the safer default. One line of TypeScript is cheaper than three days of debugging.
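
Since calling handleClick directly hid the problem, one cheap guard is a regression check that wires the callback the way production does: assign onEvent after construction, then invoke the handler through a detached bound reference. The sketch below uses a stripped-down hypothetical class (MiniSource) and does not reproduce the toolchain-specific failure; it only pins down the behavior the fix is supposed to guarantee.

```typescript
// Stripped-down stand-in for the real source class (hypothetical).
class MiniSource {
  onEvent?: () => void; // explicit declaration, as in the fix above
  private count = 0;

  handleClick(): void {
    this.count += 1;
    this.onEvent?.(); // must see the callback assigned after construction
  }

  getCount(): number {
    return this.count;
  }
}

// Mimic production wiring and report how many notifications arrived.
function wireAndClick(source: MiniSource, clicks: number): number {
  let notified = 0;
  source.onEvent = () => { notified += 1; };

  // addEventListener receives a detached function, so bind first.
  const listener = source.handleClick.bind(source);
  for (let i = 0; i < clicks; i++) listener();

  return notified;
}
```

Running a check like this against the compiled output of each build target, rather than only against the source, is what would have surfaced the mismatch earlier.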

Token-Efficient Activity Traces for LLMs

A raw JSON array of 20 events runs ~600 tokens once you account for repeated keys, escaped quotes, and structural noise. We compress it to ~300 by serializing to compact natural language:

RECENT USER ACTIONS (newest first):
- 14:32:05 click [button] "Edit Item" (in row: Blue Widget SKU-1234)
- 14:31:58 click [link] "Blue Widget" (in row: Blue Widget SKU-1234)
- 14:31:45 click [button] "Apply" (in toolbar: Filters)
- 14:31:30 click [checkbox] "Active" (in group: Lifecycle filter)

The non-obvious choice in this format is row context, not landmark context. For data-table apps, "in row: Blue Widget SKU-1234" is far more useful than "in region: main content." The row context tells the LLM which data record the interaction was about. We extract it by walking up to the nearest <tr> or [role="row"] and joining the cell text with pipes (truncated to 60 chars). The LLM almost always asks about that one row — and now it knows which one.
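
A rough sketch of the row-context extraction described above: walk up to the nearest row, then join and truncate the cell text. RowNodeLike is a hypothetical structural type, and the separator and trailing ellipsis are assumptions (the paragraph mentions pipes, while the sample lines show no separator, so the separator is left as a parameter).

```typescript
// Hypothetical structural type; a DOM Element has all three members.
interface RowNodeLike {
  tagName: string;
  getAttribute(attr: string): string | null;
  parentElement: RowNodeLike | null;
}

// Walk up to the nearest <tr> or [role="row"], if any.
function findRow(start: RowNodeLike | null): RowNodeLike | null {
  for (let el = start; el; el = el.parentElement) {
    if (el.tagName.toUpperCase() === "TR" || el.getAttribute("role") === "row") {
      return el;
    }
  }
  return null;
}

// Join non-empty cell texts and cap the result at maxChars.
function buildRowContext(
  cellTexts: string[],
  maxChars = 60,
  separator = " | "
): string {
  const joined = cellTexts
    .map((t) => t.replace(/\s+/g, " ").trim())
    .filter((t) => t.length > 0)
    .join(separator);
  // Assumed truncation style: keep maxChars total, ending in an ellipsis.
  return joined.length > maxChars ? joined.slice(0, maxChars - 1) + "…" : joined;
}
```

In DOM code the cellTexts array would come from the row's cells (for example, the textContent of each td), which is left out here to keep the sketch free of DOM dependencies.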

The other choices are more obvious in hindsight: HH:MM:SS timestamps over ISO strings (time of day is enough within a single session, and it saves ~15 chars per line); [button] / [link] role brackets matching the same taxonomy the a11y tree uses; newest-first ordering so the most-likely-relevant action is at the top.
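
Putting those format choices together, a serializer can be sketched in a few lines. The TraceEvent shape is hypothetical, and the clock is formatted in UTC only to keep the example deterministic; a real tracker would more likely use the user's local time.

```typescript
// Hypothetical event shape; `context` is the pre-built "in row: ..." text.
interface TraceEvent {
  timestamp: number; // epoch milliseconds
  action: string;    // e.g. "click"
  role: string;      // e.g. "button"
  name: string;      // accessible name
  context?: string;
}

function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// HH:MM:SS in UTC (an assumption made for reproducible output).
function formatClock(ms: number): string {
  const d = new Date(ms);
  return [d.getUTCHours(), d.getUTCMinutes(), d.getUTCSeconds()].map(pad2).join(":");
}

function serializeTrace(events: TraceEvent[]): string {
  // Newest first, without mutating the caller's array.
  const sorted = events.slice().sort((a, b) => b.timestamp - a.timestamp);
  const lines = sorted.map((e) => {
    const ctx = e.context ? ` (${e.context})` : "";
    return `- ${formatClock(e.timestamp)} ${e.action} [${e.role}] "${e.name}"${ctx}`;
  });
  return ["RECENT USER ACTIONS (newest first):", ...lines].join("\n");
}
```

Each event costs one line with no repeated keys or structural punctuation, which is where the savings over a raw JSON array come from.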

The buffer. 20 events in a circular ring, oldest evicted first, reset on page navigation. At a leisurely 1 click every 2-5 seconds, 20 covers ~40-100 seconds — enough for a complete workflow (filter, search, click item, browse popup, close popup). Real users sometimes burst faster than that; we cover that case in the limitations section.
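
A minimal sketch of such a fixed-capacity buffer, assuming a generic item type and a hypothetical class name (RecentEventBuffer). It overwrites the oldest slot once full and reads back newest first, matching the ordering the serializer wants.

```typescript
class RecentEventBuffer<T> {
  private readonly slots: T[] = [];
  private head = 0; // index of the next write position
  private readonly capacity: number;

  constructor(capacity = 20) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  push(item: T): void {
    if (this.slots.length < this.capacity) {
      this.slots.push(item);        // still filling up
    } else {
      this.slots[this.head] = item; // overwrite the oldest entry
    }
    this.head = (this.head + 1) % this.capacity;
  }

  // Returns a copy ordered from newest to oldest.
  newestFirst(): T[] {
    const n = this.slots.length;
    const out: T[] = [];
    for (let i = 1; i <= n; i++) {
      out.push(this.slots[(this.head - i + n) % n]);
    }
    return out;
  }

  // Called on page navigation.
  clear(): void {
    this.slots.length = 0;
    this.head = 0;
  }
}
```

Compared with unshift on a plain array, pushes stay constant-time and the buffer never grows past its capacity; at 20 entries either approach is fast enough, so this is mostly about making the eviction rule explicit.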

Token cost. A typical event line tokenizes to ~15-25 tokens (timestamps and quoted names split poorly across tokenizers). A full 20-event buffer is ~300-500 tokens; popup snapshots add another ~200-500 each. Total overhead is roughly 8-15% on top of the existing page context payload — small per request, worth multiplying through your DAU × queries before calling it "negligible" at scale.

The format is injected into the LLM prompt with a framing instruction:

--- Recent User Activity ---
Use this to understand the user's intent and which item/element they are asking about.
<user_activity>
RECENT USER ACTIONS (newest first):
- 14:32:05 click [button] "Edit Item" (in row: Blue Widget SKU-1234)
...
</user_activity>

The framing line ("Use this to understand...") matters more than its character count suggests. In our eval set, removing it noticeably increased the rate at which the model ignored the activity section entirely and answered the page-snapshot question instead of the per-user-action question. Treat the framing as load-bearing, not decorative.

The Full Pipeline

End to end, one click becomes one line in a prompt:

click → capture-phase listener → extractClickDescriptor() → ring buffer (20)
      → onEvent → setState → React Context → chat UI on send
      → POST with userActivity → backend NL serializer → <user_activity> tag in prompt

Every arrow up to the POST is either a synchronous in-process call or a single React render, so the front-end portion of the path runs in well under a frame. The only network hop is the POST that carries the message itself.

Before vs After: What the Assistant Says

Back to the opening sequence — user clicks "Blue Widget", a popup with item details appears, user closes it, then opens the assistant. Here's what the LLM produces with and without the activity trace, given the same page state at message-send time (the table is on screen; the popup is gone):

Without activity trace:

User: what's wrong with this item?

Assistant: I can see your catalog has 25 items. Could you tell me which item you're asking about? You can click an item's row to see details.

With activity trace (~300 extra tokens in the prompt):

User: what's wrong with this item?

Assistant: You recently opened the details popup for Blue Widget (SKU-1234). Its current status is Suppressed and its price is $24.99. I can see the item's basic fields, but the popup didn't tell me why it's suppressed — do you want me to check the listing-quality details, or is there a specific error message you saw?

Same page, same model, same prompt template. The only difference is 8 lines of recent-click history plus the popup's a11y snapshot riding in the payload. The agent went from "which item?" to "I know exactly which item and what state it was in." Figuring out why it's suppressed still requires a separate lookup (a RAG hit or a tool call) — the activity trace gives the agent the what, not the why.

What I'd Do Differently

Declare every interface property on the class, not just required ones. TypeScript optional properties are tempting to leave undeclared, but the runtime semantics under various class-field transforms are subtle enough that an explicit field declaration is the safer default. Cost: one line. Benefit: a category of bugs you'll never see.

Design the chat UI as an observer of the activity tracker, not a coordinator. Our first cut had the chat UI calling into the activity layer to request a fresh trace at send time, which coupled the two MFEs and introduced a "when is the trace ready?" synchronization problem. The cleaner pattern that emerged: the activity tracker pushes updates whenever something interesting happens; the chat UI just reads whatever state is current at send time. One-way data flow, no cross-MFE coordination. If we were starting over, that's the contract we'd nail down on day one.

Treat MF integration as observation by policy, not capability. Module Federation remotes share the host's document and global scope. Observation (event listeners, mutation observers) is trivial. Interception (patching window.fetch, wrapping a shared store, replacing a singleton) is also technically possible — it just couples your remote to the host's internals and breaks the moment the host changes implementation. We adopted "observers only" as a policy after almost building a fetch-patching debug helper that would have rotted within a release. The runtime won't enforce this for you.

Limitations You Should Know About

This is the section the post would lose credibility without. The pattern works well for our setup, but it has real gaps.

Shadow DOM internals are invisible. A capture-phase listener on document still receives clicks that originate inside a shadow root (click is a composed event), but event.target is retargeted to the shadow host, so the tracker cannot tell which inner element was clicked. A parentElement walk that starts from an inner node also stops at the shadow boundary. If the host page uses Web Components (LitElement, Stencil, Salesforce Lightning Web Components, or anything else that mounts a design system into shadow DOM), your tracker will miss or mislabel most user interactions inside those components. The workaround is to read event.composedPath() instead of event.target and walk that array. That recovers the inner elements for open shadow roots, but for closed roots the path omits everything inside the root, so you still can't reach into them from the outside.
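
The composedPath() workaround can be sketched as a variant of the walk-up that iterates the path array instead of following parentElement. This is an assumption-laden sketch: PathElementLike and findInteractiveInPath are hypothetical names, and the role lookup is injected so the example runs without DOM typings.

```typescript
// Hypothetical structural type for the element entries in the path.
interface PathElementLike {
  getAttribute(attr: string): string | null;
}

// composedPath() also contains ShadowRoot, document, and window entries.
function isElementLike(node: unknown): node is PathElementLike {
  return (
    typeof node === "object" &&
    node !== null &&
    typeof (node as { getAttribute?: unknown }).getAttribute === "function"
  );
}

function findInteractiveInPath(
  path: ReadonlyArray<unknown>, // innermost target first
  roleOf: (el: PathElementLike) => string | null,
  actionableRoles: Set<string>
): PathElementLike | null {
  for (const node of path) {
    if (!isElementLike(node)) continue; // skip non-element entries
    // Same guards as the light-DOM walk-up.
    if (node.getAttribute("data-assistant-container") === "true") return null;
    if (node.getAttribute("data-no-track") !== null) return null;
    const role = roleOf(node);
    if (role && actionableRoles.has(role)) return node;
  }
  return null;
}
```

In the click handler this would be called as findInteractiveInPath(event.composedPath(), getRole, ACTIONABLE_ROLES). For closed shadow roots the path simply starts at the host, so the result is no worse than the event.target approach.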

The role allowlist captures keystroke targets. textbox, searchbox, and combobox are in ACTIONABLE_ROLES, which means a click on an input followed by typing is recorded with the input's accessible name. If a user's row data — names, addresses, account numbers — appears in the row context we attach to each event, that data is now in the activity trace and ends up in the LLM prompt. Three implications:

PII / PHI / PCI: if your dashboard shows regulated data, the activity trace will exfiltrate it to whatever model provider you're using. For PHI/PCI environments, this pattern likely isn't deployable without on-prem inference and a redaction layer.
Passwords: <input type="password"> doesn't usually carry role="textbox", but a custom component that wraps a password input can. Maintain an explicit denylist (e.g., any field inside [data-sensitive] or matching input[type=password]) and skip both the click and the row-context extraction.
Default to opt-in for row context: rather than always capturing sibling-cell text, mark tables that are safe with data-track-row-context="true". Anything else gets the click event without the row content.
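
The denylist and the opt-in rule can be combined into one capture decision per click. The sketch below is illustrative: the selectors come from the list above, while PolicyElementLike, CaptureDecision, and decideCapture are hypothetical names, and the structural type mirrors the matches and closest methods of a DOM Element.

```typescript
// Hypothetical structural type; a DOM Element has both methods.
interface PolicyElementLike {
  matches(selector: string): boolean;
  closest(selector: string): object | null;
}

const PASSWORD_SELECTOR = 'input[type="password"]';
const SENSITIVE_SELECTOR = "[data-sensitive]";
const ROW_CONTEXT_SELECTOR = '[data-track-row-context="true"]';

type CaptureDecision = "skip" | "click-only" | "click-with-row-context";

function decideCapture(el: PolicyElementLike): CaptureDecision {
  // Denylist first: sensitive fields record nothing at all.
  if (el.matches(PASSWORD_SELECTOR) || el.closest(SENSITIVE_SELECTOR) !== null) {
    return "skip";
  }
  // Row context only inside tables that explicitly opted in.
  if (el.closest(ROW_CONTEXT_SELECTOR) !== null) {
    return "click-with-row-context";
  }
  return "click-only";
}
```

Checking the denylist before the opt-in means a sensitive field inside an opted-in table is still skipped, which is the safer ordering when the two markers conflict.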

Bursty interactions overflow the buffer. The 20-event ring buffer assumes a leisurely 1 click every 2-5 seconds. Power users clearing 19 filter chips, or anyone doing bulk-select-then-action, can flush the buffer in under 3 seconds. The 300ms debounce only catches identical-element repeats, not bursty distinct events. If your app has these workflows, consider coalescing repeats ("clicked 8 filter chips: A, B, C…") or bumping the buffer to 50.
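
Coalescing could look roughly like the sketch below: consecutive events with the same role and container that arrive close together collapse into one entry. Everything here is an assumption for illustration, including the BurstEvent shape, the one-second gap, and the summary wording.

```typescript
// Hypothetical event shape; `group` is the enclosing toolbar or region name.
interface BurstEvent { timestamp: number; role: string; name: string; group: string; }

interface CoalescedEntry {
  role: string;
  group: string;
  names: string[];
  firstAt: number;
  lastAt: number;
}

// Expects events in chronological order (oldest first).
function coalesceBursts(events: BurstEvent[], maxGapMs = 1000): CoalescedEntry[] {
  const out: CoalescedEntry[] = [];
  for (const e of events) {
    const last = out.length > 0 ? out[out.length - 1] : undefined;
    const sameRun =
      last !== undefined &&
      last.role === e.role &&
      last.group === e.group &&
      e.timestamp - last.lastAt <= maxGapMs;
    if (last !== undefined && sameRun) {
      last.names.push(e.name); // extend the current burst
      last.lastAt = e.timestamp;
    } else {
      out.push({ role: e.role, group: e.group, names: [e.name], firstAt: e.timestamp, lastAt: e.timestamp });
    }
  }
  return out;
}

// One summary line per entry; long bursts list only the first few names.
function describeEntry(entry: CoalescedEntry, maxNames = 3): string {
  if (entry.names.length === 1) return `click [${entry.role}] "${entry.names[0]}"`;
  const shown = entry.names.slice(0, maxNames).join(", ");
  const more = entry.names.length > maxNames ? "…" : "";
  return `clicked ${entry.names.length} ${entry.role} items in ${entry.group}: ${shown}${more}`;
}
```

With this in place a burst occupies one buffer slot instead of many, so the 20-slot ring keeps covering the surrounding workflow.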

The "host-app exposure" assumption. Capturing events from a Module Federation remote works because the remote shares the host's document. Same-origin iframes can do this via parent.document. Cross-origin iframes need a postMessage bridge the host opts into. Chrome extension content scripts get DOM access automatically but live in an isolated JS world, so you lose React-fiber introspection. The pattern generalizes; the transport doesn't.

When This Pattern Applies

The core pattern — capture-phase listener, role-based filtering, circular buffer, compact NL serialization — works for AI agents embedded in role-rich, light-DOM SaaS UIs. Practical constraints:

You need DOM access to the host page. Module Federation in the same document is easiest. Same-origin iframes work via parent.document. Cross-origin iframes need a postMessage bridge. Pure Shadow-DOM hosts (Salesforce Lightning, heavy Web Components apps) need a different approach entirely.
You need semantic roles on interactive elements. ARIA roles, or at minimum semantic HTML (<button>, <a>, <input>), give you the vocabulary to describe what was clicked. Role-less sites (some Webflow / CMS-generated pages) won't yield much.
You need a framing instruction in the LLM prompt. Without explicit guidance, models will treat the trace as metadata.
You need a redaction story for any field that could carry PII, PHI, or PCI data. Capturing row context by default is the wrong choice for regulated environments; make it opt-in.

If you're building something similar, I'd love to hear how you bridged the synchronous-event-meets-async-React-state gap. We went with a callback into setState; the obvious alternatives are a Zustand/Redux store the chat UI subscribes to, an RxJS subject, or a window-scoped event bus. Which would you have picked?

The accessibility tree gave the agent spatial awareness — what is on the page. The activity trace gives it temporal awareness — what the user was doing. Together, they let the agent answer the question every user actually asks: "I was just doing something — help me with that."

Giving AI Agents Eyes (Part 1): 6 Tricks for Reading Web Pages Without Vision Models

Gaurav Chodwadia — Mon, 13 Apr 2026 22:09:45 +0000

There's a growing class of AI agents that don't browse the web autonomously — they live inside web applications. Figma AI lives inside the design tool. Notion AI lives inside the document editor. GitHub Copilot lives inside the IDE. And increasingly, enterprise SaaS platforms are embedding AI assistants directly into their dashboards.

These in-app agents face a problem that standalone chatbots don't: the user is staring at a complex UI — data tables, status badges, filters, modals — and expects the agent to see it too. Not a screenshot. Not a URL. The actual semantic content of what's on screen.

We hit this building an AI assistant for a large enterprise admin dashboard with dozens of page types. The assistant knew which page the user was on (via the URL), but had zero visibility into what was displayed on that page. The user asks "which items have errors?" while looking at a table of 25 items — and the agent has no idea.

The breakthrough came from an unlikely — but in hindsight, obvious — source: accessibility trees.

The Representation Problem

Before diving into the solution, let's frame the problem. You need to give an LLM a representation of what's currently on a web page. Your options:

Raw HTML — Feeding the DOM directly to an LLM is like handing someone the source code of a novel and asking them to summarize the plot. A typical admin dashboard page is ~150 KB of HTML. That's ~37,500 tokens — most of it CSS classes, data attributes, wrapper divs, and structural noise the model has to wade through.

Screenshots — Vision models have gotten remarkably good at reading pages, but they still struggle with dense data tables (25+ rows), and they can't see what's in the DOM: ARIA attributes like haspopup and expanded, disabled states that aren't visually distinct, or which element has focus. They also cost ~1,500-3,000 tokens per image and add capture latency. For data-heavy enterprise UIs, text representations are more reliable and cheaper.

Markdown conversion — Tools like Turndown can convert HTML to Markdown. It's readable, but you lose all interactive state (which button is disabled? which tab is selected? which checkbox is checked?). And it still runs 3-5x more tokens than what we ended up with.

DOM-to-JSON — Serializing the DOM tree to JSON preserves structure but is absurdly verbose. Even pruned, a typical page produces ~20-50 KB of JSON. The LLM has to navigate nested objects full of <div> wrappers that carry zero semantic meaning.

The winner? None of the above. The answer was hiding in plain sight — in the same tree that screen readers have used for decades.

Why Accessibility Trees

An accessibility tree is a parallel representation of the DOM that browsers maintain for screen readers. Its entire purpose is to answer the question: how do you describe this page to someone who can't see it? It strips away visual styling and structural noise, keeping only what matters: roles, names, states, and values.

That's exactly the question we're asking on behalf of an LLM. Screen readers and language models need the same thing — a compact, semantic, text-based description of what's on screen. The a11y tree has been solving this problem for decades. We just pointed it at a different consumer.

Here's what our a11y tree output looks like for an items catalog page:

[heading level=1] Catalog
[tab selected] All
[tab] Unpublished (39746)
[table]
  [row] [columnheader] Item Name | [columnheader] SKU | [columnheader] Status | [columnheader] Price
  [row] [cell] Laptop Pro 15 | [cell] sku-0082 | [cell] Unpublished | [cell] $299.00
  [row] [cell] Widget B | [cell] sku-1234 | [cell] Published | [cell] $49.99
[status] Showing 1-25 of 114,827 items

That's ~750-1,250 tokens for a page that would be ~37,500 tokens as raw HTML. A 30-50x reduction with very little loss of the semantic content the LLM actually needs.

This isn't a novel idea. Playwright MCP uses accessibility snapshots for its browser_snapshot tool. Claude's Chrome extension uses a DOM walker for its read_page tool. AgentOccam showed that plain a11y trees match or beat vision-augmented approaches on web agent benchmarks. The a11y tree is emerging as the standard representation for giving LLMs page comprehension — and for good reason.

But most implementations stop at "just use the a11y tree." What follows isn't a collection of novel inventions — role classification is how browsers already compute the tree, name resolution is the W3C spec, and table deduplication is common sense once you see the token waste. Individually, none of these are surprising. But nobody writes down the full list of things you actually need to handle before an a11y tree works reliably in production. We learned each of these the hard way, so you don't have to.

Trick 1: Role Classification Eliminates Div Soup

Modern web apps are drowning in wrapper divs. A single button might be nested 10-20 levels deep:

<div class="wrapper">
  <div class="inner">
    <div class="container">
      <div class="btn-group">
        <button>Edit Item</button>
        <button>Delete Item</button>
      </div>
    </div>
  </div>
</div>

If you naively walk the DOM and emit every element, your output is mostly indentation and empty wrapper lines. The fix is a three-way role classification system:

Leaf roles (terminal nodes — extract name, stop recursing): button, link, textbox, checkbox, radio, img, switch, slider, menuitem

Container roles (structural — recurse into children): table, dialog, navigation, form, grid, tablist, menu, region

Transparent containers (invisible wrappers — skip element, promote children): any <div>, <span>, or element with no semantic role

With transparent container promotion, the deeply nested example above collapses to:

[button] Edit Item
[button] Delete Item

The leaf/container distinction is equally important. When the walker hits a <button>, it extracts the button's accessible name and stops. It doesn't descend into the button's inner <span> + <svg> icon structure to produce noise. The accessible name already captures what matters.

This three-way classification is what lets us fit a complex admin page into ~500 nodes and ~1,000 tokens. Without it, you'd blow through any reasonable token budget on structural noise alone.
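
The three-way classification can be sketched as a short recursive walker. The sketch runs over a simplified, hypothetical node shape (SketchNode) rather than real DOM elements, uses only the role lists given above, and treats every unlisted role as transparent, which is a simplification.

```typescript
type RoleKind = "leaf" | "container" | "transparent";

// Simplified stand-in for a DOM element (hypothetical).
interface SketchNode {
  role: string | null;
  name: string;
  children: SketchNode[];
}

const LEAF_ROLES = new Set([
  "button", "link", "textbox", "checkbox", "radio",
  "img", "switch", "slider", "menuitem",
]);
const CONTAINER_ROLES = new Set([
  "table", "dialog", "navigation", "form", "grid", "tablist", "menu", "region",
]);

function classify(role: string | null): RoleKind {
  if (role !== null && LEAF_ROLES.has(role)) return "leaf";
  if (role !== null && CONTAINER_ROLES.has(role)) return "container";
  return "transparent";
}

function walkTree(node: SketchNode, depth = 0, out: string[] = []): string[] {
  const indent = "  ".repeat(depth);
  const kind = classify(node.role);

  if (kind === "leaf") {
    // Emit the accessible name and stop; inner spans and icons are noise.
    out.push(`${indent}[${node.role}] ${node.name}`);
    return out;
  }

  // Containers emit a line and indent their children; transparent
  // wrappers emit nothing and promote children to the current depth.
  if (kind === "container") out.push(`${indent}[${node.role}]`);
  const childDepth = kind === "container" ? depth + 1 : depth;
  for (const child of node.children) walkTree(child, childDepth, out);
  return out;
}
```

Feeding it the nested wrapper example from above yields just the two button lines, because each wrapper div is transparent and contributes neither a line nor an indent level.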

Trick 2: Name Resolution — Six Fallbacks Before Giving Up

Determining what to call each element is harder than it sounds. We use a priority chain loosely modeled on the W3C accessible name computation (the spec itself ranks aria-labelledby above aria-label; our chain checks aria-label first):

Priority	Source	Example
1	`aria-label`	`<button aria-label="Close dialog">X</button>` -> "Close dialog"
2	`aria-labelledby`	References another element's text by ID
3	`alt` attribute	`<img alt="Product photo">` -> "Product photo"
4	`<label for="...">`	Associated form label
5	`placeholder`	Input placeholder text
6	`title` attribute	Tooltip fallback
7	Text content	`<button>Save Changes</button>` -> "Save Changes"

The good news: most well-built apps work fine at priority 7. Buttons have visible text, headings have visible text, links have visible text. ARIA attributes are enhancements, not requirements. If your app uses a component library with semantic HTML (<button>, <table>, <h2>, <a href>), the walker assigns correct roles automatically — no ARIA needed.
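
As a sketch, the priority chain reduces to "first non-empty candidate wins". The NameSources shape below is hypothetical: it assumes the DOM lookups (resolving aria-labelledby IDs, finding the associated label) have already happened elsewhere, and it follows the order in the table above rather than the full W3C algorithm.

```typescript
// Hypothetical pre-resolved inputs, one per row of the priority table.
interface NameSources {
  ariaLabel?: string;
  labelledByText?: string; // text of the element(s) referenced by aria-labelledby
  alt?: string;
  labelText?: string;      // text of the associated <label>
  placeholder?: string;
  title?: string;
  textContent?: string;
}

function resolveName(s: NameSources): string | null {
  const chain = [
    s.ariaLabel,
    s.labelledByText,
    s.alt,
    s.labelText,
    s.placeholder,
    s.title,
    s.textContent,
  ];
  for (const candidate of chain) {
    // Collapse whitespace so multi-line markup yields a single-line name.
    const cleaned = (candidate || "").replace(/\s+/g, " ").trim();
    if (cleaned) return cleaned;
  }
  return null; // all seven sources were empty
}
```

A null result is the signal the walker can use to drop the element, since a nameless control gives the LLM nothing to refer to.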

Trick 3: Visual Cue Annotations

Here's a gap that surprised us. The a11y tree captures semantic structure perfectly — but it doesn't capture visual presentation. During testing, a user asked "what is this blue alert?" about an info banner. The LLM couldn't identify it because the a11y tree rendered it as plain text with no color or severity metadata.

The same problem hits status badges ("Published" is green, "Error" is red), highlighted rows, and icon-only indicators. The user sees color-coded meaning; the LLM sees flat text.

The solution: a static map of CSS selectors to semantic annotations, checked via Element.matches() during the tree walk.

const VISUAL_CUE_MAP: Array<[string, string]> = [
  ['.alert-info, .banner--info',         'visual=blue-info-banner'],
  ['.alert-warning, .banner--warning',   'visual=yellow-warning-banner'],
  ['.alert-error, .banner--error',       'visual=red-error-banner'],
  ['.badge--success, .status--active',   'visual=green-badge'],
  ['.badge--error, .status--error',      'visual=red-badge'],
];

function getVisualCue(el: Element): string | null {
  for (const [selector, annotation] of VISUAL_CUE_MAP) {
    if (el.matches(selector)) return annotation;
  }
  return null;
}

The enriched output:

[region visual=blue-info-banner] This item requires attention. Review the listing.
[cell visual=green-badge] Published
[cell visual=red-badge] Error

Negligible runtime cost (CSS selector matching is near-instant), fully deterministic, and the LLM prompt can explain what each annotation means. The trade-off is maintaining the selector map as the design system evolves, but that's a small price for giving the LLM color awareness.

Token impact is negligible: ~2-5 extra tokens per annotated element, ~20-100 total per page.

Trick 4: Hidden Controls Discovery via ARIA Hints

Here's a problem unique to rich web applications: many critical controls are hidden. Dropdown menus, popup editors, modals, side drawers — they only exist in the DOM after a trigger element is clicked. The a11y tree captures the trigger but not what it opens.

On a single catalog page, we found 9 distinct hidden control types: item detail popups, inline price editors, action menus (edit/retire/delete), shipping configuration panels, lifecycle filters, price range filters, fulfillment type filters, sort drawers, and a full filter panel with 19 expandable sections.

The a11y tree sees: [button collapsed] $ 100.33. It doesn't know that clicking it opens a pricing editor with competitive pricing data, a base price input, and Apply/Close buttons.

The partial solution comes from ARIA attributes that are already in the DOM:

[button collapsed haspopup=menu] $ 100.33
[button collapsed haspopup=dialog] --
[button collapsed haspopup=listbox] Lifecycle

aria-haspopup tells you something is behind the button. aria-controls can reference the target element by ID. The LLM now knows enough to say "click the price value — it opens a pricing menu" instead of giving generic instructions.
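
A sketch of how those ARIA hints might be folded into a tree line follows. AriaAttrsLike and describeTrigger are hypothetical names, and the controls= token is an addition for illustration that does not appear in the sample output above. Treating aria-haspopup="true" as menu follows the ARIA definition of that value.

```typescript
// Hypothetical structural type; a DOM Element has getAttribute.
interface AriaAttrsLike {
  getAttribute(attr: string): string | null;
}

function describeTrigger(role: string, label: string, el: AriaAttrsLike): string {
  const parts = [role];

  // aria-expanded: only emit a state when the attribute is present.
  const expanded = el.getAttribute("aria-expanded");
  if (expanded === "true") parts.push("expanded");
  else if (expanded === "false") parts.push("collapsed");

  // aria-haspopup: "true" is equivalent to "menu"; "false" means no popup.
  const popup = el.getAttribute("aria-haspopup");
  if (popup && popup !== "false") {
    parts.push("haspopup=" + (popup === "true" ? "menu" : popup));
  }

  // aria-controls: ID of the element the trigger opens (illustrative token).
  const controls = el.getAttribute("aria-controls");
  if (controls) parts.push("controls=" + controls);

  return `[${parts.join(" ")}] ${label}`;
}
```

Because every hint is read from attributes already in the DOM, this costs a few getAttribute calls per trigger and needs no per-page configuration.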

For high-value pages, we layer a static action catalog on top — a JSON registry mapping trigger types to available actions:

const KNOWN_ACTIONS = {
  "price-editor": {
    trigger: "Click price cell in table",
    contains: ["Base price (editable)", "Competitive price",
               "Buy Box price", "Active pricing programs"],
    actions: ["Update base price", "View competitive pricing"]
  },
  "actions-menu": {
    trigger: "Click three-dot icon in row",
    contains: ["Edit item", "Retire item", "Delete item"],
    actions: ["Edit", "Retire from marketplace", "Delete permanently"]
  }
};

The ARIA enrichment is automatic (works on every page). The action catalog is manual but provides specifics for the pages that matter most. Together, they bridge the gap between "I see a clickable element" and "I know what it does."

Trick 5: The Stale Snapshot Problem

This one bit us hard. The a11y tree is captured at a point in time — but if you capture at page load, you get the loading state. Skeleton screens. Spinner text. "Loading..." placeholders.

Here's the timeline of the bug:

T=0ms     User navigates to /catalog
T=200ms   Page shell renders (skeleton UI)
T=300ms   Data fetch fires (GET /api/items)
T=1000ms  A11y tree captured -> gets "Loading..." skeleton
T=1500ms  API response arrives -> React renders actual table
T=2000ms  User asks "What items do I have?"
T=2000ms  Message sent with stale "Loading..." snapshot

Our initial approach was a 1000ms delay after navigation plus a MutationObserver that re-extracted on significant DOM changes (5+ added/removed nodes). But the MutationObserver had its own 1500ms debounce. With the real table rendering at T=1500ms, the re-extraction could not fire before T=3000ms, a full second after the user's message had already gone out with the stale context.

The fix was conceptually simple: re-extract at the moment the user sends a message, not at page load. When the user hits Send, the frontend captures a fresh a11y tree snapshot synchronously (~5-10ms on 500 nodes) and attaches it to the message payload. The snapshot is always current because it reflects exactly what the user sees when they ask their question.

We kept the background extraction as a pre-cache for proactive features, but the on-send extraction always wins for message context. The MutationObserver still monitors for table row additions (a good heuristic for "data just loaded") to keep the background cache fresh, but it's no longer the critical path.
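The split between the background cache and the on-send path can be sketched as follows. `createContextProvider` and its method names are hypothetical, and `extract` stands in for whatever function produces the a11y snapshot; the point is only that the send path never trusts the cache.

```javascript
// Hypothetical wrapper around an extractor function (e.g. the a11y tree walk).
function createContextProvider(extract) {
  let cached = null;

  return {
    // Background path: call from navigation hooks or a MutationObserver callback.
    refreshCache() {
      cached = extract();
      return cached;
    },

    // Read by proactive features that can tolerate a slightly stale snapshot.
    getCached() {
      return cached;
    },

    // Critical path: always re-extract at the moment the user hits Send,
    // so the payload reflects what is on screen right now.
    buildMessage(text) {
      cached = extract();
      return { text, pageContext: cached };
    },
  };
}
```

Because extraction is synchronous and cheap, running it again on every send costs little and removes the timing race entirely.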

Trick 6: Structured Data Extraction and Table Deduplication

The a11y tree handles layout and UI state well, but for data tables it represents values positionally — the LLM has to count cells to figure out which column a value belongs to. Ask "what's the price of Laptop Pro 15?" and the model needs to count across: Item Name, SKU, Status, Price. For a table with 25 rows and 11 columns, this is error-prone.

The fix: for pages with data tables, extract a parallel structured data representation — read thead for headers, map each tbody row by position, and output clean JSON with explicit header-to-value mapping:

{"headers": ["Item Name", "SKU", "Status", "Price"],
 "data": [
   {"Item Name": "Laptop Pro 15", "SKU": "sku-0082", "Status": "Unpublished", "Price": "$299.00"},
   {"Item Name": "Widget B", "SKU": "sku-1234", "Status": "Published", "Price": "$49.99"}
 ]}

Now the LLM doesn't count — it reads "Price": "$299.00" directly.
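A minimal sketch of that extraction, assuming a standard HTML table with a `thead` and `tbody`. The names `rowsToRecords` and `extractTable` are hypothetical; the mapping logic is kept in a pure function so it can be exercised without a DOM.

```javascript
// Pure mapping step: pair each cell with its header by position.
// Missing cells become empty strings; duplicate header names would collide,
// which this sketch does not try to handle.
function rowsToRecords(headers, rows) {
  return {
    headers,
    data: rows.map((cells) =>
      Object.fromEntries(headers.map((header, i) => [header, cells[i] ?? '']))
    ),
  };
}

// DOM step (browser only): read header and cell text from a <table> element.
function extractTable(tableEl) {
  const text = (el) => el.textContent.trim();
  const headers = Array.from(tableEl.querySelectorAll('thead th'), text);
  const rows = Array.from(tableEl.querySelectorAll('tbody tr'), (tr) =>
    Array.from(tr.cells, text)
  );
  return rowsToRecords(headers, rows);
}
```

Serializing the result with `JSON.stringify(extractTable(tableEl))` yields the shape shown above.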

But this creates a duplication problem. The table data now appears in both the a11y tree and the structured JSON. On a catalog page with 25 rows, that wastes ~400-700 tokens — 35-40% of the combined payload.

The fix is conditional exclusion: when the structured data extractor succeeds, skip table-related roles during the a11y tree walk.

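// ARIA roles that make up a data table. These are skipped during the walk
// when the structured JSON already carries the rows.
const TABLE_ROLES = new Set([
  'table', 'rowgroup', 'row', 'columnheader',
  'cell', 'rowheader', 'gridcell'
]);

// structuredData is falsy when no extractor is registered or extraction failed.
let a11yResult;
if (structuredData) {
  a11yResult = buildA11yTreeSnapshot(root, { excludeRoles: TABLE_ROLES });
} else {
  a11yResult = buildA11yTreeSnapshot(root); // full tree as fallback
}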

The a11y tree shrinks from 7,618 chars to 505 chars (93% reduction). Total prompt length drops 35%. The table data lives exclusively in the structured JSON where the LLM has explicit header-to-value mapping — no positional counting needed.

The key is making it conditional. Pages without a registered extractor still get the full a11y tree as their only table representation. The skip only activates when structured data provides a superior alternative.
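For illustration, here is how an `excludeRoles` option might behave inside the walk. This is a simplified stand-in that operates on plain `{ role, name, children }` objects rather than real DOM nodes, and it assumes an excluded node takes its whole subtree with it; the production `buildA11yTreeSnapshot` may differ on both points.

```javascript
// Hypothetical walker over plain objects: returns one indented line per kept node.
function snapshotLines(node, { excludeRoles = new Set() } = {}, depth = 0) {
  // Assumption: skipping a node also skips everything beneath it.
  if (excludeRoles.has(node.role)) return [];

  const line = `${'  '.repeat(depth)}[${node.role}] ${node.name || ''}`.trimEnd();
  const childLines = (node.children || []).flatMap((child) =>
    snapshotLines(child, { excludeRoles }, depth + 1)
  );
  return [line, ...childLines];
}
```

Calling it without options returns the full tree, which matches the fallback branch for pages that have no structured extractor.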

Putting It All Together: The Prompt Template

Every trick above feeds into one thing: the system prompt the LLM actually sees. Here's what the assembled prompt looks like for a data table page (sanitized from our production template):

You are an AI assistant for [platform name].

The user is on a page in the platform. Below is a description of what's
currently visible on their screen.

--- Structured Data (machine-readable table data) ---
Use this for data questions (counting, comparing values, filtering):
<structured_data>
{"headers": ["Item Name", "SKU", "Status", "Price"],
 "data": [
   {"Item Name": "Laptop Pro 15", "SKU": "sku-0082", "Status": "Unpublished", "Price": "$299.00"},
   {"Item Name": "Widget B", "SKU": "sku-1234", "Status": "Published", "Price": "$49.99"}
 ]}
</structured_data>

--- Page Content (layout and UI elements) ---
Use this for layout/navigation questions (where is X, what buttons exist):
<page_content>
[heading level=1] Catalog
[tab selected] All
[tab] Unpublished (39746)
[button collapsed haspopup=menu] Filters
[button collapsed haspopup=menu] Sort
[searchbox] Search items
[status] Showing 1-25 of 114,827 items
</page_content>

--- Hidden Controls (not visible in page content above) ---
These controls exist but appear only after clicking a trigger element.
<hidden_controls>
- Click any item name to open a detail popup showing full item info,
  images, and status history
- Click a price cell to open a pricing editor with base price,
  competitive pricing, and Buy Box data
- Click the three-dot icon on any row for actions: Edit, Retire, Delete
- Click "Filters" to expand 19 filter sections: Lifecycle, Price Range,
  Fulfillment Type, etc.
- Click "Sort" to choose: Item Name, Price, Status, Date Created
- Type in the search box to filter items by name, SKU, or GTIN
</hidden_controls>

Page Name: Catalog

A few things to notice:

Two representations, not one. Structured data (JSON) handles precise data questions — "what's the cheapest item?" requires comparing values, which JSON makes trivial. The a11y tree (text) handles spatial questions — "what tabs are available?" or "is there a search box?" The LLM is told which to use for what.

Section headers are instructions. "Use this for data questions" and "Use this for layout questions" aren't decorative — they steer the model's attention. Without them, the LLM sometimes ignores the structured data and tries to answer data questions from the a11y tree text, which requires positional counting and fails on large tables.

The a11y tree is deduplicated. Notice the page content section has no table rows — those live exclusively in the structured JSON. This is Trick 6 in action, saving ~35% of the combined token budget.

Hidden controls fill the gap. The a11y tree shows [button collapsed haspopup=menu] Filters but can't describe what's inside. The hidden controls section tells the LLM there are 19 filter sections — so it can say "click Filters to access Lifecycle, Price Range..." instead of "there's a Filters button."

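Total cost: ~1,500-2,500 tokens for a page that would be ~37,500 tokens as raw HTML. For the fully assembled prompt that works out to roughly a 15-25x reduction; the 30-50x figure applies to the a11y tree on its own (~750-1,250 tokens). Either way the context is preserved: structured data for precision, a11y tree for layout, hidden controls for discoverability.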
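The assembly itself can be sketched as a small function that emits each section only when it has content. The section titles and hints are copied from the template above; the function name `buildPrompt` and its parameter names are assumptions, not the production code.

```javascript
// Hypothetical assembler for the prompt template shown above.
function buildPrompt({ platform, pageName, pageContent, structuredData = null, hiddenControls = [] }) {
  // Each section is a header, a usage hint, and a tagged body.
  const section = (title, hint, tag, body) =>
    [`--- ${title} ---`, hint, `<${tag}>`, body, `</${tag}>`].join('\n');

  const parts = [
    `You are an AI assistant for ${platform}.`,
    "The user is on a page in the platform. Below is a description of what's\n" +
      'currently visible on their screen.',
  ];

  // Only present when a structured extractor succeeded for this page.
  if (structuredData) {
    parts.push(section(
      'Structured Data (machine-readable table data)',
      'Use this for data questions (counting, comparing values, filtering):',
      'structured_data',
      JSON.stringify(structuredData)
    ));
  }

  parts.push(section(
    'Page Content (layout and UI elements)',
    'Use this for layout/navigation questions (where is X, what buttons exist):',
    'page_content',
    pageContent
  ));

  if (hiddenControls.length > 0) {
    parts.push(section(
      'Hidden Controls (not visible in page content above)',
      'These controls exist but appear only after clicking a trigger element.',
      'hidden_controls',
      hiddenControls.map((line) => `- ${line}`).join('\n')
    ));
  }

  parts.push(`Page Name: ${pageName}`);
  return parts.join('\n\n');
}
```

Omitting empty sections keeps the prompt honest: the model is never told to consult structured data that is not there.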

This is the extraction side of the problem. How you use this context — query classification, routing, RAG enrichment — is up to your agent architecture.

What I'd Do Differently

Start with on-send extraction, not page-load extraction. We spent cycles debugging stale snapshot timing issues that would have been avoided entirely by capturing at message-send time from the start.

Build the structured data extractor as generic, not page-specific. Our first extractor was custom for one page type. But the logic — read headers from thead th, map row cells by position — works on any standard HTML table. A generic table auto-detector would have covered 80% of pages with zero per-page work.

Don't skip SVGs entirely. We initially skipped all <svg> elements as "visual-only." But many convey meaning — checkmarks, warning triangles, info circles. Checking for aria-label, parent labels, and <title> elements recovers semantic meaning from icons that would otherwise produce zero output.
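The SVG recovery described above might look like the sketch below. The function name `svgLabel` and the order of the checks (own `aria-label`, then a child `<title>`, then the nearest labelled ancestor) are assumptions; the `aria-hidden` guard is an addition that respects icons explicitly marked as decorative.

```javascript
// Hypothetical name resolution for an <svg> element. Returns null when the
// icon carries no recoverable meaning.
function svgLabel(svg) {
  // Icons explicitly hidden from assistive technology stay hidden.
  if (svg.getAttribute('aria-hidden') === 'true') return null;

  const own = (svg.getAttribute('aria-label') || '').trim();
  if (own) return own;

  const title = svg.querySelector('title');
  const titleText = title ? title.textContent.trim() : '';
  if (titleText) return titleText;

  // Fall back to the nearest ancestor with a label, e.g. an icon-only button.
  const labelled = svg.parentElement ? svg.parentElement.closest('[aria-label]') : null;
  const inherited = labelled ? (labelled.getAttribute('aria-label') || '').trim() : '';
  return inherited || null;
}
```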

The Numbers

For a typical data table page with 25 rows:

Representation	Size	Tokens
Raw HTML	~150 KB	~37,500
DOM JSON (pruned)	~20-50 KB	~5,000-12,500
Markdown	~10-20 KB	~2,500-5,000
A11y tree	~3-5 KB	~750-1,250
A11y tree + structured JSON	~5-8 KB	~1,250-2,000
A11y tree (deduplicated) + JSON	~3-5 KB	~800-1,250

The a11y tree approach gives us a 30-50x reduction over raw HTML while preserving everything the LLM needs: semantic roles, interactive states, element names, and current values. The deduplication trick shaves another 35% when structured extractors are available.
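The table's token counts are consistent with a rule of thumb of about 4 characters per token and 1 KB = 1,000 characters. Both are assumptions inferred from the numbers, not a stated method, and real tokenizers vary. Under those assumptions the 30-50x range can be reproduced directly:

```javascript
// Assumption: ~4 characters per token, 1 KB = 1,000 characters.
const CHARS_PER_TOKEN = 4;
const estimateTokens = (kilobytes) => Math.round((kilobytes * 1000) / CHARS_PER_TOKEN);

const rawHtmlTokens = estimateTokens(150);   // 37500
const a11yLowTokens = estimateTokens(3);     // 750
const a11yHighTokens = estimateTokens(5);    // 1250

// Reduction factor relative to raw HTML.
const reductionVsRawHtml = (tokens) => rawHtmlTokens / tokens;

console.log(reductionVsRawHtml(a11yHighTokens)); // 30
console.log(reductionVsRawHtml(a11yLowTokens));  // 50
```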

Extraction takes <10ms on the main thread for 500 nodes. No external dependencies. No vision models. No API calls. Just a recursive DOM walk using the same principles screen readers have relied on for years.

