<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gaurav Chodwadia</title>
    <description>The latest articles on DEV Community by Gaurav Chodwadia (@gauravchodwadia).</description>
    <link>https://dev.to/gauravchodwadia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3872536%2F21d72353-305d-4477-b6d0-a63729b674aa.jpeg</url>
      <title>DEV Community: Gaurav Chodwadia</title>
      <link>https://dev.to/gauravchodwadia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gauravchodwadia"/>
    <language>en</language>
    <item>
      <title>Giving AI Agents Eyes (Part 2): From Page Snapshots to Interaction Traces</title>
      <dc:creator>Gaurav Chodwadia</dc:creator>
      <pubDate>Mon, 18 May 2026 17:50:19 +0000</pubDate>
      <link>https://dev.to/gauravchodwadia/giving-ai-agents-eyes-part-2-from-page-snapshots-to-interaction-traces-24hd</link>
      <guid>https://dev.to/gauravchodwadia/giving-ai-agents-eyes-part-2-from-page-snapshots-to-interaction-traces-24hd</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/gauravchodwadia/giving-ai-agents-eyes-part-1-6-tricks-for-reading-web-pages-without-vision-models-4193"&gt;Part 1&lt;/a&gt;, we solved the representation problem: how to give an LLM a compact, semantic view of a web page using accessibility trees. That gave our AI agent the ability to answer "what is on this page?"&lt;/p&gt;

&lt;p&gt;But users don't ask that. They ask "what did I just click?" and "what was in that popup I closed?" Here's the gap, concretely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T=0s    User clicks "Blue Widget" in a data table
T=1s    Popup appears with item details (price, status, GTIN)
T=3s    User closes popup
T=5s    User opens AI assistant, asks "what's wrong with this item?"
T=5s    Assistant captures page state: sees 25 items in table, no popup
T=5s    Assistant has zero context about which item or what was in the popup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A page snapshot is a photograph. The user is asking about a video. The agent needs two things it doesn't have: (1) a log of recent interactions (the user clicked "Blue Widget"), and (2) snapshots of ephemeral UI that no longer exists (the popup that showed price $24.99 and status Suppressed).&lt;/p&gt;

&lt;p&gt;Session replay tools (rrweb, FullStory, LogRocket) solve a related problem, but they produce DOM serialization data designed for visual playback — not semantic descriptions for LLM consumption. You need ~200 tokens of natural language, not 200 KB of mutation records.&lt;/p&gt;

&lt;p&gt;This post covers how we built a user-activity tracker from inside a Module Federation remote that doesn't own the host page, three bugs that compounded into a "one behind" symptom, and how we serialize 20 interaction events into a few hundred tokens of LLM-friendly text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategy Pattern for Extensible Event Capture
&lt;/h2&gt;

&lt;p&gt;Click events are one source — but probably not the only one we'll ever want. The host app might later emit structured events via a PubSub system; a future host could fire shell-level navigation events that we'd want to record too. We didn't want to rewrite the tracker the day either of those landed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                     ActivitySource (interface)
                    /         |          \
                   /          |           \
        DomActivitySource  HostEventSource  HybridSource
        (capture-phase       (PubSub from    (both, with
         click listener)      host app)       priority rules)
                   \          |          /
                    \         |         /
                     useActivityTracker (hook)
                             |
                     Feature flag selects source at runtime:
                     "dom" | "host" | "hybrid" | "off"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interface is minimal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ActivitySource&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;getTrace&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nx"&gt;UserActivityTrace&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nf"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;onEvent&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// callback when a new event is captured&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sources are plain classes; the React layer is one thin hook (&lt;code&gt;useActivityTracker&lt;/code&gt;) that owns construction and cleanup. Hooks would have worked too — this isn't a religious choice. We went with classes for three pragmatic reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Composability without React.&lt;/strong&gt; &lt;code&gt;HybridSource&lt;/code&gt; creates a &lt;code&gt;DomActivitySource&lt;/code&gt; and &lt;code&gt;HostEventSource&lt;/code&gt; internally and merges their outputs. Doable with hooks, but composition requires wrapper components or hook-forwarding patterns; instantiating two class instances and merging is simpler.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Testability without rendering.&lt;/strong&gt; You can instantiate a &lt;code&gt;DomActivitySource&lt;/code&gt; in a unit test with jsdom, call &lt;code&gt;start()&lt;/code&gt;, simulate clicks, and assert on &lt;code&gt;getTrace()&lt;/code&gt; — no &lt;code&gt;renderHook&lt;/code&gt;, no &lt;code&gt;act()&lt;/code&gt;, no React tree.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Future-proofing for non-React consumers.&lt;/strong&gt; If the tracker ever needs to run in a worker, a vanilla shell, or a non-React MFE, a plain class moves over unchanged.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
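
&lt;p&gt;To make the first point concrete, here is a minimal sketch of how a hybrid source could compose two others with no React involved. The &lt;code&gt;ActivityEvent&lt;/code&gt; and &lt;code&gt;UserActivityTrace&lt;/code&gt; shapes, the constructor injection, and the merge-by-timestamp rule are assumptions for illustration; the real &lt;code&gt;HybridSource&lt;/code&gt; described above builds its children internally and applies priority rules that this sketch leaves out.&lt;/p&gt;

```typescript
// Hypothetical shapes: the post does not show ActivityEvent or UserActivityTrace.
interface ActivityEvent {
  timestamp: number; // epoch milliseconds
  role: string;
  name: string;
}

interface UserActivityTrace {
  events: ActivityEvent[]; // newest first
}

interface ActivitySource {
  start(): void;
  stop(): void;
  getTrace(): UserActivityTrace;
  clear(): void;
  onEvent?: () => void;
}

// Composes two sources without React: start both, merge on read.
class HybridSource implements ActivitySource {
  onEvent?: () => void; // declared explicitly on the class

  constructor(
    private readonly dom: ActivitySource,
    private readonly host: ActivitySource,
    private readonly limit = 20,
  ) {
    // Forward child notifications to whoever subscribed to the hybrid.
    this.dom.onEvent = () => this.onEvent?.();
    this.host.onEvent = () => this.onEvent?.();
  }

  start(): void {
    this.dom.start();
    this.host.start();
  }

  stop(): void {
    this.dom.stop();
    this.host.stop();
  }

  clear(): void {
    this.dom.clear();
    this.host.clear();
  }

  getTrace(): UserActivityTrace {
    const merged = [...this.dom.getTrace().events, ...this.host.getTrace().events];
    merged.sort((a, b) => b.timestamp - a.timestamp); // newest first
    return { events: merged.slice(0, this.limit) };
  }
}
```

&lt;p&gt;Because the children are passed in, a unit test can hand the hybrid two in-memory fakes and assert on the merged trace without jsdom or a React tree.&lt;/p&gt;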

&lt;h2&gt;
  
  
  Capture-Phase Click Listeners
&lt;/h2&gt;

&lt;p&gt;Our AI assistant is loaded into a large SaaS dashboard as a &lt;a href="https://module-federation.io/" rel="noopener noreferrer"&gt;Module Federation&lt;/a&gt; remote. It's not an iframe — it shares the same &lt;code&gt;document&lt;/code&gt; as the host application. This is the architectural fact that makes activity tracking possible.&lt;/p&gt;

&lt;p&gt;Because we share the DOM, a single listener captures every click on the host page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEventListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;click&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;handleClick&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;capture&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The capture phase (&lt;code&gt;{ capture: true }&lt;/code&gt;) is important. A capture-phase listener on &lt;code&gt;document&lt;/code&gt; fires before the target element's own click handlers, so a bubble-phase &lt;code&gt;stopPropagation()&lt;/code&gt; call in the host app cannot prevent us from seeing the event. (Caveat: the host can still hide events from us in two ways: a capture-phase listener on &lt;code&gt;window&lt;/code&gt; that calls &lt;code&gt;stopPropagation()&lt;/code&gt;, or a capture-phase listener registered on &lt;code&gt;document&lt;/code&gt; before ours that calls &lt;code&gt;stopImmediatePropagation()&lt;/code&gt;. We've never seen either in the wild, since most apps only call &lt;code&gt;stopPropagation()&lt;/code&gt; from inside component handlers, which run in the bubble phase, but the host can blind us if it wants to.)&lt;/p&gt;
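
&lt;p&gt;One practical detail about the one-liner above: &lt;code&gt;bind&lt;/code&gt; returns a new function on every call, so a later &lt;code&gt;removeEventListener&lt;/code&gt; only works if it receives the exact reference that was registered. Below is a minimal sketch of a start/stop pair under that constraint. &lt;code&gt;ClickSource&lt;/code&gt;, &lt;code&gt;ClickTarget&lt;/code&gt;, and &lt;code&gt;boundHandleClick&lt;/code&gt; are hypothetical names, and the structural &lt;code&gt;ClickTarget&lt;/code&gt; type is modelled on the two &lt;code&gt;document&lt;/code&gt; methods so the snippet runs outside a browser.&lt;/p&gt;

```typescript
// Hypothetical structural type modelled on document.addEventListener / removeEventListener.
interface ClickTarget {
  addEventListener(
    type: string,
    listener: (event: unknown) => void,
    options?: { capture?: boolean },
  ): void;
  removeEventListener(
    type: string,
    listener: (event: unknown) => void,
    options?: { capture?: boolean },
  ): void;
}

class ClickSource {
  readonly seen: unknown[] = [];

  // Keep one stable reference: removeEventListener only removes the exact
  // function object that was passed to addEventListener.
  private readonly boundHandleClick = (event: unknown): void => this.handleClick(event);

  constructor(private readonly target: ClickTarget) {}

  start(): void {
    this.target.addEventListener("click", this.boundHandleClick, { capture: true });
  }

  stop(): void {
    // The capture flag must match the one used in start().
    this.target.removeEventListener("click", this.boundHandleClick, { capture: true });
  }

  private handleClick(event: unknown): void {
    this.seen.push(event);
  }
}
```

&lt;p&gt;In a browser you would pass &lt;code&gt;document&lt;/code&gt; as the target; in a test, any object with those two methods will do.&lt;/p&gt;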

&lt;p&gt;But we don't want to log every click. A click on a wrapper div, a scroll container, or a decorative icon is noise. The &lt;code&gt;extractClickDescriptor&lt;/code&gt; function starts by walking up from the event target to find the nearest &lt;em&gt;actionable&lt;/em&gt; element; the walk itself is the &lt;code&gt;findNearestInteractive&lt;/code&gt; helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ACTIONABLE_ROLES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;button&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;link&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;menuitem&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;tab&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;checkbox&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;radio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;combobox&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;textbox&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;searchbox&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;switch&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;row&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;findNearestInteractive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Element&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;Element&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Skip our own UI and opted-out subtrees. Check all ancestors up front:&lt;/span&gt;
  &lt;span class="c1"&gt;// a per-step check would return a nested button before reaching the container.&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;closest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-assistant-container="true"], [data-no-track]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Element&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;target&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getRole&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;ACTIONABLE_ROLES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;role&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;el&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;parentElement&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The walk-up pattern is essential because the actual click target is usually a child of the interactive element — a &lt;code&gt;&amp;lt;span&amp;gt;&lt;/code&gt; inside a &lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt;, an &lt;code&gt;&amp;lt;svg&amp;gt;&lt;/code&gt; inside a link. The allowlist focuses on ARIA roles that represent user-initiated actions, which eliminates the vast majority of noise.&lt;/p&gt;

&lt;p&gt;Additional noise filters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Debounce&lt;/strong&gt; — Same element clicked within 300ms is suppressed (double-click)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name required&lt;/strong&gt; — Elements with no accessible name and no text content are ignored&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container guard&lt;/strong&gt; — Clicks inside the AI panel itself are excluded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result on a typical data-table page: roughly one actionable event per real click (occasionally two when a row + nested button both qualify), out of dozens of raw click events the page absorbs. No firehose.&lt;/p&gt;
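
&lt;p&gt;The debounce and name-required filters listed above can be sketched as two small pure helpers. This is an illustration under assumptions, not the production code: &lt;code&gt;ClickDebouncer&lt;/code&gt; and &lt;code&gt;hasUsableName&lt;/code&gt; are hypothetical names, the element is identified by whatever key the caller passes (the element object itself works), and the 300ms window is measured from the previous click on that element whether or not it was recorded.&lt;/p&gt;

```typescript
// Suppresses repeat clicks on the same element inside a short window.
class ClickDebouncer {
  private lastKey: unknown = undefined;
  private lastTimeMs = Number.NEGATIVE_INFINITY;

  constructor(private readonly windowMs = 300) {}

  // Returns true when the click should be recorded.
  shouldRecord(elementKey: unknown, nowMs: number): boolean {
    const sameElement = elementKey === this.lastKey;
    const withinWindow = this.windowMs > nowMs - this.lastTimeMs;
    // Remember every click, recorded or not, so rapid repeats keep being suppressed.
    this.lastKey = elementKey;
    this.lastTimeMs = nowMs;
    if (sameElement) {
      if (withinWindow) return false;
    }
    return true;
  }
}

// "Name required": ignore elements with neither an accessible name nor text content.
function hasUsableName(accessibleName: string | null, textContent: string | null): boolean {
  return (accessibleName ?? "").trim().length > 0 || (textContent ?? "").trim().length > 0;
}
```

&lt;p&gt;Passing the clock in as &lt;code&gt;nowMs&lt;/code&gt; keeps the filter deterministic in tests; in the click handler it would typically be &lt;code&gt;Date.now()&lt;/code&gt; or &lt;code&gt;event.timeStamp&lt;/code&gt;.&lt;/p&gt;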

&lt;h2&gt;
  
  
  Three Bugs That Compounded
&lt;/h2&gt;

&lt;p&gt;The symptom looks like one bug. It's three.&lt;/p&gt;

&lt;p&gt;Your activity trace is always one interaction behind. The user clicks Item A, asks the assistant, and the assistant either sees nothing or sees the &lt;em&gt;previous&lt;/em&gt; click. That's the single symptom we chased.&lt;/p&gt;

&lt;p&gt;What we actually had was three failure modes of the same underlying mismatch — synchronous DOM events meeting asynchronous React state. Each gave the bug somewhere to hide. Fixing any one of them in isolation didn't move the needle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The setup:&lt;/strong&gt; The remote MFE captures DOM events synchronously. The captured data flows into React state, which flows through React Context to a consumer component in a different MFE (the AI chat UI), which reads the context when the user sends a message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 1: Conditional trace inclusion.&lt;/strong&gt; Page-content extraction fires automatically under three triggers (initial page load, significant DOM mutations, and route navigation) and once more, with the trigger &lt;code&gt;"message-send"&lt;/code&gt;, when the user sends a message. We initially included the activity trace only when the trigger was &lt;code&gt;"message-send"&lt;/code&gt;. The other triggers produced snapshots &lt;em&gt;without&lt;/em&gt; traces. React deduplication kept the traceless version in state, so the chat UI read an empty trace at send time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fix:&lt;/em&gt; Always include the trace in every extraction, regardless of trigger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 2: No extraction on click events.&lt;/strong&gt; Click events trigger none of the three extraction conditions above. The snapshot already in state has the trace from the &lt;em&gt;last&lt;/em&gt; extraction — which captured clicks before the latest one.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fix:&lt;/em&gt; Add an &lt;code&gt;onEvent&lt;/code&gt; callback to the source interface. When a click is captured, fire the callback. In the page-content hook, the callback patches React state with the latest trace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In the page-content hook&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handleActivityEvent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;useCallback&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;getTraceRef&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nf"&gt;setSnapshot&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;userActivity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In the DOM source's click handler&lt;/span&gt;
&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unshift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;descriptor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onEvent&lt;/span&gt;&lt;span class="p"&gt;?.();&lt;/span&gt;  &lt;span class="c1"&gt;// patch React state immediately&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bug 3: Undeclared class property.&lt;/strong&gt; The &lt;code&gt;ActivitySource&lt;/code&gt; interface declares &lt;code&gt;onEvent?: () =&amp;gt; void&lt;/code&gt;, and the hook assigns it after construction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onEvent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;onEventRef&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;current&lt;/span&gt;&lt;span class="p"&gt;?.();&lt;/span&gt;
&lt;span class="nx"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But the class never &lt;em&gt;declared&lt;/em&gt; &lt;code&gt;onEvent&lt;/code&gt; as a field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// BROKEN — onEvent is not declared on the class&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DomActivitySource&lt;/span&gt; &lt;span class="k"&gt;implements&lt;/span&gt; &lt;span class="nx"&gt;ActivitySource&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ActivityEvent&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="nf"&gt;handleClick&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;MouseEvent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onEvent&lt;/span&gt;&lt;span class="p"&gt;?.();&lt;/span&gt;  &lt;span class="c1"&gt;// always undefined&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In plain JS, &lt;code&gt;source.onEvent = fn&lt;/code&gt; on an instance creates an own property that any later method call would see through &lt;code&gt;this&lt;/code&gt;. We didn't get that — somewhere in our toolchain (TypeScript with &lt;code&gt;useDefineForClassFields&lt;/code&gt;, plus a class-fields transform in the build pipeline), the assignment landed somewhere the class methods didn't read from. The exact mechanism was build-config specific, and chasing it stopped being interesting once we found the one-line fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// FIXED — explicit declaration&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DomActivitySource&lt;/span&gt; &lt;span class="k"&gt;implements&lt;/span&gt; &lt;span class="nx"&gt;ActivitySource&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;onEvent&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;// this line fixes the bug&lt;/span&gt;
  &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ActivityEvent&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Declaring the field made initialization deterministic and the bug went away across every build target. What made it hard to find in the first place: &lt;code&gt;handleClick&lt;/code&gt; works perfectly when you call it directly in a unit test. It only failed in production, where the method was bound and invoked by &lt;code&gt;document.addEventListener&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;General lesson:&lt;/strong&gt; When you implement an interface with optional callback properties, declare them explicitly on the class. Don't rely on whatever your transpiler does with externally-assigned-but-undeclared properties — the semantics are subtle enough that an explicit field is the safer default. One line of TypeScript is cheaper than three days of debugging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token-Efficient Activity Traces for LLMs
&lt;/h2&gt;

&lt;p&gt;A raw JSON array of 20 events runs ~600 tokens once you account for repeated keys, escaped quotes, and structural noise. We compress it to ~300 by serializing to compact natural language:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RECENT USER ACTIONS (newest first):
- 14:32:05 click [button] "Edit Item" (in row: Blue Widget SKU-1234)
- 14:31:58 click [link] "Blue Widget" (in row: Blue Widget SKU-1234)
- 14:31:45 click [button] "Apply" (in toolbar: Filters)
- 14:31:30 click [checkbox] "Active" (in group: Lifecycle filter)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious choice in this format is &lt;strong&gt;row context, not landmark context&lt;/strong&gt;. For data-table apps, "in row: Blue Widget SKU-1234" is far more useful than "in region: main content." The row context tells the LLM &lt;em&gt;which data record&lt;/em&gt; the interaction was about. We extract it by walking up to the nearest &lt;code&gt;&amp;lt;tr&amp;gt;&lt;/code&gt; or &lt;code&gt;[role="row"]&lt;/code&gt; and joining the cell text with pipes (truncated to 60 chars). The user almost always asks about &lt;em&gt;that one row&lt;/em&gt;, and now the LLM knows which one.&lt;/p&gt;
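
&lt;p&gt;The join-and-truncate step can be sketched as a pure function over the row's cell texts. &lt;code&gt;buildRowContext&lt;/code&gt; is a hypothetical name, and the exact separator and whitespace handling are assumptions; collecting the cell texts from the DOM row is left to the caller.&lt;/p&gt;

```typescript
// Joins a row's cell texts with pipes and caps the result at maxChars.
function buildRowContext(cellTexts: string[], maxChars = 60): string {
  const joined = cellTexts
    .map((text) => text.replace(/\s+/g, " ").trim()) // collapse whitespace inside each cell
    .filter((text) => text.length > 0) // drop empty cells
    .join(" | ");
  return joined.length > maxChars ? joined.slice(0, maxChars) : joined;
}
```

&lt;p&gt;Keeping this function free of DOM access means the truncation rule can be tested with plain string arrays.&lt;/p&gt;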

&lt;p&gt;The other choices are more obvious in hindsight: HH:MM:SS timestamps over ISO strings (time of day is enough within a single session, and it saves ~15 chars per line); &lt;code&gt;[button]&lt;/code&gt; / &lt;code&gt;[link]&lt;/code&gt; role brackets matching the same taxonomy the a11y tree uses; newest-first ordering so the most-likely-relevant action is at the top.&lt;/p&gt;
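
&lt;p&gt;Putting those format choices together, a serializer for the line format shown above might look like the sketch below. The &lt;code&gt;ActivityEvent&lt;/code&gt; field names and the &lt;code&gt;serializeTrace&lt;/code&gt; and &lt;code&gt;formatTime&lt;/code&gt; helpers are assumptions for illustration. The serializer in the post runs on the backend and its language is not stated, so treat this TypeScript version as a model of the output format only. Events are assumed to arrive newest first.&lt;/p&gt;

```typescript
// Hypothetical event shape; the real descriptor fields are not shown in the post.
interface ActivityEvent {
  timestamp: Date;
  action: string; // e.g. "click"
  role: string; // e.g. "button"
  name: string; // accessible name
  context?: string; // e.g. "row: Blue Widget SKU-1234"
}

// HH:MM:SS in local time, zero-padded.
function formatTime(date: Date): string {
  const two = (n: number): string => String(n).padStart(2, "0");
  return `${two(date.getHours())}:${two(date.getMinutes())}:${two(date.getSeconds())}`;
}

// One header line, then one line per event.
function serializeTrace(events: ActivityEvent[]): string {
  const lines = events.map((e) => {
    const context = e.context ? ` (in ${e.context})` : "";
    return `- ${formatTime(e.timestamp)} ${e.action} [${e.role}] "${e.name}"${context}`;
  });
  return ["RECENT USER ACTIONS (newest first):", ...lines].join("\n");
}
```

&lt;p&gt;An event without a row or toolbar context simply omits the parenthetical, which keeps short lines short.&lt;/p&gt;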

&lt;p&gt;&lt;strong&gt;The buffer.&lt;/strong&gt; 20 events in a circular ring, oldest evicted first, reset on page navigation. At a leisurely 1 click every 2-5 seconds, 20 covers ~40-100 seconds — enough for a complete workflow (filter, search, click item, browse popup, close popup). Real users sometimes burst faster than that; we cover that case in the limitations section.&lt;/p&gt;
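
&lt;p&gt;A minimal sketch of that buffer, assuming a bounded array is acceptable at this size (at 20 entries, &lt;code&gt;unshift&lt;/code&gt; plus &lt;code&gt;pop&lt;/code&gt; behaves like a ring without the index arithmetic). &lt;code&gt;EventBuffer&lt;/code&gt; and the one-field &lt;code&gt;ActivityEvent&lt;/code&gt; type are hypothetical.&lt;/p&gt;

```typescript
type ActivityEvent = { name: string }; // hypothetical minimal shape

class EventBuffer {
  private events: ActivityEvent[] = [];

  constructor(private readonly capacity = 20) {}

  // Newest first, matching the unshift() used in the click handler.
  add(event: ActivityEvent): void {
    this.events.unshift(event);
    if (this.events.length > this.capacity) {
      this.events.pop(); // evict the oldest entry
    }
  }

  // Call on route navigation.
  clear(): void {
    this.events = [];
  }

  snapshot(): ActivityEvent[] {
    return [...this.events]; // copy so callers cannot mutate the buffer
  }
}
```

&lt;p&gt;Returning a copy from &lt;code&gt;snapshot()&lt;/code&gt; also gives React a fresh array reference each time, which matters when the trace is stored in state.&lt;/p&gt;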

&lt;p&gt;&lt;strong&gt;Token cost.&lt;/strong&gt; A typical event line tokenizes to ~15-25 tokens (timestamps and quoted names split poorly across tokenizers). A full 20-event buffer is ~300-500 tokens; popup snapshots add another ~200-500 each. Total overhead is roughly 8-15% on top of the existing page context payload — small per request, worth multiplying through your DAU × queries before calling it "negligible" at scale.&lt;/p&gt;

&lt;p&gt;The format is injected into the LLM prompt with a framing instruction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;--- Recent User Activity ---
Use this to understand the user's intent and which item/element they are asking about.
&lt;span class="nt"&gt;&amp;lt;user_activity&amp;gt;&lt;/span&gt;
RECENT USER ACTIONS (newest first):
- 14:32:05 click [button] "Edit Item" (in row: Blue Widget SKU-1234)
...
&lt;span class="nt"&gt;&amp;lt;/user_activity&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The framing line ("Use this to understand...") matters more than its character count suggests. In our eval set, removing it noticeably increased the rate at which the model ignored the activity section entirely and answered the page-snapshot question instead of the per-user-action question. Treat the framing as load-bearing, not decorative.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Pipeline
&lt;/h2&gt;

&lt;p&gt;End to end, one click becomes one line in a prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;click → capture-phase listener → extractClickDescriptor() → ring buffer (20)
      → onEvent → setState → React Context → chat UI on send
      → POST with userActivity → backend NL serializer → &amp;lt;user_activity&amp;gt; tag in prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



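&lt;p&gt;Up to the point where the chat UI reads the context, each arrow above is either a synchronous in-process call or a single React render, so the frontend part of the path runs in well under a frame. The only network hop is the POST at send time; serialization to natural language happens on the backend.&lt;/p&gt;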

&lt;h2&gt;
  
  
  Before vs After: What the Assistant Says
&lt;/h2&gt;

&lt;p&gt;Back to the opening sequence — user clicks "Blue Widget", a popup with item details appears, user closes it, then opens the assistant. Here's what the LLM produces with and without the activity trace, given the same page state at message-send time (the table is on screen; the popup is gone):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without activity trace:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;User: &lt;em&gt;what's wrong with this item?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Assistant: I can see your catalog has 25 items. Could you tell me which item you're asking about? You can click an item's row to see details.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;With activity trace (~300 extra tokens in the prompt):&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;User: &lt;em&gt;what's wrong with this item?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Assistant: You recently opened the details popup for &lt;strong&gt;Blue Widget (SKU-1234)&lt;/strong&gt;. Its current status is &lt;strong&gt;Suppressed&lt;/strong&gt; and its price is $24.99. I can see the item's basic fields, but the popup didn't tell me &lt;em&gt;why&lt;/em&gt; it's suppressed — do you want me to check the listing-quality details, or is there a specific error message you saw?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Same page, same model, same prompt template. The only difference is 8 lines of recent-click history plus the popup's a11y snapshot riding in the payload. The agent went from "which item?" to "I know exactly which item and what state it was in." Figuring out &lt;em&gt;why&lt;/em&gt; it's suppressed still requires a separate lookup (a RAG hit or a tool call) — the activity trace gives the agent the &lt;em&gt;what&lt;/em&gt;, not the &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Declare every interface property on the class, not just required ones.&lt;/strong&gt; TypeScript optional properties are tempting to leave undeclared, but the runtime semantics under various class-field transforms are subtle enough that an explicit field declaration is the safer default. Cost: one line. Benefit: a category of bugs you'll never see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design the chat UI as an observer of the activity tracker, not a coordinator.&lt;/strong&gt; Our first cut had the chat UI calling into the activity layer to request a fresh trace at send time, which coupled the two MFEs and introduced a "when is the trace ready?" synchronization problem. The cleaner pattern that emerged: the activity tracker pushes updates whenever something interesting happens; the chat UI just reads whatever state is current at send time. One-way data flow, no cross-MFE coordination. If we were starting over, that's the contract we'd nail down on day one.&lt;/p&gt;
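&lt;p&gt;A minimal sketch of that one-way contract, under stated assumptions: &lt;code&gt;ActivityTracker&lt;/code&gt;, &lt;code&gt;ChatComposer&lt;/code&gt;, and &lt;code&gt;ActivityEvent&lt;/code&gt; are hypothetical names for illustration, not our production classes. The tracker pushes; the composer only reads what it was last given.&lt;/p&gt;

```typescript
// Hypothetical names throughout; this illustrates the contract, not real code.
interface ActivityEvent {
  role: string;
  name: string;
  at: number; // epoch milliseconds
}

type Listener = (events: readonly ActivityEvent[]) => void;

class ActivityTracker {
  private events: ActivityEvent[] = [];
  private listeners: Listener[] = [];

  // Observers register once; the tracker never asks them for anything.
  subscribe(listener: Listener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Called by the DOM listener whenever something interesting happens.
  record(event: ActivityEvent): void {
    this.events = [...this.events, event];
    for (const listener of this.listeners) listener(this.events);
  }
}

class ChatComposer {
  private latest: readonly ActivityEvent[] = [];

  constructor(tracker: ActivityTracker) {
    tracker.subscribe((events) => {
      this.latest = events;
    });
  }

  // Send time is a plain read: no call back into the tracker, no waiting.
  buildPayload(message: string): { message: string; trace: readonly ActivityEvent[] } {
    return { message, trace: this.latest };
  }
}
```

&lt;p&gt;Because &lt;code&gt;buildPayload&lt;/code&gt; never calls into the tracker, there is no "is the trace ready?" question to answer at send time.&lt;/p&gt;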

&lt;p&gt;&lt;strong&gt;Treat MF integration as observation by policy, not capability.&lt;/strong&gt; Module Federation remotes share the host's &lt;code&gt;document&lt;/code&gt; and global scope. Observation (event listeners, mutation observers) is trivial. Interception (patching &lt;code&gt;window.fetch&lt;/code&gt;, wrapping a shared store, replacing a singleton) is &lt;em&gt;also&lt;/em&gt; technically possible — it just couples your remote to the host's internals and breaks the moment the host changes implementation. We adopted "observers only" as a policy after almost building a fetch-patching debug helper that would have rotted within a release. The runtime won't enforce this for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations You Should Know About
&lt;/h2&gt;

&lt;p&gt;Without this section, the post would lose credibility. The pattern works well for our setup, but it has real gaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shadow DOM is invisible.&lt;/strong&gt; A capture-phase listener on &lt;code&gt;document&lt;/code&gt; still fires for clicks inside a shadow root, but &lt;code&gt;event.target&lt;/code&gt; is retargeted to the shadow host at the boundary, for open and closed roots alike, so the tracker records the host element rather than the control that was actually clicked. Walking up the &lt;code&gt;parentElement&lt;/code&gt; chain from inside a shadow tree also stops at the boundary. If the host page uses Web Components (LitElement, Stencil, Salesforce Lightning Web Components, or any design system that mounts into shadow DOM), your tracker will miss most of the detail of user interactions inside those components. The workaround is to read &lt;code&gt;event.composedPath()&lt;/code&gt; instead of &lt;code&gt;event.target&lt;/code&gt; and walk that array, but that only helps with open roots: you still can't reach into a &lt;em&gt;closed&lt;/em&gt; shadow root from the outside.&lt;/p&gt;
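&lt;p&gt;A sketch of the &lt;code&gt;composedPath()&lt;/code&gt; walk, assuming roles come from an explicit &lt;code&gt;role&lt;/code&gt; attribute (implicit roles from tag names are left out for brevity). The &lt;code&gt;PathNode&lt;/code&gt; and &lt;code&gt;ComposedEvent&lt;/code&gt; types are structural stand-ins so the snippet runs outside a browser, and the &lt;code&gt;button&lt;/code&gt; and &lt;code&gt;link&lt;/code&gt; entries in the allowlist are assumptions.&lt;/p&gt;

```typescript
// Structural stand-ins; in a page you would pass the real click event.
interface PathNode {
  getAttribute?: (name: string) => string | null;
}
interface ComposedEvent {
  composedPath(): PathNode[];
}

// Assumed subset of the ACTIONABLE_ROLES allowlist.
const ACTIONABLE_ROLES = new Set(["button", "link", "textbox", "searchbox", "combobox"]);

// The composed path is ordered innermost node first, so the first match is
// the closest actionable element to the actual click target.
function findActionable(event: ComposedEvent): PathNode | null {
  for (const node of event.composedPath()) {
    const getAttr = node.getAttribute;
    // document, window, and shadow roots have no getAttribute; skip them.
    if (getAttr === undefined) continue;
    const role = getAttr.call(node, "role");
    if (role !== null) {
      if (ACTIONABLE_ROLES.has(role)) return node;
    }
  }
  return null;
}
```

&lt;p&gt;For a closed shadow root, the path seen by a listener outside the root simply starts at the host, so this function degrades to the same result as reading &lt;code&gt;event.target&lt;/code&gt;.&lt;/p&gt;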

&lt;p&gt;&lt;strong&gt;The role allowlist captures keystroke targets.&lt;/strong&gt; &lt;code&gt;textbox&lt;/code&gt;, &lt;code&gt;searchbox&lt;/code&gt;, and &lt;code&gt;combobox&lt;/code&gt; are in &lt;code&gt;ACTIONABLE_ROLES&lt;/code&gt;, which means a click on an input followed by typing is recorded with the input's accessible name. If a user's row data — names, addresses, account numbers — appears in the row context we attach to each event, that data is now in the activity trace and ends up in the LLM prompt. Three implications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PII / PHI / PCI:&lt;/strong&gt; if your dashboard shows regulated data, the activity trace will exfiltrate it to whatever model provider you're using. For PHI/PCI environments, this pattern likely isn't deployable without on-prem inference and a redaction layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Passwords:&lt;/strong&gt; &lt;code&gt;&amp;lt;input type="password"&amp;gt;&lt;/code&gt; doesn't usually carry &lt;code&gt;role="textbox"&lt;/code&gt;, but a custom component that wraps a password input &lt;em&gt;can&lt;/em&gt;. Maintain an explicit denylist (e.g., any field inside &lt;code&gt;[data-sensitive]&lt;/code&gt; or matching &lt;code&gt;input[type=password]&lt;/code&gt;) and skip both the click and the row-context extraction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default to opt-in for row context:&lt;/strong&gt; rather than always capturing sibling-cell text, mark tables that are safe with &lt;code&gt;data-track-row-context="true"&lt;/code&gt;. Anything else gets the click event without the row content.&lt;/li&gt;
&lt;/ul&gt;
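&lt;p&gt;The denylist and the opt-in rule above can be folded into one policy check. This is a sketch, not our production code: &lt;code&gt;capturePolicy&lt;/code&gt; and the &lt;code&gt;Closestable&lt;/code&gt; type are hypothetical, and the selectors are the ones named in the bullets.&lt;/p&gt;

```typescript
// Structural stand-in for Element.closest so the sketch runs outside a browser.
interface Closestable {
  closest(selector: string): unknown;
}

// Selectors taken from the bullets above; extend them for your own components.
const SENSITIVE_SELECTOR = "[data-sensitive], input[type=password]";
const ROW_CONTEXT_SELECTOR = '[data-track-row-context="true"]';

type CapturePolicy = "skip" | "click-only" | "click-with-row-context";

function capturePolicy(el: Closestable): CapturePolicy {
  // closest() tests the element itself and then its ancestors, so one call
  // covers both a password input and any field inside a sensitive wrapper.
  if (el.closest(SENSITIVE_SELECTOR) !== null) return "skip";
  // Row context is opt-in: only tables explicitly marked safe expose sibling cells.
  if (el.closest(ROW_CONTEXT_SELECTOR) !== null) return "click-with-row-context";
  return "click-only";
}
```

&lt;p&gt;Checking the denylist first means a sensitive field inside an opted-in table is still skipped.&lt;/p&gt;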

&lt;p&gt;&lt;strong&gt;Bursty interactions overflow the buffer.&lt;/strong&gt; The 20-event ring buffer assumes a leisurely 1 click every 2-5 seconds. Power users clearing 19 filter chips, or anyone doing bulk-select-then-action, can flush the buffer in under 3 seconds. The 300ms debounce only catches identical-element repeats, not bursty distinct events. If your app has these workflows, consider coalescing repeats ("clicked 8 filter chips: A, B, C…") or bumping the buffer to 50.&lt;/p&gt;
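&lt;p&gt;One way to coalesce bursts, sketched under assumptions: consecutive events with the same role inside a short window collapse into a single entry. The &lt;code&gt;TraceBuffer&lt;/code&gt; class, the 1000ms window, and the output wording are all illustrative choices, not values we have tuned.&lt;/p&gt;

```typescript
interface TraceEvent {
  role: string;
  names: string[]; // more than one name means coalesced repeats
  at: number;      // timestamp of the latest click, epoch ms
}

class TraceBuffer {
  private events: TraceEvent[] = [];
  private readonly capacity: number;
  private readonly coalesceWindowMs: number;

  // 20 matches the buffer size discussed above; the window is an assumed value.
  constructor(capacity = 20, coalesceWindowMs = 1000) {
    this.capacity = capacity;
    this.coalesceWindowMs = coalesceWindowMs;
  }

  push(role: string, name: string, at: number): void {
    const last = this.events[this.events.length - 1];
    if (last !== undefined) {
      if (last.role === role) {
        if (this.coalesceWindowMs >= at - last.at) {
          // Same kind of control, clicked in quick succession: merge.
          last.names.push(name);
          last.at = at;
          return;
        }
      }
    }
    this.events.push({ role, names: [name], at });
    // Drop the oldest entry once the buffer is over capacity.
    if (this.events.length > this.capacity) this.events.shift();
  }

  serialize(): string[] {
    return this.events.map((e) =>
      e.names.length > 1
        ? `clicked ${e.names.length} ${e.role}s: ${e.names.join(", ")}`
        : `clicked ${e.role} "${e.names[0]}"`,
    );
  }
}
```

&lt;p&gt;With this shape, a burst of filter-chip clicks occupies one slot instead of evicting the earlier history the user is most likely to ask about.&lt;/p&gt;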

&lt;p&gt;&lt;strong&gt;The "host-app exposure" assumption.&lt;/strong&gt; Capturing events from a Module Federation remote works because the remote shares the host's &lt;code&gt;document&lt;/code&gt;. Same-origin iframes can do this via &lt;code&gt;parent.document&lt;/code&gt;. Cross-origin iframes need a &lt;code&gt;postMessage&lt;/code&gt; bridge the host opts into. Chrome extension content scripts get DOM access automatically but live in an isolated JS world, so you lose React-fiber introspection. The pattern generalizes; the transport doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  When This Pattern Applies
&lt;/h2&gt;

&lt;p&gt;The core pattern — capture-phase listener, role-based filtering, circular buffer, compact NL serialization — works for AI agents embedded in role-rich, light-DOM SaaS UIs. Practical constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You need DOM access to the host page.&lt;/strong&gt; Module Federation in the same document is easiest. Same-origin iframes work via &lt;code&gt;parent.document&lt;/code&gt;. Cross-origin iframes need a postMessage bridge. Pure Shadow-DOM hosts (Salesforce Lightning, heavy Web Components apps) need a different approach entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need semantic roles&lt;/strong&gt; on interactive elements. ARIA roles, or at minimum semantic HTML (&lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt;), give you the vocabulary to describe what was clicked. Role-less sites (some Webflow / CMS-generated pages) won't yield much.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need a framing instruction&lt;/strong&gt; in the LLM prompt. Without explicit guidance, models tend to treat the trace as background metadata rather than as context for resolving what the user means by "this".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need a redaction story&lt;/strong&gt; for any field that could carry PII, PHI, or PCI data. Capture-by-default is the wrong default for regulated environments.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you're building something similar, I'd love to hear how you bridged the synchronous-event-meets-async-React-state gap. We went with a callback into &lt;code&gt;setState&lt;/code&gt;; the obvious alternatives are a Zustand/Redux store the chat UI subscribes to, an RxJS subject, or a &lt;code&gt;window&lt;/code&gt;-scoped event bus. Which would you have picked?&lt;/p&gt;




&lt;p&gt;The accessibility tree gave the agent spatial awareness — &lt;em&gt;what is on the page&lt;/em&gt;. The activity trace gives it temporal awareness — &lt;em&gt;what the user was doing&lt;/em&gt;. Together, they let the agent answer the question every user actually asks: "I was just doing something — help me with &lt;em&gt;that&lt;/em&gt;."&lt;/p&gt;

</description>
      <category>ai</category>
      <category>a11y</category>
      <category>frontend</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Giving AI Agents Eyes (Part 1): 6 Tricks for Reading Web Pages Without Vision Models</title>
      <dc:creator>Gaurav Chodwadia</dc:creator>
      <pubDate>Mon, 13 Apr 2026 22:09:45 +0000</pubDate>
      <link>https://dev.to/gauravchodwadia/giving-ai-agents-eyes-part-1-6-tricks-for-reading-web-pages-without-vision-models-4193</link>
      <guid>https://dev.to/gauravchodwadia/giving-ai-agents-eyes-part-1-6-tricks-for-reading-web-pages-without-vision-models-4193</guid>
      <description>&lt;p&gt;There's a growing class of AI agents that don't browse the web autonomously — they live &lt;em&gt;inside&lt;/em&gt; web applications. Figma AI lives inside the design tool. Notion AI lives inside the document editor. GitHub Copilot lives inside the IDE. And increasingly, enterprise SaaS platforms are embedding AI assistants directly into their dashboards.&lt;/p&gt;

&lt;p&gt;These in-app agents face a problem that standalone chatbots don't: the user is staring at a complex UI — data tables, status badges, filters, modals — and expects the agent to see it too. Not a screenshot. Not a URL. The actual semantic content of what's on screen.&lt;/p&gt;

&lt;p&gt;We hit this building an AI assistant for a large enterprise admin dashboard with dozens of page types. The assistant knew &lt;em&gt;which page&lt;/em&gt; the user was on (via the URL), but had zero visibility into &lt;em&gt;what was displayed&lt;/em&gt; on that page. The user asks "which items have errors?" while looking at a table of 25 items — and the agent has no idea.&lt;/p&gt;

&lt;p&gt;The breakthrough came from an unlikely — but in hindsight, obvious — source: accessibility trees.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Representation Problem
&lt;/h2&gt;

&lt;p&gt;Before diving into the solution, let's frame the problem. You need to give an LLM a representation of what's currently on a web page. Your options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raw HTML&lt;/strong&gt; — Feeding the DOM directly to an LLM is like handing someone the source code of a novel and asking them to summarize the plot. A typical admin dashboard page is ~150 KB of HTML. That's ~37,500 tokens — most of it CSS classes, data attributes, wrapper divs, and structural noise the model has to wade through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshots&lt;/strong&gt; — Vision models have gotten remarkably good at reading pages, but they still struggle with dense data tables (25+ rows), and they can't see what's in the DOM: ARIA attributes like &lt;code&gt;haspopup&lt;/code&gt; and &lt;code&gt;expanded&lt;/code&gt;, disabled states that aren't visually distinct, or which element has focus. They also cost ~1,500-3,000 tokens per image and add capture latency. For data-heavy enterprise UIs, text representations are more reliable and cheaper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Markdown conversion&lt;/strong&gt; — Tools like Turndown can convert HTML to Markdown. It's readable, but you lose all interactive state (which button is disabled? which tab is selected? which checkbox is checked?). And it still runs 3-5x more tokens than what we ended up with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DOM-to-JSON&lt;/strong&gt; — Serializing the DOM tree to JSON preserves structure but is absurdly verbose. Even pruned, a typical page produces ~20-50 KB of JSON. The LLM has to navigate nested objects full of &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; wrappers that carry zero semantic meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The winner? None of the above.&lt;/strong&gt; The answer was hiding in plain sight — in the same tree that screen readers have used for decades.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Accessibility Trees
&lt;/h2&gt;

&lt;p&gt;An accessibility tree is a parallel representation of the DOM that browsers maintain for screen readers. Its entire purpose is to answer the question: &lt;em&gt;how do you describe this page to someone who can't see it?&lt;/em&gt; It strips away visual styling and structural noise, keeping only what matters: &lt;strong&gt;roles&lt;/strong&gt;, &lt;strong&gt;names&lt;/strong&gt;, &lt;strong&gt;states&lt;/strong&gt;, and &lt;strong&gt;values&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That's exactly the question we're asking on behalf of an LLM. Screen readers and language models need the same thing — a compact, semantic, text-based description of what's on screen. The a11y tree has been solving this problem for decades. We just pointed it at a different consumer.&lt;/p&gt;

&lt;p&gt;Here's what our a11y tree output looks like for an items catalog page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[heading level=1] Catalog
[tab selected] All
[tab] Unpublished (39746)
[table]
  [row] [columnheader] Item Name | [columnheader] SKU | [columnheader] Status | [columnheader] Price
  [row] [cell] Laptop Pro 15 | [cell] sku-0082 | [cell] Unpublished | [cell] $299.00
  [row] [cell] Widget B | [cell] sku-1234 | [cell] Published | [cell] $49.99
[status] Showing 1-25 of 114,827 items
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the full page, that's ~750-1,250 tokens versus ~37,500 tokens as raw HTML: a &lt;strong&gt;30-50x reduction&lt;/strong&gt; with almost no loss of the information the LLM actually needs.&lt;/p&gt;

&lt;p&gt;This isn't a novel idea. &lt;a href="https://github.com/microsoft/playwright-mcp" rel="noopener noreferrer"&gt;Playwright MCP&lt;/a&gt; uses accessibility snapshots for its &lt;code&gt;browser_snapshot&lt;/code&gt; tool. Claude's Chrome extension uses a DOM walker for its &lt;code&gt;read_page&lt;/code&gt; tool. &lt;a href="https://arxiv.org/abs/2410.13825" rel="noopener noreferrer"&gt;AgentOccam&lt;/a&gt; showed that plain a11y trees match or beat vision-augmented approaches on web agent benchmarks. The a11y tree is emerging as the standard representation for giving LLMs page comprehension — and for good reason.&lt;/p&gt;

&lt;p&gt;But most implementations stop at "just use the a11y tree." What follows isn't a collection of novel inventions — role classification is how browsers already compute the tree, name resolution is the W3C spec, and table deduplication is common sense once you see the token waste. Individually, none of these are surprising. But nobody writes down the full list of things you actually need to handle before an a11y tree works reliably in production. We learned each of these the hard way, so you don't have to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trick 1: Role Classification Eliminates Div Soup
&lt;/h2&gt;

&lt;p&gt;Modern web apps are drowning in wrapper divs. A single button might be nested 10-20 levels deep:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"wrapper"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"inner"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"container"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"btn-group"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;button&amp;gt;&lt;/span&gt;Edit Item&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;button&amp;gt;&lt;/span&gt;Delete Item&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
      &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you naively walk the DOM and emit every element, your output is mostly indentation and empty wrapper lines. The fix is a three-way role classification system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leaf roles&lt;/strong&gt; (terminal nodes — extract name, stop recursing): &lt;code&gt;button&lt;/code&gt;, &lt;code&gt;link&lt;/code&gt;, &lt;code&gt;textbox&lt;/code&gt;, &lt;code&gt;checkbox&lt;/code&gt;, &lt;code&gt;radio&lt;/code&gt;, &lt;code&gt;img&lt;/code&gt;, &lt;code&gt;switch&lt;/code&gt;, &lt;code&gt;slider&lt;/code&gt;, &lt;code&gt;menuitem&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container roles&lt;/strong&gt; (structural — recurse into children): &lt;code&gt;table&lt;/code&gt;, &lt;code&gt;dialog&lt;/code&gt;, &lt;code&gt;navigation&lt;/code&gt;, &lt;code&gt;form&lt;/code&gt;, &lt;code&gt;grid&lt;/code&gt;, &lt;code&gt;tablist&lt;/code&gt;, &lt;code&gt;menu&lt;/code&gt;, &lt;code&gt;region&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transparent containers&lt;/strong&gt; (invisible wrappers — skip element, promote children): any &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;span&amp;gt;&lt;/code&gt;, or element with no semantic role&lt;/p&gt;

&lt;p&gt;With transparent container promotion, the deeply nested example above collapses to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[button] Edit Item
[button] Delete Item
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The leaf/container distinction is equally important. When the walker hits a &lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt;, it extracts the button's accessible name and stops. It doesn't descend into the button's inner &lt;code&gt;&amp;lt;span&amp;gt;&lt;/code&gt; + &lt;code&gt;&amp;lt;svg&amp;gt;&lt;/code&gt; icon structure to produce noise. The accessible name already captures what matters.&lt;/p&gt;

&lt;p&gt;This three-way classification is what lets us fit a complex admin page into ~500 nodes and ~1,000 tokens. Without it, you'd blow through any reasonable token budget on structural noise alone.&lt;/p&gt;
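&lt;p&gt;The three-way classification can be sketched over a plain-object tree. The &lt;code&gt;UiNode&lt;/code&gt; shape and &lt;code&gt;serializeTree&lt;/code&gt; are hypothetical, and as a simplification any role that is neither missing nor a leaf is treated as a container here.&lt;/p&gt;

```typescript
// Plain-object stand-in for DOM nodes so the sketch runs anywhere.
interface UiNode {
  role: string | null; // null means no semantic role (div, span, ...)
  name: string;        // accessible name, assumed already resolved
  children: UiNode[];
}

const LEAF_ROLES = new Set([
  "button", "link", "textbox", "checkbox", "radio", "img", "switch", "slider", "menuitem",
]);

function serializeTree(node: UiNode, depth = 0, out: string[] = []): string[] {
  const indent = "  ".repeat(depth);
  if (node.role === null) {
    // Transparent container: emit nothing, promote children to this depth.
    for (const child of node.children) serializeTree(child, depth, out);
  } else if (LEAF_ROLES.has(node.role)) {
    // Leaf: emit role and name, and do not descend into inner spans or icons.
    out.push(`${indent}[${node.role}] ${node.name}`);
  } else {
    // Container: emit the role, then recurse one level deeper.
    out.push(`${indent}[${node.role}]${node.name ? " " + node.name : ""}`);
    for (const child of node.children) serializeTree(child, depth + 1, out);
  }
  return out;
}
```

&lt;p&gt;Four nested role-less wrappers around two buttons produce exactly two output lines, because only the leaf branch ever emits for that subtree.&lt;/p&gt;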

&lt;h2&gt;
  
  
  Trick 2: Name Resolution — Six Fallbacks Before Giving Up
&lt;/h2&gt;

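&lt;p&gt;Determining what to call each element is harder than it sounds. We use a priority chain modeled on the &lt;a href="https://www.w3.org/TR/accname-1.2/" rel="noopener noreferrer"&gt;W3C accessible name computation&lt;/a&gt;. It is a simplification rather than a faithful port (the spec, for instance, consults &lt;code&gt;aria-labelledby&lt;/code&gt; before &lt;code&gt;aria-label&lt;/code&gt;). The order our walker uses:&lt;/p&gt;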

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aria-label&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;button aria-label="Close dialog"&amp;gt;X&amp;lt;/button&amp;gt;&lt;/code&gt; -&amp;gt; "Close dialog"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;aria-labelledby&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;References another element's text by ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;alt&lt;/code&gt; attribute&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;img alt="Product photo"&amp;gt;&lt;/code&gt; -&amp;gt; "Product photo"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;label for="..."&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Associated form label&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;placeholder&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Input placeholder text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;title&lt;/code&gt; attribute&lt;/td&gt;
&lt;td&gt;Tooltip fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Text content&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;button&amp;gt;Save Changes&amp;lt;/button&amp;gt;&lt;/code&gt; -&amp;gt; "Save Changes"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The good news: most well-built apps work fine at priority 7. Buttons have visible text, headings have visible text, links have visible text. ARIA attributes are enhancements, not requirements. If your app uses a component library with semantic HTML (&lt;code&gt;&amp;lt;button&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;h2&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;a href&amp;gt;&lt;/code&gt;), the walker assigns correct roles automatically — no ARIA needed.&lt;/p&gt;
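&lt;p&gt;The priority chain in the table reduces to "first non-empty candidate wins". A minimal sketch, assuming the raw values have already been read off the element into a hypothetical &lt;code&gt;NameSources&lt;/code&gt; record:&lt;/p&gt;

```typescript
// Hypothetical record of raw values read from one element.
interface NameSources {
  ariaLabel?: string;
  labelledByText?: string; // text of the element(s) referenced by aria-labelledby
  alt?: string;
  labelText?: string;      // text of an associated form label
  placeholder?: string;
  title?: string;
  textContent?: string;
}

function resolveName(s: NameSources): string {
  // Order matches the priority table: 1 through 7.
  const chain = [
    s.ariaLabel, s.labelledByText, s.alt, s.labelText, s.placeholder, s.title, s.textContent,
  ];
  for (const candidate of chain) {
    const trimmed = (candidate ?? "").trim();
    // Whitespace-only values count as missing and fall through.
    if (trimmed !== "") return trimmed;
  }
  return "";
}
```

&lt;p&gt;Treating whitespace-only values as missing matters in practice: an empty &lt;code&gt;aria-label&lt;/code&gt; should not mask perfectly good visible text.&lt;/p&gt;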

&lt;h2&gt;
  
  
  Trick 3: Visual Cue Annotations
&lt;/h2&gt;

&lt;p&gt;Here's a gap that surprised us. The a11y tree captures semantic structure perfectly — but it doesn't capture &lt;em&gt;visual presentation&lt;/em&gt;. During testing, a user asked "what is this blue alert?" about an info banner. The LLM couldn't identify it because the a11y tree rendered it as plain text with no color or severity metadata.&lt;/p&gt;

&lt;p&gt;The same problem hits status badges ("Published" is green, "Error" is red), highlighted rows, and icon-only indicators. The user sees color-coded meaning; the LLM sees flat text.&lt;/p&gt;

&lt;p&gt;The solution: a static map of CSS selectors to semantic annotations, checked via &lt;code&gt;Element.matches()&lt;/code&gt; during the tree walk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;VISUAL_CUE_MAP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.alert-info, .banner--info&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visual=blue-info-banner&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.alert-warning, .banner--warning&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visual=yellow-warning-banner&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.alert-error, .banner--error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visual=red-error-banner&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.badge--success, .status--active&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visual=green-badge&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.badge--error, .status--error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visual=red-badge&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getVisualCue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Element&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;VISUAL_CUE_MAP&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The enriched output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[region visual=blue-info-banner] This item requires attention. Review the listing.
[cell visual=green-badge] Published
[cell visual=red-badge] Error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Negligible runtime cost (CSS selector matching is near-instant), fully deterministic, and the LLM prompt can explain what each annotation means. The trade-off is maintaining the selector map as the design system evolves, but that's a small price for giving the LLM color awareness.&lt;/p&gt;

&lt;p&gt;Token impact is negligible: ~2-5 extra tokens per annotated element, ~20-100 total per page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trick 4: Hidden Controls Discovery via ARIA Hints
&lt;/h2&gt;

&lt;p&gt;Here's a problem unique to rich web applications: many critical controls are &lt;strong&gt;hidden&lt;/strong&gt;. Dropdown menus, popup editors, modals, side drawers — they only exist in the DOM after a trigger element is clicked. The a11y tree captures the trigger but not what it opens.&lt;/p&gt;

&lt;p&gt;On a single catalog page, we found 9 distinct hidden control types: item detail popups, inline price editors, action menus (edit/retire/delete), shipping configuration panels, lifecycle filters, price range filters, fulfillment type filters, sort drawers, and a full filter panel with 19 expandable sections.&lt;/p&gt;

&lt;p&gt;The a11y tree sees: &lt;code&gt;[button collapsed] $ 100.33&lt;/code&gt;. It doesn't know that clicking it opens a pricing editor with competitive pricing data, a base price input, and Apply/Close buttons.&lt;/p&gt;

&lt;p&gt;The partial solution comes from ARIA attributes that are already in the DOM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[button collapsed haspopup=menu] $ 100.33
[button collapsed haspopup=dialog] --
[button collapsed haspopup=listbox] Lifecycle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;aria-haspopup&lt;/code&gt; tells you &lt;em&gt;something&lt;/em&gt; is behind the button. &lt;code&gt;aria-controls&lt;/code&gt; can reference the target element by ID. The LLM now knows enough to say "click the price value — it opens a pricing menu" instead of giving generic instructions.&lt;/p&gt;
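&lt;p&gt;A sketch of how those attributes could be folded into the bracketed line format shown above. &lt;code&gt;describeTrigger&lt;/code&gt; and the &lt;code&gt;AttrSource&lt;/code&gt; type are hypothetical; mapping &lt;code&gt;aria-haspopup="true"&lt;/code&gt; to &lt;code&gt;menu&lt;/code&gt; follows the ARIA definition of that value.&lt;/p&gt;

```typescript
// Structural stand-in for Element.getAttribute so the sketch runs outside a browser.
interface AttrSource {
  getAttribute(name: string): string | null;
}

function describeTrigger(role: string, name: string, el: AttrSource): string {
  const parts = [role];

  const expanded = el.getAttribute("aria-expanded");
  if (expanded === "true") parts.push("expanded");
  if (expanded === "false") parts.push("collapsed");

  const popup = el.getAttribute("aria-haspopup");
  if (popup !== null) {
    if (popup !== "false") {
      // "true" is the legacy spelling of "menu".
      parts.push(`haspopup=${popup === "true" ? "menu" : popup}`);
    }
  }

  return `[${parts.join(" ")}] ${name}`;
}
```

&lt;p&gt;Elements without either attribute come out as a plain &lt;code&gt;[role] name&lt;/code&gt; line, so the enrichment costs nothing on pages that don't use these hints.&lt;/p&gt;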

&lt;p&gt;For high-value pages, we layer a static action catalog on top — a JSON registry mapping trigger types to available actions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;KNOWN_ACTIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;price-editor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Click price cell in table&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Base price (editable)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Competitive price&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Buy Box price&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Active pricing programs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Update base price&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;View competitive pricing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;actions-menu&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Click three-dot icon in row&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Edit item&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Retire item&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Delete item&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Edit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Retire from marketplace&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Delete permanently&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ARIA enrichment is automatic (works on every page). The action catalog is manual but provides specifics for the pages that matter most. Together, they bridge the gap between "I see a clickable element" and "I know what it does."&lt;/p&gt;

&lt;h2&gt;
  
  
  Trick 5: The Stale Snapshot Problem
&lt;/h2&gt;

&lt;p&gt;This one bit us hard. The a11y tree is captured at a point in time — but if you capture at page load, you get the &lt;strong&gt;loading state&lt;/strong&gt;. Skeleton screens. Spinner text. "Loading..." placeholders.&lt;/p&gt;

&lt;p&gt;Here's the timeline of the bug:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T=0ms     User navigates to /catalog
T=200ms   Page shell renders (skeleton UI)
T=300ms   Data fetch fires (GET /api/items)
T=1000ms  A11y tree captured -&amp;gt; gets "Loading..." skeleton
T=1500ms  API response arrives -&amp;gt; React renders actual table
T=2000ms  User asks "What items do I have?"
T=2000ms  Message sent with stale "Loading..." snapshot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our initial approach was a 1000ms debounce after navigation plus a MutationObserver that re-extracted on significant DOM changes (5+ added/removed nodes). But the MutationObserver had its own 1500ms debounce, so in the timeline above the re-extraction would land at roughly T=3000ms, a full second after the message had already gone out with the stale context.&lt;/p&gt;

&lt;p&gt;The fix was conceptually simple: &lt;strong&gt;re-extract at the moment the user sends a message&lt;/strong&gt;, not at page load. When the user hits Send, the frontend captures a fresh a11y tree snapshot synchronously (~5-10ms on 500 nodes) and attaches it to the message payload. The snapshot is always current because it reflects exactly what the user sees when they ask their question.&lt;/p&gt;

&lt;p&gt;We kept the background extraction as a pre-cache for proactive features, but the on-send extraction always wins for message context. The MutationObserver still monitors for table row additions (a good heuristic for "data just loaded") to keep the background cache fresh, but it's no longer the critical path.&lt;/p&gt;
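&lt;p&gt;The split between the background pre-cache and the on-send capture can be sketched roughly as follows. This is a minimal illustration, not our production code: &lt;code&gt;PageContext&lt;/code&gt;, &lt;code&gt;Snapshot&lt;/code&gt;, and the injected extractor and clock callbacks are hypothetical names chosen so the example runs without a browser.&lt;/p&gt;

```typescript
// Hypothetical names throughout: PageContext, Snapshot, and the extractor
// callback are illustrative, not the article's production code.
type Snapshot = { tree: string; capturedAt: number };

class PageContext {
  private cached: Snapshot | null = null;
  private readonly extract: () => string;
  private readonly now: () => number;

  constructor(extract: () => string, now: () => number = () => Date.now()) {
    this.extract = extract;
    this.now = now;
  }

  // Background path (navigation debounce, MutationObserver): keeps a
  // pre-cache warm for proactive features. It may be stale by send time.
  refreshInBackground(): void {
    this.cached = { tree: this.extract(), capturedAt: this.now() };
  }

  // Proactive features read the cache and accept that it can lag the page.
  peekCache(): Snapshot | null {
    return this.cached;
  }

  // Critical path: re-extract synchronously when the user hits Send,
  // so the payload reflects what is on screen at that moment.
  buildMessagePayload(text: string): { text: string; context: Snapshot } {
    const fresh: Snapshot = { tree: this.extract(), capturedAt: this.now() };
    this.cached = fresh;
    return { text, context: fresh };
  }
}
```

&lt;p&gt;In a browser, the extractor passed to the constructor would be the a11y tree walk itself; the point of the sketch is only that &lt;code&gt;buildMessagePayload&lt;/code&gt; never reads the cache.&lt;/p&gt;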

&lt;h2&gt;
  
  
  Trick 6: Structured Data Extraction and Table Deduplication
&lt;/h2&gt;

&lt;p&gt;The a11y tree handles layout and UI state well, but for data tables it represents values positionally — the LLM has to count cells to figure out which column a value belongs to. Ask "what's the price of Laptop Pro 15?" and the model needs to count across: Item Name, SKU, Status, &lt;em&gt;Price&lt;/em&gt;. For a table with 25 rows and 11 columns, this is error-prone.&lt;/p&gt;

&lt;p&gt;The fix: for pages with data tables, extract a parallel &lt;strong&gt;structured data&lt;/strong&gt; representation — read &lt;code&gt;thead&lt;/code&gt; for headers, map each &lt;code&gt;tbody&lt;/code&gt; row by position, and output clean JSON with explicit header-to-value mapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Item Name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SKU"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Price"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"Item Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Laptop Pro 15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"SKU"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sku-0082"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Unpublished"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$299.00"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"Item Name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Widget B"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"SKU"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sku-1234"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Published"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"Price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"$49.99"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the LLM doesn't count — it reads &lt;code&gt;"Price": "$299.00"&lt;/code&gt; directly.&lt;/p&gt;
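&lt;p&gt;The header-to-value mapping itself is small. Here is a minimal sketch, assuming the header and cell text have already been read out of &lt;code&gt;thead&lt;/code&gt; and &lt;code&gt;tbody&lt;/code&gt; as plain strings; the function name &lt;code&gt;toStructuredData&lt;/code&gt; and its types are hypothetical, picked for illustration.&lt;/p&gt;

```typescript
// Hypothetical types: one object per row, keyed by column header.
type Row = { [header: string]: string };
type StructuredData = { headers: string[]; data: Row[] };

function toStructuredData(headers: string[], rows: string[][]): StructuredData | null {
  // Without headers there is nothing to key values by. Returning null lets
  // the caller fall back to the full a11y tree.
  if (headers.length === 0) return null;

  const data = rows.map((cells) => {
    const row: Row = {};
    headers.forEach((header, i) => {
      // Map by position; a missing trailing cell becomes an empty string
      // instead of shifting later columns.
      row[header] = (cells[i] ?? "").trim();
    });
    return row;
  });

  return { headers, data };
}

const example = toStructuredData(
  ["Item Name", "SKU", "Status", "Price"],
  [["Laptop Pro 15", "sku-0082", "Unpublished", "$299.00"]],
);
console.log(JSON.stringify(example));
```

&lt;p&gt;One limitation of this sketch: duplicate header names overwrite each other, so a real extractor would probably need to disambiguate them.&lt;/p&gt;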

&lt;p&gt;But this creates a duplication problem. The table data now appears in both the a11y tree &lt;em&gt;and&lt;/em&gt; the structured JSON. On a catalog page with 25 rows, that wastes ~400-700 tokens — 35-40% of the combined payload.&lt;/p&gt;

&lt;p&gt;The fix is conditional exclusion: when the structured data extractor succeeds, skip table-related roles during the a11y tree walk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;TABLE_ROLES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;table&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rowgroup&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;row&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;columnheader&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cell&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rowheader&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gridcell&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;structuredData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;a11yResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buildA11yTreeSnapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;excludeRoles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TABLE_ROLES&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;a11yResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buildA11yTreeSnapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;root&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// full tree as fallback&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The a11y tree shrinks from 7,618 chars to 505 chars (93% reduction). Total prompt length drops 35%. The table data lives exclusively in the structured JSON where the LLM has explicit header-to-value mapping — no positional counting needed.&lt;/p&gt;

&lt;p&gt;The key is making it &lt;strong&gt;conditional&lt;/strong&gt;. Pages without a registered extractor still get the full a11y tree as their only table representation. The skip only activates when structured data provides a superior alternative.&lt;/p&gt;
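&lt;p&gt;To make the exclusion concrete, here is one way the walk could honor &lt;code&gt;excludeRoles&lt;/code&gt;. It is a sketch under assumptions: &lt;code&gt;A11yNode&lt;/code&gt;, &lt;code&gt;RoleFilter&lt;/code&gt;, and &lt;code&gt;serializeTree&lt;/code&gt; are hypothetical stand-ins operating on a pre-built node tree, whereas the real &lt;code&gt;buildA11yTreeSnapshot&lt;/code&gt; walks the DOM.&lt;/p&gt;

```typescript
// Hypothetical node shape; a real walker would read roles and names from the DOM.
type A11yNode = { role: string; name: string; children: A11yNode[] };
// Structural type so a Set of role names can be passed directly.
type RoleFilter = { has(role: string): boolean };

const TABLE_ROLES = new Set([
  "table", "rowgroup", "row", "columnheader", "cell", "rowheader", "gridcell",
]);

function serializeTree(node: A11yNode, excludeRoles: RoleFilter, depth: number = 0): string[] {
  // Skipping an excluded node also skips its whole subtree.
  if (excludeRoles.has(node.role)) return [];

  const lines = ["  ".repeat(depth) + "[" + node.role + "] " + node.name];
  for (const child of node.children) {
    lines.push(...serializeTree(child, excludeRoles, depth + 1));
  }
  return lines;
}
```

&lt;p&gt;Dropping the whole subtree is the simplest choice, but it also drops any buttons or links nested inside table cells, so it only makes sense when something else (here, the structured JSON and the hidden controls section) describes them.&lt;/p&gt;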

&lt;h2&gt;
  
  
  Putting It All Together: The Prompt Template
&lt;/h2&gt;

&lt;p&gt;Every trick above feeds into one thing: the system prompt the LLM actually sees. Here's what the assembled prompt looks like for a data table page (sanitized from our production template):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;You are an AI assistant for [platform name].

The user is on a page in the platform. Below is a description of what's
currently visible on their screen.

--- Structured Data (machine-readable table data) ---
Use this for data questions (counting, comparing values, filtering):
&lt;span class="nt"&gt;&amp;lt;structured_data&amp;gt;&lt;/span&gt;
{"headers": ["Item Name", "SKU", "Status", "Price"],
 "data": [
   {"Item Name": "Laptop Pro 15", "SKU": "sku-0082", "Status": "Unpublished", "Price": "$299.00"},
   {"Item Name": "Widget B", "SKU": "sku-1234", "Status": "Published", "Price": "$49.99"}
 ]}
&lt;span class="nt"&gt;&amp;lt;/structured_data&amp;gt;&lt;/span&gt;

--- Page Content (layout and UI elements) ---
Use this for layout/navigation questions (where is X, what buttons exist):
&lt;span class="nt"&gt;&amp;lt;page_content&amp;gt;&lt;/span&gt;
[heading level=1] Catalog
[tab selected] All
[tab] Unpublished (39746)
[button collapsed haspopup=menu] Filters
[button collapsed haspopup=menu] Sort
[searchbox] Search items
[status] Showing 1-25 of 114,827 items
&lt;span class="nt"&gt;&amp;lt;/page_content&amp;gt;&lt;/span&gt;

--- Hidden Controls (not visible in page content above) ---
These controls exist but appear only after clicking a trigger element.
&lt;span class="nt"&gt;&amp;lt;hidden_controls&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Click any item name to open a detail popup showing full item info,
  images, and status history
&lt;span class="p"&gt;-&lt;/span&gt; Click a price cell to open a pricing editor with base price,
  competitive pricing, and Buy Box data
&lt;span class="p"&gt;-&lt;/span&gt; Click the three-dot icon on any row for actions: Edit, Retire, Delete
&lt;span class="p"&gt;-&lt;/span&gt; Click "Filters" to expand 19 filter sections: Lifecycle, Price Range,
  Fulfillment Type, etc.
&lt;span class="p"&gt;-&lt;/span&gt; Click "Sort" to choose: Item Name, Price, Status, Date Created
&lt;span class="p"&gt;-&lt;/span&gt; Type in the search box to filter items by name, SKU, or GTIN
&lt;span class="nt"&gt;&amp;lt;/hidden_controls&amp;gt;&lt;/span&gt;

Page Name: Catalog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things to notice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two representations, not one.&lt;/strong&gt; Structured data (JSON) handles precise data questions — "what's the cheapest item?" requires comparing values, which JSON makes trivial. The a11y tree (text) handles spatial questions — "what tabs are available?" or "is there a search box?" The LLM is told which to use for what.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Section headers are instructions.&lt;/strong&gt; "Use this for data questions" and "Use this for layout/navigation questions" aren't decorative; they steer the model's attention. Without them, the LLM sometimes ignores the structured data and tries to answer data questions from the a11y tree text, which requires positional counting and fails on large tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The a11y tree is deduplicated.&lt;/strong&gt; Notice the page content section has no table rows — those live exclusively in the structured JSON. This is Trick 6 in action, saving ~35% of the combined token budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hidden controls fill the gap.&lt;/strong&gt; The a11y tree shows &lt;code&gt;[button collapsed haspopup=menu] Filters&lt;/code&gt; but can't describe what's inside. The hidden controls section tells the LLM there are 19 filter sections — so it can say "click Filters to access Lifecycle, Price Range..." instead of "there's a Filters button."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total cost: ~1,500-2,500 tokens&lt;/strong&gt; for a page that would be ~37,500 tokens as raw HTML. That is a 15-25x reduction for the complete prompt (the 30-50x figure applies to the a11y tree on its own), with full context preserved: structured data for precision, a11y tree for layout, hidden controls for discoverability.&lt;/p&gt;

&lt;p&gt;This is the extraction side of the problem. How you use this context — query classification, routing, RAG enrichment — is up to your agent architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with on-send extraction, not page-load extraction.&lt;/strong&gt; We spent cycles debugging stale snapshot timing issues that would have been avoided entirely by capturing at message-send time from the start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the structured data extractor as generic, not page-specific.&lt;/strong&gt; Our first extractor was custom for one page type. But the logic — read headers from &lt;code&gt;thead th&lt;/code&gt;, map row cells by position — works on &lt;em&gt;any&lt;/em&gt; standard HTML table. A generic table auto-detector would have covered 80% of pages with zero per-page work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't skip SVGs entirely.&lt;/strong&gt; We initially skipped all &lt;code&gt;&amp;lt;svg&amp;gt;&lt;/code&gt; elements as "visual-only." But many convey meaning — checkmarks, warning triangles, info circles. Checking for &lt;code&gt;aria-label&lt;/code&gt;, parent labels, and &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; elements recovers semantic meaning from icons that would otherwise produce zero output.&lt;/p&gt;
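&lt;p&gt;The icon lookup can be sketched like this. It is a deliberately simplified illustration, not the full W3C accessible-name algorithm (it ignores &lt;code&gt;aria-labelledby&lt;/code&gt; and &lt;code&gt;aria-hidden&lt;/code&gt;, for example). &lt;code&gt;ElementLike&lt;/code&gt; and &lt;code&gt;iconLabel&lt;/code&gt; are hypothetical names; the structural type covers only the DOM members the function uses, and "parent label" is assumed here to mean the parent's &lt;code&gt;aria-label&lt;/code&gt;.&lt;/p&gt;

```typescript
// Minimal structural slice of the DOM Element API, so the sketch can run
// outside a browser. Real elements expose these same members.
type ElementLike = {
  getAttribute(name: string): string | null;
  querySelector(selector: string): { textContent: string | null } | null;
  parentElement: ElementLike | null;
};

function iconLabel(svg: ElementLike): string | null {
  const candidates = [
    // 1. The icon's own aria-label.
    svg.getAttribute("aria-label"),
    // 2. A label on the parent (assumed: the parent's aria-label).
    svg.parentElement ? svg.parentElement.getAttribute("aria-label") : null,
    // 3. The text of a nested title element.
    svg.querySelector("title")?.textContent ?? null,
  ];

  for (const candidate of candidates) {
    const text = (candidate ?? "").trim();
    if (text !== "") return text;
  }
  return null; // No label found: treat the icon as decorative and skip it.
}
```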

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;For a typical data table page with 25 rows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Representation&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw HTML&lt;/td&gt;
&lt;td&gt;~150 KB&lt;/td&gt;
&lt;td&gt;~37,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DOM JSON (pruned)&lt;/td&gt;
&lt;td&gt;~20-50 KB&lt;/td&gt;
&lt;td&gt;~5,000-12,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown&lt;/td&gt;
&lt;td&gt;~10-20 KB&lt;/td&gt;
&lt;td&gt;~2,500-5,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A11y tree&lt;/td&gt;
&lt;td&gt;~3-5 KB&lt;/td&gt;
&lt;td&gt;~750-1,250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A11y tree + structured JSON&lt;/td&gt;
&lt;td&gt;~5-8 KB&lt;/td&gt;
&lt;td&gt;~1,250-2,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A11y tree (deduplicated) + JSON&lt;/td&gt;
&lt;td&gt;~3-5 KB&lt;/td&gt;
&lt;td&gt;~800-1,250&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The a11y tree approach gives us a 30-50x reduction over raw HTML while preserving everything the LLM needs: semantic roles, interactive states, element names, and current values. The deduplication trick shaves another 35% when structured extractors are available.&lt;/p&gt;
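&lt;p&gt;The token column can be reproduced with simple arithmetic. The sketch below assumes roughly 4 bytes per token, which is the ratio the table's own rows imply (~150 KB to ~37,500 tokens); it is a rule of thumb, and real tokenizer counts vary by model and content. The function names are hypothetical.&lt;/p&gt;

```typescript
// Assumption: about 4 bytes of page text per token, the ratio the table's
// own rows imply. Real tokenizers vary by model and content.
const BYTES_PER_TOKEN = 4;

function estimateTokens(bytes: number): number {
  return Math.round(bytes / BYTES_PER_TOKEN);
}

function reductionFactor(beforeBytes: number, afterBytes: number): number {
  return estimateTokens(beforeBytes) / estimateTokens(afterBytes);
}

const rawHtmlBytes = 150_000; // ~150 KB of raw HTML

console.log(estimateTokens(rawHtmlBytes));          // 37500
console.log(reductionFactor(rawHtmlBytes, 5_000));  // 30 (a11y tree at ~5 KB)
console.log(reductionFactor(rawHtmlBytes, 3_000));  // 50 (a11y tree at ~3 KB)
```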

&lt;p&gt;Extraction takes &amp;lt;10ms on the main thread for 500 nodes. No external dependencies. No vision models. No API calls. Just a recursive DOM walk using the same principles screen readers have relied on for years.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/microsoft/playwright-mcp" rel="noopener noreferrer"&gt;Playwright MCP&lt;/a&gt; — Uses accessibility snapshots as its primary page representation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/kuroko1t/how-accessibility-tree-formatting-affects-token-cost-in-browser-mcps-n2a"&gt;How Accessibility Tree Formatting Affects Token Cost in Browser MCPs&lt;/a&gt; — Token cost analysis across serialization formats&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2511.19477" rel="noopener noreferrer"&gt;Building Browser Agents: Architecture, Security, and Practical Solutions&lt;/a&gt; — Academic survey on a11y trees as the dominant page representation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2410.13825" rel="noopener noreferrer"&gt;AgentOccam&lt;/a&gt; — Research showing plain a11y trees match or beat vision-augmented approaches&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/" rel="noopener noreferrer"&gt;Reader-LM&lt;/a&gt; — Jina's alternative approach via HTML-to-Markdown conversion&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.w3.org/TR/accname-1.2/" rel="noopener noreferrer"&gt;W3C Accessible Name and Description Computation&lt;/a&gt; — The spec behind Trick 2's name resolution&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.w3.org/TR/wai-aria-1.2/#roles_categorization" rel="noopener noreferrer"&gt;WAI-ARIA Roles Model&lt;/a&gt; — Role taxonomy used in Trick 1&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;The accessibility tree was designed to make the web understandable for people who can't see it. Turns out, it's also the best way to make the web understandable for AI that can't see it either. Same tree, different consumer, same principle: semantic structure beats raw presentation every time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in this series: giving the agent temporal awareness — tracking what users were doing, not just what's on the page.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>a11y</category>
      <category>frontend</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
