
Gaurav Chodwadia


Giving AI Agents Eyes (Part 2): From Page Snapshots to Interaction Traces

In Part 1, we solved the representation problem: how to give an LLM a compact, semantic view of a web page using accessibility trees. That gave our AI agent the ability to answer "what is on this page?"

But users don't ask that. They ask "what did I just click?" and "what was in that popup I closed?" Here's the gap, concretely:

T=0s    User clicks "Blue Widget" in a data table
T=1s    Popup appears with item details (price, status, GTIN)
T=3s    User closes popup
T=5s    User opens AI assistant, asks "what's wrong with this item?"
T=5s    Assistant captures page state: sees 25 items in table, no popup
T=5s    Assistant has zero context about which item or what was in the popup

A page snapshot is a photograph. The user is asking about a video. The agent needs two things it doesn't have: (1) a log of recent interactions (the user clicked "Blue Widget"), and (2) snapshots of ephemeral UI that no longer exists (the popup that showed price $24.99 and status Suppressed).

Session replay tools (rrweb, FullStory, LogRocket) solve a related problem, but they produce DOM serialization data designed for visual playback — not semantic descriptions for LLM consumption. You need ~200 tokens of natural language, not 200 KB of mutation records.

This post covers how we built a user-activity tracker from inside a Module Federation remote that doesn't own the host page, three bugs that compounded into a "one behind" symptom, and how we serialize 20 interaction events into a few hundred tokens of LLM-friendly text.

Strategy Pattern for Extensible Event Capture

Click events are one source — but probably not the only one we'll ever want. The host app might later emit structured events via a PubSub system; a future host could fire shell-level navigation events that we'd want to record too. We didn't want to rewrite the tracker the day either of those landed.

                     ActivitySource (interface)
                    /         |          \
                   /          |           \
        DomActivitySource  HostEventSource  HybridSource
        (capture-phase       (PubSub from    (both, with
         click listener)      host app)       priority rules)
                   \          |          /
                    \         |         /
                     useActivityTracker (hook)
                             |
                     Feature flag selects source at runtime:
                     "dom" | "host" | "hybrid" | "off"

The interface is minimal:

interface ActivitySource {
  start(): void;
  stop(): void;
  getTrace(): UserActivityTrace;
  clear(): void;
  onEvent?: () => void;  // callback when a new event is captured
}

Sources are plain classes; the React layer is one thin hook (useActivityTracker) that owns construction and cleanup. Hooks would have worked too — this isn't a religious choice. We went with classes for three pragmatic reasons:

  1. Composability without React. HybridSource creates a DomActivitySource and HostEventSource internally and merges their outputs. Doable with hooks, but composition requires wrapper components or hook-forwarding patterns; instantiating two class instances and merging is simpler.

  2. Testability without rendering. You can instantiate a DomActivitySource in a unit test with jsdom, call start(), simulate clicks, and assert on getTrace() — no renderHook, no act(), no React tree.

  3. Future-proofing for non-React consumers. If the tracker ever needs to run in a worker, a vanilla shell, or a non-React MFE, a plain class moves over unchanged.
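
As a rough illustration of the first point, here is a minimal sketch of a composing source. The shapes of ActivityEvent and UserActivityTrace are assumptions (the post does not show them), the child sources are injected rather than created internally so the sketch stays testable, and the "priority rules" are reduced to a plain newest-first merge.

```typescript
// Assumed shapes: the post does not show these two types.
interface ActivityEvent {
  timestamp: number; // epoch milliseconds
  role: string;
  name: string;
}

interface UserActivityTrace {
  events: ActivityEvent[]; // newest first
}

interface ActivitySource {
  start(): void;
  stop(): void;
  getTrace(): UserActivityTrace;
  clear(): void;
  onEvent?: () => void;
}

// Hypothetical composition: forwards lifecycle calls and merges child traces.
class HybridSource implements ActivitySource {
  onEvent?: () => void; // declared explicitly on the class
  private readonly children: ActivitySource[];
  private readonly maxEvents: number;

  constructor(children: ActivitySource[], maxEvents = 20) {
    this.children = children;
    this.maxEvents = maxEvents;
    // Re-emit every child notification through our own callback.
    for (const child of this.children) {
      child.onEvent = () => this.onEvent?.();
    }
  }

  start(): void {
    this.children.forEach((child) => child.start());
  }

  stop(): void {
    this.children.forEach((child) => child.stop());
  }

  clear(): void {
    this.children.forEach((child) => child.clear());
  }

  getTrace(): UserActivityTrace {
    const merged: ActivityEvent[] = [];
    for (const child of this.children) {
      merged.push(...child.getTrace().events);
    }
    merged.sort((a, b) => b.timestamp - a.timestamp); // newest first
    return { events: merged.slice(0, this.maxEvents) };
  }
}
```

No React is involved: two instances go in, one merged trace comes out, and a unit test can drive it with fake sources.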

Capture-Phase Click Listeners

Our AI assistant is loaded into a large SaaS dashboard as a Module Federation remote. It's not an iframe — it shares the same document as the host application. This is the architectural fact that makes activity tracking possible.

Because we share the DOM, a single listener captures every click on the host page:

document.addEventListener("click", this.handleClick.bind(this), { capture: true });

The capture phase ({ capture: true }) is important. Capture listeners on document run before the target element's own click handlers, so a stopPropagation() call in any of the host app's bubble-phase handlers comes too late to hide the event from us. (Caveat: the host can still blind us in two ways. A capture listener registered on window runs before ours and can call stopPropagation(), and a capture listener registered on document before ours can call stopImmediatePropagation(). We've never seen either in the wild, since most apps only call stopPropagation() from inside component handlers, which run in the bubble phase, but these are the ways the host can blind us if they want to.)
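
One detail worth sketching is cleanup. Calling .bind(this) inline returns a new function every time, so a later removeEventListener cannot detach it unless the bound reference was kept. The wrapper below is a hypothetical illustration (CaptureClickListener and ClickTarget are not names from the post) that keeps one stable handler reference for both calls.

```typescript
// Minimal structural type so the sketch does not need a real DOM.
// In a browser you would pass `document`.
interface ClickTarget {
  addEventListener(
    type: string,
    listener: (ev: unknown) => void,
    options?: { capture?: boolean },
  ): void;
  removeEventListener(
    type: string,
    listener: (ev: unknown) => void,
    options?: { capture?: boolean },
  ): void;
}

// Hypothetical wrapper; the post's DomActivitySource is not shown in full.
class CaptureClickListener {
  private readonly clickTarget: ClickTarget;
  private readonly onClick: (ev: unknown) => void;
  private attached = false;

  // An arrow-function field is created once per instance, so start() and
  // stop() pass the same reference to add/removeEventListener.
  private readonly handle = (ev: unknown): void => {
    this.onClick(ev);
  };

  constructor(clickTarget: ClickTarget, onClick: (ev: unknown) => void) {
    this.clickTarget = clickTarget;
    this.onClick = onClick;
  }

  start(): void {
    if (this.attached) return; // avoid double registration
    this.clickTarget.addEventListener("click", this.handle, { capture: true });
    this.attached = true;
  }

  stop(): void {
    if (!this.attached) return;
    // The capture flag must match the one used when registering.
    this.clickTarget.removeEventListener("click", this.handle, { capture: true });
    this.attached = false;
  }
}
```

Because the target is injected, the same class can be exercised in a unit test with a fake target and wired to document in the hook's effect, with stop() as the cleanup.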

But we don't want to log every click. A click on a wrapper div, a scroll container, or a decorative icon is noise. The extractClickDescriptor function starts by walking up from the event target to find the nearest actionable element, using the findNearestInteractive helper:

const ACTIONABLE_ROLES = new Set([
  "button", "link", "menuitem", "tab", "checkbox",
  "radio", "combobox", "textbox", "searchbox", "switch", "row",
]);

function findNearestInteractive(target: Element): Element | null {
  let el: Element | null = target;
  while (el) {
    // Skip our own UI
    if (el.matches('[data-assistant-container="true"]')) return null;
    // Respect opt-out attribute
    if (el.matches("[data-no-track]")) return null;

    const role = getRole(el);
    if (role && ACTIONABLE_ROLES.has(role)) return el;
    el = el.parentElement;
  }
  return null;
}

The walk-up pattern is essential because the actual click target is usually a child of the interactive element — a <span> inside a <button>, an <svg> inside a link. The allowlist focuses on ARIA roles that represent user-initiated actions, which eliminates the vast majority of noise.
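
The getRole helper is not shown above. A simplified, hypothetical version might prefer an explicit role attribute and fall back to a small table of implicit roles from semantic HTML. The mapping below is deliberately incomplete (for example, it treats every select as a combobox), and ElementLike is an assumed minimal shape rather than a real DOM type.

```typescript
// Minimal element shape for the sketch; a DOM Element has these members too.
interface ElementLike {
  tagName: string;
  getAttribute(attr: string): string | null;
  parentElement: ElementLike | null;
}

// Implicit roles for a few <input> types. Password inputs are intentionally
// absent: they have no corresponding ARIA role.
const INPUT_TYPE_ROLES = new Map<string, string>([
  ["checkbox", "checkbox"],
  ["radio", "radio"],
  ["search", "searchbox"],
  ["text", "textbox"],
  ["email", "textbox"],
  ["tel", "textbox"],
  ["url", "textbox"],
  ["button", "button"],
  ["submit", "button"],
  ["reset", "button"],
]);

function getRole(el: ElementLike): string | null {
  // An explicit role attribute wins; use the first token if several are listed.
  const explicit = el.getAttribute("role");
  if (explicit) {
    const first = explicit.trim().split(/\s+/)[0];
    if (first) return first.toLowerCase();
  }

  switch (el.tagName.toLowerCase()) {
    case "button":
      return "button";
    case "a":
      // Only anchors with an href are links.
      return el.getAttribute("href") !== null ? "link" : null;
    case "select":
      return "combobox"; // simplification: multi-selects are really listboxes
    case "textarea":
      return "textbox";
    case "tr":
      return "row";
    case "input": {
      const inputType = (el.getAttribute("type") ?? "text").toLowerCase();
      return INPUT_TYPE_ROLES.get(inputType) ?? null;
    }
    default:
      return null;
  }
}
```

With something like this in place, a click on a span inside a button resolves to null for the span and "button" one step up, which is exactly the case the walk-up loop exists for.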

Additional noise filters:

  • Debounce — Same element clicked within 300ms is suppressed (double-click)
  • Name required — Elements with no accessible name and no text content are ignored
  • Container guard — Clicks inside the AI panel itself are excluded

The result on a typical data-table page: roughly one actionable event per real click (occasionally two when a row + nested button both qualify), out of dozens of raw click events the page absorbs. No firehose.
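
The debounce and name-required filters can be sketched without touching the DOM. This is an assumption-laden illustration: elementKey stands in for however the tracker identifies "the same element" (an element reference would work equally well), and the clock is passed in so the logic stays deterministic in tests.

```typescript
// Suppresses a repeat click on the same element inside a short window.
class ClickDebouncer {
  private lastKey: string | null = null;
  private lastTimeMs = Number.NEGATIVE_INFINITY;
  private readonly windowMs: number;

  constructor(windowMs = 300) {
    this.windowMs = windowMs;
  }

  // elementKey is an assumed stable identifier, e.g. `${role}:${name}`.
  // Returns true when the click should be recorded.
  shouldRecord(elementKey: string, nowMs: number): boolean {
    const isRepeat =
      elementKey === this.lastKey && nowMs - this.lastTimeMs < this.windowMs;
    this.lastKey = elementKey;
    this.lastTimeMs = nowMs; // every click, suppressed or not, restarts the window
    return !isRepeat;
  }
}

// "Name required": keep an element only if there is something to call it by.
function hasUsableName(
  accessibleName: string | null,
  textContent: string | null,
): boolean {
  return [accessibleName, textContent].some((s) => (s ?? "").trim().length > 0);
}
```

Note that this only suppresses repeats on the same key; a burst of clicks on distinct elements passes straight through.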

Three Bugs That Compounded

The symptom looks like one bug. It's three.

Your activity trace is always one interaction behind. The user clicks Item A, asks the assistant, and the assistant either sees nothing or sees the previous click. That's the single symptom we chased.

What we actually had was three failure modes of the same underlying mismatch — synchronous DOM events meeting asynchronous React state. Each gave the bug somewhere to hide. Fixing any one of them in isolation didn't move the needle.

The setup: The remote MFE captures DOM events synchronously. The captured data flows into React state, which flows through React Context to a consumer component in a different MFE (the AI chat UI), which reads the context when the user sends a message.

Bug 1: Conditional trace inclusion. Page-content extraction fires automatically under three triggers: initial page load, significant DOM mutations, and route navigation. We initially included the activity trace only when the trigger was "message-send", so snapshots from those three triggers carried no trace. React deduplication kept the traceless version in state, so the chat UI read an empty trace at send time.

Fix: Always include the trace in every extraction, regardless of trigger.

Bug 2: No extraction on click events. Click events trigger none of the three extraction conditions above. The snapshot already in state has the trace from the last extraction — which captured clicks before the latest one.

Fix: Add an onEvent callback to the source interface. When a click is captured, fire the callback. In the page-content hook, the callback patches React state with the latest trace:

// In the page-content hook
const handleActivityEvent = useCallback(() => {
  const trace = getTraceRef.current();
  setSnapshot((prev) => {
    if (!prev) return prev;
    return { ...prev, userActivity: trace };
  });
}, []);
// In the DOM source's click handler
this.events.unshift(descriptor);
this.onEvent?.();  // patch React state immediately

Bug 3: Undeclared class property. The ActivitySource interface declares onEvent?: () => void, and the hook assigns it after construction:

source.onEvent = () => onEventRef.current?.();
source.start();

But the class never declared onEvent as a field:

// BROKEN — onEvent is not declared on the class
class DomActivitySource implements ActivitySource {
  private events: ActivityEvent[] = [];
  // ...
  handleClick(event: MouseEvent): void {
    // ...
    this.onEvent?.();  // always undefined
  }
}

In plain JS, source.onEvent = fn on an instance creates an own property that any later method call would see through this. We didn't get that — somewhere in our toolchain (TypeScript with useDefineForClassFields, plus a class-fields transform in the build pipeline), the assignment landed somewhere the class methods didn't read from. The exact mechanism was build-config specific, and chasing it stopped being interesting once we found the one-line fix:

// FIXED — explicit declaration
class DomActivitySource implements ActivitySource {
  onEvent?: () => void;    // this line fixes the bug
  private events: ActivityEvent[] = [];
  // ...
}

Declaring the field made initialization deterministic and the bug went away across every build target. What made it hard to find in the first place: handleClick works perfectly when you call it directly in a unit test. It only failed in production, where the method was bound and invoked by document.addEventListener.

General lesson: When you implement an interface with optional callback properties, declare them explicitly on the class. Don't rely on whatever your transpiler does with externally-assigned-but-undeclared properties — the semantics are subtle enough that an explicit field is the safer default. One line of TypeScript is cheaper than three days of debugging.
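
Since calling handleClick directly hid the problem, a regression test is more useful if it goes through the same path production does: assign the callback after construction, register the bound method as a listener, then dispatch. The sketch below only shows the shape of such a test. It does not reproduce the build-specific failure, and FakeEventHub and TrackedSource are hypothetical stand-ins for document and DomActivitySource.

```typescript
type ClickListener = () => void;

// Stand-in for document: stores listeners and invokes them on dispatch.
class FakeEventHub {
  private listeners: ClickListener[] = [];

  addEventListener(listener: ClickListener): void {
    this.listeners.push(listener);
  }

  dispatch(): void {
    this.listeners.forEach((listener) => listener());
  }
}

class TrackedSource {
  onEvent?: () => void; // explicit field, as in the fix above
  private count = 0;
  private readonly hub: FakeEventHub;

  constructor(hub: FakeEventHub) {
    this.hub = hub;
  }

  start(): void {
    // Same shape as production: the method is bound, then registered.
    this.hub.addEventListener(this.handleClick.bind(this));
  }

  handleClick(): void {
    this.count += 1;
    this.onEvent?.();
  }

  getCount(): number {
    return this.count;
  }
}

// Mirrors the hook: assign the callback after construction, then start.
function wireUp(source: TrackedSource, callback: () => void): void {
  source.onEvent = callback;
  source.start();
}
```

A test then dispatches through the hub and asserts that the callback ran, instead of calling handleClick on the instance. To be meaningful it should run against the same build output that ships, not only the test transpiler's.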

Token-Efficient Activity Traces for LLMs

A raw JSON array of 20 events runs ~600 tokens once you account for repeated keys, escaped quotes, and structural noise. We compress it to ~300 by serializing to compact natural language:

RECENT USER ACTIONS (newest first):
- 14:32:05 click [button] "Edit Item" (in row: Blue Widget SKU-1234)
- 14:31:58 click [link] "Blue Widget" (in row: Blue Widget SKU-1234)
- 14:31:45 click [button] "Apply" (in toolbar: Filters)
- 14:31:30 click [checkbox] "Active" (in group: Lifecycle filter)

The non-obvious choice in this format is row context, not landmark context. For data-table apps, "in row: Blue Widget SKU-1234" is far more useful than "in region: main content." The row context tells the LLM which data record the interaction was about. We extract it by walking up to the nearest <tr> or [role="row"] and joining the cell text with pipes (truncated to 60 chars). The LLM almost always asks about that one row — and now it knows which one.
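
A minimal sketch of the row-context step, assuming the cell texts have already been read from the row. The whitespace normalization, the " | " separator spacing, and the trailing ellipsis on truncation are assumptions; the post only specifies pipes and a 60-character cap.

```typescript
// Joins cell texts into a single row label, capped at maxChars characters.
// In a browser the input might come from something like:
//   Array.from(row.querySelectorAll("td, th"), (c) => c.textContent ?? "")
function buildRowContext(cellTexts: string[], maxChars = 60): string {
  const joined = cellTexts
    .map((cell) => cell.replace(/\s+/g, " ").trim()) // collapse layout whitespace
    .filter((cell) => cell.length > 0) // drop empty cells (icons, checkboxes)
    .join(" | ");
  // Reserve one character for the ellipsis so the result never exceeds the cap.
  return joined.length > maxChars ? joined.slice(0, maxChars - 1) + "…" : joined;
}
```

For a row with cells "Blue Widget" and "SKU-1234" this yields "Blue Widget | SKU-1234".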

The other choices are more obvious in hindsight: HH:MM:SS timestamps over ISO strings (time of day is enough within one session, and it saves ~15 chars per line); [button] / [link] role brackets matching the same taxonomy the a11y tree uses; newest-first ordering so the most-likely-relevant action is at the top.
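
Put together, the line format can be produced by a serializer along these lines. This is a sketch, not the actual backend code (the post does not say what language the backend serializer is written in), and the TraceEvent shape is an assumption.

```typescript
// Assumed event shape for the sketch.
interface TraceEvent {
  at: Date;
  action: string; // e.g. "click"
  role: string; // e.g. "button"
  name: string; // accessible name or text content
  context?: { kind: string; label: string }; // e.g. { kind: "row", label: "..." }
}

// Local wall-clock time as HH:MM:SS.
function formatClock(at: Date): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  return `${pad(at.getHours())}:${pad(at.getMinutes())}:${pad(at.getSeconds())}`;
}

// Expects events already ordered newest first.
function formatTrace(events: TraceEvent[]): string {
  const lines = events.map((e) => {
    const where = e.context ? ` (in ${e.context.kind}: ${e.context.label})` : "";
    return `- ${formatClock(e.at)} ${e.action} [${e.role}] "${e.name}"${where}`;
  });
  return ["RECENT USER ACTIONS (newest first):", ...lines].join("\n");
}
```

One design note: formatClock uses the local time zone of wherever it runs, so if serialization happens on the backend, the formatted time (or the user's offset) has to travel with the event for the timestamps to match what the user saw.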

The buffer. 20 events in a circular ring, oldest evicted first, reset on page navigation. At a leisurely 1 click every 2-5 seconds, 20 covers ~40-100 seconds — enough for a complete workflow (filter, search, click item, browse popup, close popup). Real users sometimes burst faster than that; we cover that case in the limitations section.
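
A sketch of the buffer, matching the events.unshift(...) call shown earlier. It is a bounded array rather than an index-based ring, which is a simplification: at 20 entries the cost of shifting elements is negligible. The class name is hypothetical.

```typescript
// Keeps the most recent `capacity` items, newest first.
class RecentEvents<T> {
  private items: T[] = [];
  private readonly capacity: number;

  constructor(capacity = 20) {
    this.capacity = capacity;
  }

  add(item: T): void {
    this.items.unshift(item); // newest goes to the front
    if (this.items.length > this.capacity) {
      this.items.pop(); // evict the oldest
    }
  }

  // Returns a copy so callers cannot mutate the buffer.
  snapshot(): T[] {
    return [...this.items];
  }

  // Call on page navigation to start a fresh trace.
  clear(): void {
    this.items = [];
  }
}
```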

Token cost. A typical event line tokenizes to ~15-25 tokens (timestamps and quoted names split poorly across tokenizers). A full 20-event buffer is ~300-500 tokens; popup snapshots add another ~200-500 each. Total overhead is roughly 8-15% on top of the existing page context payload — small per request, worth multiplying through your DAU × queries before calling it "negligible" at scale.

The format is injected into the LLM prompt with a framing instruction:

--- Recent User Activity ---
Use this to understand the user's intent and which item/element they are asking about.
<user_activity>
RECENT USER ACTIONS (newest first):
- 14:32:05 click [button] "Edit Item" (in row: Blue Widget SKU-1234)
...
</user_activity>

The framing line ("Use this to understand...") matters more than its character count suggests. In our eval set, removing it noticeably increased the rate at which the model ignored the activity section entirely and answered the page-snapshot question instead of the per-user-action question. Treat the framing as load-bearing, not decorative.
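
A small sketch of how the section could be assembled so the framing line cannot be dropped by accident. The function name is hypothetical, and stripping stray user_activity tags from the trace is an added precaution that the post does not describe: element names come from the host page, so they should not be able to close the tag early.

```typescript
const ACTIVITY_FRAMING =
  "Use this to understand the user's intent and which item/element they are asking about.";

// Wraps an already-serialized trace in the framing block shown above.
function buildActivitySection(serializedTrace: string): string {
  // Assumed hardening: remove any literal tag text that arrived via page content.
  const safeTrace = serializedTrace.replace(/<\/?user_activity>/gi, "").trim();
  if (safeTrace === "") return ""; // no activity, no section

  return [
    "--- Recent User Activity ---",
    ACTIVITY_FRAMING,
    "<user_activity>",
    safeTrace,
    "</user_activity>",
  ].join("\n");
}
```

Keeping the framing text in one constant also makes it easy to ablate in an eval run.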

The Full Pipeline

End to end, one click becomes one line in a prompt:

click → capture-phase listener → extractClickDescriptor() → ring buffer (20)
      → onEvent → setState → React Context → chat UI on send
      → POST with userActivity → backend NL serializer → <user_activity> tag in prompt

Every arrow up to the POST is either a synchronous in-process call or a single React render. That front-end path runs in well under a frame; the only network hop is the POST itself, which happens when the user sends a message.

Before vs After: What the Assistant Says

Back to the opening sequence — user clicks "Blue Widget", a popup with item details appears, user closes it, then opens the assistant. Here's what the LLM produces with and without the activity trace, given the same page state at message-send time (the table is on screen; the popup is gone):

Without activity trace:

User: what's wrong with this item?

Assistant: I can see your catalog has 25 items. Could you tell me which item you're asking about? You can click an item's row to see details.

With activity trace (~300 extra tokens in the prompt):

User: what's wrong with this item?

Assistant: You recently opened the details popup for Blue Widget (SKU-1234). Its current status is Suppressed and its price is $24.99. I can see the item's basic fields, but the popup didn't tell me why it's suppressed — do you want me to check the listing-quality details, or is there a specific error message you saw?

Same page, same model, same prompt template. The only difference is 8 lines of recent-click history plus the popup's a11y snapshot riding in the payload. The agent went from "which item?" to "I know exactly which item and what state it was in." Figuring out why it's suppressed still requires a separate lookup (a RAG hit or a tool call) — the activity trace gives the agent the what, not the why.

What I'd Do Differently

Declare every interface property on the class, not just required ones. TypeScript optional properties are tempting to leave undeclared, but the runtime semantics under various class-field transforms are subtle enough that an explicit field declaration is the safer default. Cost: one line. Benefit: a category of bugs you'll never see.

Design the chat UI as an observer of the activity tracker, not a coordinator. Our first cut had the chat UI calling into the activity layer to request a fresh trace at send time, which coupled the two MFEs and introduced a "when is the trace ready?" synchronization problem. The cleaner pattern that emerged: the activity tracker pushes updates whenever something interesting happens; the chat UI just reads whatever state is current at send time. One-way data flow, no cross-MFE coordination. If we were starting over, that's the contract we'd nail down on day one.

Treat MF integration as observation by policy, not capability. Module Federation remotes share the host's document and global scope. Observation (event listeners, mutation observers) is trivial. Interception (patching window.fetch, wrapping a shared store, replacing a singleton) is also technically possible — it just couples your remote to the host's internals and breaks the moment the host changes implementation. We adopted "observers only" as a policy after almost building a fetch-patching debug helper that would have rotted within a release. The runtime won't enforce this for you.

Limitations You Should Know About

This is the section the post would lose credibility without. The pattern works well for our setup, but it has real gaps.

Shadow DOM hides the real target. Click events are composed, so a capture-phase listener on document still fires for clicks inside a shadow root, but event.target is retargeted to the shadow host, for open and closed roots alike. Walking parentElement from a node inside a shadow tree also stops at the shadow root. If the host page uses Web Components (LitElement, Stencil, anything that mounts a design system into shadow DOM, or Salesforce Lightning Web Components), your tracker will record those clicks as clicks on the host element and miss what was actually clicked inside. The workaround is to read event.composedPath() instead of event.target and walk that array. That works for open shadow roots; for closed roots the path omits the internal nodes, so you still can't see inside from the outside.
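
A sketch of the composedPath() variant of the walk. The array it returns starts at the innermost target and ends at document and window, so the loop has to skip entries that are not elements. For brevity this version checks only the explicit role attribute and omits the container and opt-out guards; the function and type names are hypothetical.

```typescript
// Anything in the path that can report attributes (i.e. an element).
interface RoleCarrier {
  getAttribute(attr: string): string | null;
}

function isRoleCarrier(node: unknown): node is RoleCarrier {
  return (
    typeof node === "object" &&
    node !== null &&
    typeof (node as { getAttribute?: unknown }).getAttribute === "function"
  );
}

// `path` is what event.composedPath() returns: innermost target first, then
// its ancestors (crossing open shadow roots), ending at document and window.
function findInteractiveInPath(
  path: readonly unknown[],
  actionableRoles: ReadonlySet<string>,
): RoleCarrier | null {
  for (const node of path) {
    if (!isRoleCarrier(node)) continue; // skips ShadowRoot, document, window
    const role = node.getAttribute("role");
    if (role !== null && actionableRoles.has(role)) return node;
  }
  return null;
}
```

In the click handler this would be called as findInteractiveInPath(event.composedPath(), ACTIONABLE_ROLES) in place of the parentElement walk.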

The role allowlist captures keystroke targets. textbox, searchbox, and combobox are in ACTIONABLE_ROLES, which means a click on an input followed by typing is recorded with the input's accessible name. If a user's row data — names, addresses, account numbers — appears in the row context we attach to each event, that data is now in the activity trace and ends up in the LLM prompt. Three implications:

  • PII / PHI / PCI: if your dashboard shows regulated data, the activity trace will exfiltrate it to whatever model provider you're using. For PHI/PCI environments, this pattern likely isn't deployable without on-prem inference and a redaction layer.
  • Passwords: <input type="password"> doesn't usually carry role="textbox", but a custom component that wraps a password input can. Maintain an explicit denylist (e.g., any field inside [data-sensitive] or matching input[type=password]) and skip both the click and the row-context extraction.
  • Default to opt-in for row context: rather than always capturing sibling-cell text, mark tables that are safe with data-track-row-context="true". Anything else gets the click event without the row content.
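
The denylist and the opt-in can be combined in one guard that runs before anything is recorded. A minimal sketch, assuming the attribute names from the bullets above; buildClickRecord and its ClickRecord shape are hypothetical. Row text is read lazily, so it is never touched for tables that have not opted in.

```typescript
// Anything with a DOM-style closest(); a real Element has this method.
interface Closestable {
  closest(selector: string): unknown;
}

const SENSITIVE_SELECTOR = '[data-sensitive], input[type="password"]';
const ROW_CONTEXT_OPT_IN = '[data-track-row-context="true"]';

interface ClickRecord {
  name: string;
  rowContext?: string;
}

function buildClickRecord(
  el: Closestable,
  name: string,
  readRowContext: () => string,
): ClickRecord | null {
  // Denylist: drop the click entirely, not just the row context.
  if (el.closest(SENSITIVE_SELECTOR) !== null) return null;

  // Opt-in: without the marker, record the click but never read row text.
  if (el.closest(ROW_CONTEXT_OPT_IN) === null) return { name };

  return { name, rowContext: readRowContext() };
}
```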

Bursty interactions overflow the buffer. The 20-event ring buffer assumes a leisurely 1 click every 2-5 seconds. Power users clearing 19 filter chips, or anyone doing bulk-select-then-action, can flush the buffer in under 3 seconds. The 300ms debounce only catches identical-element repeats, not bursty distinct events. If your app has these workflows, consider coalescing repeats ("clicked 8 filter chips: A, B, C…") or bumping the buffer to 50.
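
One possible shape for the coalescing step, applied at serialization time. This is a sketch under assumptions: events arrive newest first, a "burst" means consecutive events with the same role and the same container label no more than windowMs apart, and the output wording is illustrative rather than the format used elsewhere in this post.

```typescript
interface BurstEvent {
  timestampMs: number;
  role: string;
  name: string;
  group: string; // container label, e.g. "Filters"
}

interface Run {
  role: string;
  group: string;
  names: string[];
  lastTimestampMs: number;
}

// Collapses runs of similar, closely spaced events into one summary line each.
function coalesceBursts(
  events: BurstEvent[],
  windowMs = 1000,
  maxNames = 3,
): string[] {
  const runs: Run[] = [];
  let current: Run | undefined;

  for (const ev of events) {
    const extendsRun =
      current !== undefined &&
      current.role === ev.role &&
      current.group === ev.group &&
      Math.abs(current.lastTimestampMs - ev.timestampMs) <= windowMs;

    if (current !== undefined && extendsRun) {
      current.names.push(ev.name);
      current.lastTimestampMs = ev.timestampMs;
    } else {
      current = {
        role: ev.role,
        group: ev.group,
        names: [ev.name],
        lastTimestampMs: ev.timestampMs,
      };
      runs.push(current);
    }
  }

  return runs.map((run) => {
    const shown = run.names.slice(0, maxNames).map((n) => `"${n}"`).join(", ");
    const more = run.names.length > maxNames ? ", …" : "";
    return run.names.length === 1
      ? `click [${run.role}] ${shown} (in ${run.group})`
      : `${run.names.length} clicks [${run.role}] ${shown}${more} (in ${run.group})`;
  });
}
```

Coalescing at serialization time keeps the raw buffer intact but does not stop it from overflowing; to protect earlier history, the same grouping would have to happen when events are added.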

The "host-app exposure" assumption. Capturing events from a Module Federation remote works because the remote shares the host's document. Same-origin iframes can do this via parent.document. Cross-origin iframes need a postMessage bridge the host opts into. Chrome extension content scripts get DOM access automatically but live in an isolated JS world, so you lose React-fiber introspection. The pattern generalizes; the transport doesn't.

When This Pattern Applies

The core pattern — capture-phase listener, role-based filtering, circular buffer, compact NL serialization — works for AI agents embedded in role-rich, light-DOM SaaS UIs. Practical constraints:

  • You need DOM access to the host page. Module Federation in the same document is easiest. Same-origin iframes work via parent.document. Cross-origin iframes need a postMessage bridge. Pure Shadow-DOM hosts (Salesforce Lightning, heavy Web Components apps) need a different approach entirely.
  • You need semantic roles on interactive elements. ARIA roles, or at minimum semantic HTML (<button>, <a>, <input>), give you the vocabulary to describe what was clicked. Role-less sites (some Webflow / CMS-generated pages) won't yield much.
  • You need a framing instruction in the LLM prompt. Without explicit guidance, models are more likely to ignore the trace and answer from the page snapshot alone.
  • You need a redaction story for any field that could carry PII, PHI, or PCI data. Capture-by-default is the wrong default for regulated environments; make row context opt-in.

If you're building something similar, I'd love to hear how you bridged the synchronous-event-meets-async-React-state gap. We went with a callback into setState; the obvious alternatives are a Zustand/Redux store the chat UI subscribes to, an RxJS subject, or a window-scoped event bus. Which would you have picked?


The accessibility tree gave the agent spatial awareness — what is on the page. The activity trace gives it temporal awareness — what the user was doing. Together, they let the agent answer the question every user actually asks: "I was just doing something — help me with that."
