DEV Community

prof2k


Waypoint — Building a Universal Intent Layer for the Web

This article was written as part of my submission for the Google Gemini Live Agent Challenge. #GeminiLiveAgentChallenge


The problem nobody talks about

Try navigating a website using only your keyboard right now.

Press Tab. Keep pressing it. Watch the focus indicator jump unpredictably across hundreds of elements before you reach the one thing you actually wanted.

This is the daily reality for millions of people with motor disabilities, visual impairments, or conditions that make using a mouse impossible or painful. Tab navigation is the web's accessibility fallback. And it's broken.

Not because it doesn't work technically. But because it was designed around actions — move to next element, press enter — rather than intent — I want to find the navigation, I want to search, I want to buy this.

The gap between those two things is where Waypoint lives.


What exists today and why it falls short

Current accessibility tools fall into a few buckets:

Screen readers like JAWS and NVDA are powerful but have a steep learning curve. They read what's there. They don't understand what you want.

Voice control tools like Dragon NaturallySpeaking let you speak commands — but they're essentially voice-powered mouse clicks. "Click sign in." It's still action-first. It requires you to know exactly what the UI calls the thing you want.

Browser accessibility APIs give developers tools to make sites accessible — but only if developers use them correctly. Most don't.

The fundamental problem with all of these: they require the user to adapt to the interface. Waypoint flips that. The interface adapts to the user.


The core insight: intent over action

Every website, regardless of how it's built, has the same underlying anatomy:

  • Navigation — somewhere to go
  • Primary purpose — a reason the page exists
  • Controls — things to do

The markup might be a mess. The developer might have used <div class="nav-wrapper"> instead of <nav>. None of that matters if you understand what the page is for and what a user could want to accomplish on it.

That's intent. Not "click this element." But "I want to navigate."

Waypoint is built around three primitives:

DISCOVER → DOCUMENT → ACTIVATE

  • DISCOVER — analyze the page, map every possible intent
  • DOCUMENT — store it as a living Intent Surface Map
  • ACTIVATE — resolve natural speech against that map and execute

Everything in the codebase serves one of these three primitives. Nothing else.


The architecture

The three-layer indexing pipeline

The hardest problem is turning an arbitrary webpage — with inconsistent markup, no guaranteed semantic HTML, no ARIA — into a structured map of what a user can do. Waypoint solves this with a three-layer indexing pipeline that runs when the user clicks "Index this page."

Layer 0 — Structural (0ms, client-side, no AI)

The content script walks the entire DOM and stamps every interactive element — every <a>, <button>, <input>, <select>, <textarea> — with a data-wp-id attribute. Short random IDs. Six characters. These stamps are the bridge between the AI's output and the live DOM.
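The stamping pass can be sketched like this (a minimal reconstruction; the helper names are mine, not Waypoint's):

```javascript
const ID_CHARS = "abcdefghijklmnopqrstuvwxyz0123456789";

// Generate a short random ID — six characters, as described above.
function makeWpId(length = 6) {
  let id = "";
  for (let i = 0; i < length; i++) {
    id += ID_CHARS[Math.floor(Math.random() * ID_CHARS.length)];
  }
  return id;
}

// Stamp every interactive element so the AI's output can be mapped back
// to the live DOM. `root` is a Document or Element.
function stampInteractives(root) {
  const seen = new Set(); // IDs issued during this pass
  for (const el of root.querySelectorAll("a, button, input, select, textarea")) {
    if (!el.hasAttribute("data-wp-id")) {
      let id;
      do { id = makeWpId(); } while (seen.has(id)); // avoid collisions within the pass
      seen.add(id);
      el.setAttribute("data-wp-id", id);
    }
  }
  return seen.size; // number of freshly stamped elements
}
```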

At the same time, semantic HTML tags — <nav>, <main>, <form>, <header>, <footer>, <dialog> — get bare-bones surfaces created immediately. No AI. Voice activates against these instantly. Zero latency for the most common intents.

Layer 1 — Text enrichment (~1s, Gemini Flash)

The backend receives three things: the containment tree (DOM structure with positions and text), the fixed/absolute elements (sticky nav, floating buttons — flagged as likely universal controls), and the flat interactives list (every stamped element).

Gemini's job: group related elements into named surfaces, write voice-natural labels and trigger phrases including synonyms, classify the page type and purpose. The critical constraint in the prompt: every element in the interactives list must become a surface. No coverage gaps.
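A Layer 1 request body might look roughly like this — the article only names the three inputs, so the field names and shapes here are illustrative guesses:

```json
{
  "containmentTree": { "tag": "body", "rect": [0, 0, 1280, 4200], "text": "", "children": ["..."] },
  "fixedElements": [
    { "wpId": "k2m9x1", "tag": "nav", "position": "fixed", "text": "Home Shop About" }
  ],
  "interactives": [
    { "wpId": "a3f7q2", "tag": "a", "text": "Home" },
    { "wpId": "c9d2p8", "tag": "button", "text": "Add to Cart" }
  ]
}
```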

Layer 2 — Vision enrichment (~2-4s, Gemini Flash multimodal)

A screenshot is captured via chrome.tabs.captureVisibleTab(). Gemini receives the screenshot AND the Layer 1 surface map. Its job is narrower: find what DOM analysis missed. Image carousels. Icon-only buttons. Visual hierarchy. Hero CTAs identified by visual prominence. Cookie banners. Chat widgets.

Merge logic uses data-wp-id as the deduplication key. Layer 2 wins on conflicts. Layer 0 surfaces not mentioned by Gemini are kept as-is — this is the 100% coverage guarantee.
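The merge step might look like this — my own reconstruction, assuming each surface carries its stamped target so the data-wp-id can serve as the identity key:

```javascript
// Derive the dedup key: the stamped element if present, else the surface id.
function dedupKey(surface) {
  const m = /data-wp-id="([^"]+)"/.exec(surface.action?.target ?? "");
  return m ? m[1] : surface.id;
}

// Later layers win on conflicts; surfaces the later layer never mentions
// survive untouched — which is what gives the 100% coverage guarantee.
function mergeSurfaces(baseSurfaces, enrichedSurfaces) {
  const merged = new Map();
  for (const s of baseSurfaces) merged.set(dedupKey(s), s);
  for (const s of enrichedSurfaces) merged.set(dedupKey(s), s); // later layer wins
  return [...merged.values()];
}
```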


The Intent Surface Map

The output of indexing is a structured document:

```json
{
  "url": "https://example.com/shop",
  "pageType": "ecommerce",
  "purpose": "Browse and purchase products",
  "surfaces": [
    {
      "id": "nav-main",
      "type": "WAYFINDING",
      "label": "Main Navigation",
      "triggers": ["navigation", "nav", "menu", "links"],
      "confidence": 0.95,
      "action": {
        "type": "SCROLL_TO",
        "target": "[data-wp-id=\"a3f7\"]"
      }
    },
    {
      "id": "btn-cart",
      "type": "CONTROLS",
      "label": "Add to Cart",
      "triggers": ["add to cart", "buy", "purchase", "cart"],
      "action": { "type": "CLICK", "target": "[data-wp-id=\"c9d2\"]" }
    }
  ]
}
```

Four surface types: WAYFINDING, PURPOSE, CONTROLS, CONTEXT. Seven action types: OVERLAY, SCROLL_TO, FOCUS, FILL, CLICK, READ, DISMISS.

The map is cached in chrome.storage.local keyed by origin + pathname. On cache hit the map loads immediately and refreshSurfaceTargets() re-stamps the DOM — stamps don't survive page reloads.
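A sketch of that cache path, assuming helper names of my own (refreshSurfaceTargets is the article's; the rest is illustrative):

```javascript
// Keyed by origin + pathname, so query strings and hashes share one map.
function cacheKey(url) {
  const u = new URL(url);
  return `wp:${u.origin}${u.pathname}`;
}

// Hypothetical load path: on a hit, re-stamp before using the map,
// since data-wp-id attributes do not survive a reload.
async function loadCachedMap(url) {
  const key = cacheKey(url);
  const stored = await chrome.storage.local.get(key); // MV3: promise-based
  const map = stored[key];
  if (map) refreshSurfaceTargets(map); // re-stamp the live DOM
  return map ?? null;
}
```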


The voice pipeline — Gemini Live

This is the architectural bet that makes Waypoint different.

Gemini Live handles STT, NLU, TTS, and conversation state in a single bidirectional WebSocket. No separate speech-to-text service. No separate NLP layer. No separate TTS. One connection. One stream. Everything.

How it works:

The mic stream goes through an AudioWorklet that converts Float32 samples to Int16 PCM in a separate thread. Buffers are base64-encoded and sent continuously to the WebSocket as realtimeInput.mediaChunks at 16kHz.
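The Float32 → Int16 downconversion inside the worklet is a standard transform; this is my reconstruction, not Waypoint's source:

```javascript
// Convert Web Audio's Float32 samples (range [-1, 1]) to Int16 PCM.
function float32ToInt16(float32) {
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, float32[i]));
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return int16;
}
```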

Gemini streams back PCM16 audio at 24kHz. Chunks are queued using nextPlayAt — each chunk's start time is set to when the previous chunk ends, preventing gaps or overlaps.
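The queueing logic can be sketched as a small scheduler — nextPlayAt is the article's name; the class around it is my reconstruction:

```javascript
class PlaybackQueue {
  constructor(sampleRate = 24000) { // Gemini Live output rate
    this.sampleRate = sampleRate;
    this.nextPlayAt = 0;
  }

  // Return the start time for a chunk of `sampleCount` samples, given the
  // audio context's current time. Chunks butt end-to-end: no gaps, no overlaps.
  schedule(sampleCount, currentTime) {
    const startAt = Math.max(this.nextPlayAt, currentTime); // never schedule in the past
    this.nextPlayAt = startAt + sampleCount / this.sampleRate;
    return startAt;
  }

  // On barge-in, drop queued audio by resetting to "now".
  reset(currentTime) {
    this.nextPlayAt = currentTime;
  }
}
```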

The key architectural decision — tool calls:

Gemini doesn't return text that needs parsing. It returns structured function calls:

```javascript
activate_surface({ surfaceId: "btn-checkout" })
scroll_page({ direction: "down" })
enter_click_mode()
navigate_highlight({ direction: "right" })
activate_highlight()
```

When the user says "go to checkout" Gemini calls activate_surface({ surfaceId: "btn-checkout" }). The extension looks up that surface in the intent map, executes the action against the live DOM, and acknowledges the tool call back to Gemini.

Natural language understanding stays in the model. DOM interaction stays in the extension. Clean separation.
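The dispatch side of that loop might look like this — the five tool names are from the article, while the handler wiring and acknowledgement shape are my assumptions about the Live API flow:

```javascript
// Build a dispatcher over a table of tool handlers. Each handler receives the
// structured args Gemini produced and returns a result to acknowledge with.
function makeDispatcher(handlers) {
  return function handleToolCall(call) {
    const handler = handlers[call.name];
    if (!handler) return { id: call.id, error: `unknown tool: ${call.name}` };
    const result = handler(call.args ?? {});
    // Acknowledge back to Gemini so the conversation can continue.
    return { id: call.id, response: result ?? { ok: true } };
  };
}

// The real table would wire up the five tools named above:
// activate_surface, scroll_page, enter_click_mode,
// navigate_highlight, activate_highlight.
```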

The system prompt:

At session start Gemini receives the full intent map — every surface, its id, type, label, triggers, and action type. This is Gemini's complete menu of what the user can do on this page. The conversation is grounded in the actual page from the first word.

Interruption:

If the user speaks while Gemini is talking, serverContent.interrupted fires. nextPlayAt resets to currentTime — queued audio drops immediately. Barge-in works naturally.


The action execution layer

execute(surface) dispatches to one of seven handlers. The interesting part is findTarget() — a five-stage fallback chain that finds the actual DOM element even when things go wrong:

  1. Parse data-wp-id from the target string → querySelector
  2. Try the target string as a raw CSS selector
  3. Try [aria-label="<surface.label>"]
  4. Search headings for text matching surface triggers → return nearest section ancestor
  5. Broad text search across all interactive and landmark elements

This chain exists because data-wp-id stamps break on SPA navigation, dynamic renders, and cached maps. The system degrades gracefully rather than failing silently.
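A condensed sketch of the first three stages, written against a `doc` parameter so the logic is testable; stages 4–5 are summarized in a comment because the real heading/text search is more involved:

```javascript
function findTarget(doc, surface) {
  const target = surface.action?.target ?? "";

  // Stage 1: extract the data-wp-id stamp and look it up directly.
  const m = /data-wp-id="([^"]+)"/.exec(target);
  if (m) {
    const el = doc.querySelector(`[data-wp-id="${m[1]}"]`);
    if (el) return el;
  }

  // Stage 2: try the target string as a raw CSS selector.
  try {
    const el = target && doc.querySelector(target);
    if (el) return el;
  } catch { /* invalid selector — keep falling back */ }

  // Stage 3: match on the accessible label.
  const byLabel = doc.querySelector(`[aria-label="${surface.label}"]`);
  if (byLabel) return byLabel;

  // Stages 4–5 (omitted): search headings for trigger text and return the
  // nearest section ancestor, then a broad text search over interactive
  // and landmark elements.
  return null;
}
```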

doFill() focuses the field and switches voice mode to dictation — subsequent speech is text input, not a command. doSubmit() requires a confirmation utterance before executing. doDismiss() fires both an Escape keydown and clicks common close button selectors in case the modal ignores keyboard events.


Click mode

Click mode handles elements not in the intent map — or when the user wants precise control.

enter_click_mode → scans all focusable elements in the viewport, filters nav/header elements and invisible elements, sorts row-major (top-to-bottom, left-to-right). The first element highlights immediately.

The visual overlay is a position: fixed div injected directly into document.documentElement — not shadow DOM, since it needs to sit above all page content. It has a pulsing animation, a label chip, and a counter badge (3/12).

navigate_highlight(direction) cycles through elements. activate_highlight() clicks and exits click mode automatically.
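The row-major ordering can be sketched like this — elements whose tops fall within one "row" of each other sort left-to-right, otherwise top-to-bottom; the row tolerance value is my assumption:

```javascript
const ROW_TOLERANCE = 10; // px — elements this close vertically count as one row

// Each item is assumed to carry its bounding rect as { top, left }.
function sortRowMajor(items) {
  return [...items].sort((a, b) => {
    if (Math.abs(a.top - b.top) > ROW_TOLERANCE) return a.top - b.top; // top-to-bottom
    return a.left - b.left; // same row: left-to-right
  });
}
```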


The backend

Express on Cloud Run (Node 20). Two endpoints:

POST /index/text — receives the pruned containment tree + flat interactives list + Layer 0 surfaces. Calls Gemini Flash text-only. Returns enriched surface map.

POST /index/vision — receives base64 JPEG screenshot + current surface map. Calls Gemini Flash multimodal. Returns only new or improved surfaces — not a re-enumeration of everything.

GET /config — serves the Gemini API key to the extension. The key lives in the Cloud Run environment variable. Never in the extension bundle.

The Vertex AI SDK handles auth via Application Default Credentials on Cloud Run. No key management needed for backend Gemini calls.


Shadow DOM strategy

Shadow DOM is used in two places for opposite reasons:

Debug panel — isolates panel CSS from page CSS. Without it page styles bleed in.

Overlay host — contains overlay scripts and styles so they don't affect page layout.

The visual overlays (highlight box, scroll indicator) are NOT in shadow DOM — injected directly into document.documentElement. They need z-index: 2147483645 to sit above all page content. Shadow DOM creates its own stacking context that can be clipped by the page.


What was intentionally removed

Local NLU — earlier versions had fuzzy matching, keyword scoring, synonym lookup on the client. All removed. Gemini Live handles this better. The latency cost is offset by richer, more reliable results.

Web Speech API — removed entirely. Required a separate endpoint, had a network error loop on restricted networks, needed a separate NLU pass anyway.

Google Cloud TTS — removed. Gemini Live speaks its own responses. Browser speechSynthesis is kept only for system messages outside an active session.

Intent caching across pages — cache is keyed by origin + pathname. Different pages get different indexes. Surfaces are too page-specific to reuse.

Confidence threshold filtering — every surface Gemini returns is kept regardless of confidence. The model's judgment is trusted. Filtering happens at query time via Gemini Live.


What I learned

The hardest problem wasn't the AI integration. It was defining intent precisely enough to be useful without being so narrow it became another command parser.

The data-wp-id stamping approach — a persistent bridge between AI output and live DOM — turned out to be the most important implementation detail in the entire system. Everything depends on it.

And the architectural bet on Gemini Live as the single voice pipeline — STT, NLU, TTS, conversation state all in one WebSocket — was the right call. Simpler system. Better results. Natural conversation without any orchestration layer.


What's next

Persistent intent maps in Firestore — returning to a visited page is instant. Cross-session learning. A shared intent vocabulary across all Waypoint users — crowdsourced, Gemini-validated, continuously improving.

The web was built for everyone. Waypoint is making that promise real.


Built with Gemini 2.0 Flash, Gemini Live API, Vertex AI, Cloud Run, Chrome Extensions Manifest V3, and the Google GenAI SDK.

#GeminiLiveAgentChallenge
