David Nzagha

Posted on Mar 16

Building WebClaw: A Personal Live AI Support Agent That Actually Operates Websites

#agents #ai

What if every website had an intelligent, voice-enabled agent that could actually do things for you, not just answer questions into the void?

How we built a live agent that sees, hears, speaks, and acts on web pages, powered by Google Gemini's bidirectional audio streaming.

The Problem Nobody Talks About

Website support is stuck in 2015. Chat widgets serve canned responses. Users abandon carts because they can't find the checkout button. Support staff asks, "Can you send a screenshot?" because they can't see what the user sees.

Every interaction starts from zero.

We wanted to build something different: an AI agent that doesn't just answer questions about a website, but actually operates it alongside you. Click buttons. Fill forms. Navigate pages. All through natural conversation.

That's WebClaw.

What WebClaw Does

WebClaw is a voice-first AI agent that lives on websites. Site owners add a single <script> tag (the same pattern as Google Analytics), and visitors get an animated avatar they can talk to. The agent can:

See the page through DOM snapshots and screenshots
Hear the user through real-time microphone streaming
Speak back with natural voice (supporting barge-in)
Act on the DOM: clicking, typing, scrolling, navigating, highlighting

Say "help me check out," and the agent navigates to the cart, walks you through the form, and confirms the order. It doesn't paste a link to the checkout page.

There's also a Chrome Extension mode (Personal Agent) that works on any website, not just integrated ones. When a Personal Agent visits a WebClaw-integrated site, the two negotiate: the site shares its knowledge base, and the agent keeps the user's data private. Privacy by architecture, not policy.

The Tech Stack

Gemini Live API: The Core

The entire project hinges on Gemini's bidiGenerateContent method. This isn't request-response; it's a persistent bidirectional stream. Audio flows in both directions simultaneously. The user can interrupt the agent mid-sentence (barge-in). The agent can call functions while speaking.

We use gemini-2.5-flash-native-audio-preview-12-2025 because it supports bidiGenerateContent for live audio with native voice generation and function calling. The native audio model produces higher-quality, more natural voice output than the older experimental models.

Google ADK: Agent Scaffolding

The Agent Development Kit handles the ceremony: session management, function-calling schema generation, and the LiveRequestQueue abstraction for feeding audio and text into the bidi stream. Our agent definition is clean:

root_agent = Agent(
    name="webclaw_agent",
    model="gemini-2.5-flash-native-audio-preview-12-2025",
    instruction=WEBCLAW_SYSTEM_PROMPT,
    tools=DOM_TOOLS,
)

Ten DOM tools are registered as typed Python functions. ADK converts them to Gemini function-calling schemas automatically. When the model decides to click a button, it returns a function_call event that we forward to the browser for execution.

The Gateway: FastAPI + WebSocket

The gateway sits between the browser and Gemini, running on Cloud Run. Each visitor gets a WebSocket connection that spawns two async tasks:

Browser --> Upstream Task --> LiveRequestQueue --> Gemini
                                                    |
Browser <-- Downstream Task <-- runner.run_live() <-+

Full duplex. The user speaks while the agent responds. DOM action results flow back while the agent processes the next thought.

Why not connect the browser directly to Gemini? Four reasons:

Privacy. The gateway enforces asymmetric context sharing. Site JavaScript can't intercept user data.
Security. DOM actions are validated against the site's permission list before execution.
Scalability. Cloud Run auto-scales with session affinity for WebSocket stability.
Analytics. Session history, message counts, and action metrics are all stored in Firestore.

The Embed Script: 26.1KB of TypeScript

The client-side embed script runs in a closed Shadow DOM (complete CSS isolation from the host page) and weighs 26.1KB minified. No runtime dependencies. It includes:

Animated avatar (Canvas 2D with lip-sync, eye blinks, state transitions)
Action visualizer (avatar flies to target elements via bezier curve with trail particles)
Screenshot capture (Canvas-based viewport rendering for vision context)
Audio pipeline (16kHz mic capture, 24kHz playback, raw PCM, no codecs)
Smart element finder (CSS selector, then aria-label, then text content match)

We chose esbuild for bundling: 2ms build time, zero config. The avatar uses Canvas 2D instead of Lottie (+50KB) or Three.js (+150KB) because every kilobyte matters in an embed script loaded on every page view.

Hard Problems We Solved

The Model Migration

We started development targeting gemini-2.0-flash-live-001. Midway through, we discovered it no longer exists in the API. We queried every available model for bidiGenerateContent support and found several options:

Model	bidi	generate	Notes
`gemini-2.5-flash-native-audio-preview-12-2025`	✅	❌	Current default; native audio with function calling
`gemini-2.5-flash-native-audio-latest`	✅	❌	Tracks latest stable native-audio model
`gemini-2.0-flash-exp-image-generation`	✅	✅	Legacy; broadest capability but older

We chose the native audio model for its superior voice quality and lower latency in real-time conversations.

Token-Efficient DOM Serialization

A typical webpage's full DOM is 50,000+ tokens. We needed to fit page context into the LLM's window without drowning out the conversation. Our serializer:

Includes only interactive elements (buttons, links, inputs) and semantic elements (headings, nav, main)
Excludes scripts, styles, SVGs, and iframes
Caps depth at 3 levels
Caps output at 4,000 characters

The result captures the page's interactive surface area in roughly 500 tokens.

Action Visualization

When the agent clicks a button, users need to see what happened. We built a Bezier flight animation: a glowing indicator launches from the avatar, arcs upward, and lands on the target element with a pulse ring effect. Trail particles follow with a delayed, staggered motion. The whole animation runs in 600ms using requestAnimationFrame with cubic ease-in-out.

This transforms "the agent clicked something" into "I watched the agent fly to that button and click it." Agency you can see.

Asymmetric Privacy

When a Personal Agent (Chrome Extension) visits a WebClaw-integrated site, context flows in one direction:

Site Knowledge ------> Agent Context <------ User Preferences
    (public)              (merged)              (private)
                                                   |
                                             NEVER flows to
                                             ----> Site Owner

This isn't a policy in the terms of service. It's infrastructure. The gateway physically can't leak user data to the site because the data flow is architecturally one-directional. The negotiation protocol (negotiate / negotiate_ack messages) establishes what the site offers and what the agent may do, without exposing who the user is or what they've done on other sites.

What We Built in Numbers

Component	Metric
Gateway	18 REST endpoints, WebSocket bidi streaming
Embed script	26.1KB minified, 8 TypeScript modules
Chrome Extension	MV3, 4 files, negotiation protocol
Dashboard	Vanilla HTML/JS, 5 pages, no build step
Documentation	14 pages, 3,700+ lines, 150KB+
DOM tools	8 operations (click, type, scroll, navigate, highlight, read, select, check)
Firestore collections	4 (sites, sessions, knowledge, stats)

Lessons Learned

Voice-first changes everything. When you design for speech as the primary modality, the entire UX shifts. Text fallback is trivial to add; designing for voice from scratch is not. We built the audio pipeline first and added text second.

Shadow DOM is non-negotiable for embeds. We tried CSS namespacing first. Host pages with * { box-sizing: border-box; margin: 0; } destroyed our overlay. Closed Shadow DOM solves it permanently, and unlike iframes, we can still observe and act on the host page's DOM.

Raw PCM beats codecs for live streaming. Gemini accepts and produces raw PCM. Adding Opus encoding/decoding would introduce latency and complexity for zero benefit. Shortest path wins.

Build the dashboard last. We almost built a React dashboard early on. Instead, we finished all features first, then wrote a single HTML file with vanilla JS that calls the same REST API. 640 lines, no build step, no dependencies, ships in the Docker image. Sometimes the boring solution is the right one.

Try It

WebClaw is open source. Clone the repo, set your GOOGLE_API_KEY, and run:

cd gateway && pip install -r requirements.txt && uvicorn main:app --port 8081
cd embed && npm install && npm run build
# Open http://localhost:8081/demo

Or deploy to Cloud Run with a single command:

cd infra && ./deploy.sh

The demo site is a fake e-commerce store (TechByte) where you can ask the agent to add products to the cart, find specific items, or navigate the FAQ.

Built by David Nzagha and the Nzagha Ventures team for the Gemini Live Agent Challenge.

WebClaw is a submission to the Gemini API Developer Competition (Live Agents category). Built with Gemini Live API, Google ADK, and Google Cloud Run.

DEV Community