What if you could control your browser the way you'd direct a person — just point at something and say what you want?
That question led us to build Wand, a live AI agent that lets you browse the web entirely through voice and hand gestures. No keyboard. No mouse. Point your finger at a YouTube thumbnail and say "play this" — it clicks. Point at a map and say "zoom in here" — it scrolls. Say "what is this?" — it takes a screenshot, annotates it with your cursor position, and tells you what you're pointing at.
Here's how we built it.
The Architecture: Cloud Agent, Local Browser
The first design decision was where things live.
The agent — the part that listens, reasons, and decides what to do — runs on Google Cloud Run, powered by Google ADK and Gemini 2.5 Flash Native Audio via the Gemini Live API. This gives us a stable, always-on backend that any client can connect to without needing API keys or local GPU resources.
The browser, microphone, speaker, and webcam stay on the local machine. This is non-negotiable: Playwright needs access to real screen coordinates to click where your finger is pointing, and MediaPipe needs the webcam feed to track your hand.
These two halves communicate over a persistent WebSocket. The client streams PCM16 audio up to the server, the server streams audio responses back down, and browser actions are forwarded as JSON tool_call / tool_result messages.
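To make the wire protocol concrete, here is a minimal sketch of the JSON envelope. The `tool_call` / `tool_result` message types are from the design above; every other field name (`id`, `tool`, `args`, `result`) is an illustrative assumption, not Wand's exact format.

```python
import json

def make_tool_call(call_id: str, tool: str, args: dict) -> str:
    # Server -> client: "execute this browser action"
    return json.dumps({"type": "tool_call", "id": call_id, "tool": tool, "args": args})

def make_tool_result(call_id: str, result: dict) -> str:
    # Client -> server: "here is what happened"
    return json.dumps({"type": "tool_result", "id": call_id, "result": result})

def dispatch(raw: str, handlers: dict) -> dict:
    """Route an incoming message to the handler registered for its type."""
    msg = json.loads(raw)
    return handlers[msg["type"]](msg)
```

Keeping every message a small self-describing JSON object means the same socket can carry tool traffic, cursor updates, and control messages side by side.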
Multi-Agent Design with Google ADK
Wand uses three agents, each with a well-defined domain:
- `concierge` (root, Gemini 2.5 Flash Native Audio) — receives the voice stream and routes intent. Browser task? Transfer to `browser_agent`. Factual question? Call `search_agent`. Pure conversation? Handle it directly.
- `browser_agent` (sub-agent, Gemini 2.5 Flash Native Audio) — controls the browser. Decides which action to take and calls remote tools (`navigate`, `click_here`, `scroll_here`, `drag_here`, `screenshot`).
- `search_agent` (wrapped as an ADK `AgentTool`) — answers factual and real-time questions using the built-in `google_search` tool. Returns control to `concierge` after answering.
One of the most valuable things we learned from ADK is that topology matters. There are three patterns:
- Sub-agent (`sub_agents=[...]`) — full ownership transfer. The new agent takes over the conversation. Use this when the task domain fully switches.
- AgentTool (`AgentTool(agent=...)`) — the agent is called like a function and returns a result to the caller. Use this when you need the answer back in the current context.
- Direct tool — no agent, just a function. Use this for deterministic, side-effectful actions like clicking or navigating.
browser_agent is a sub-agent because the user is now in "browser mode" — the agent owns the conversation until the task is done. search_agent is an AgentTool because concierge needs the answer to continue the conversation.
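The wiring can be sketched with minimal stand-in classes. These are not the real `google.adk` types (the actual code constructs ADK agents with model and instruction configuration), but the topology — `browser_agent` under `sub_agents`, `search_agent` wrapped as a tool — matches the design described above:

```python
from dataclasses import dataclass, field

# Minimal stand-ins for ADK's Agent / AgentTool, just to show the topology.
@dataclass
class Agent:
    name: str
    instruction: str = ""
    tools: list = field(default_factory=list)
    sub_agents: list = field(default_factory=list)

@dataclass
class AgentTool:
    agent: "Agent"  # called like a function; its result returns to the caller

search_agent = Agent(
    name="search_agent",
    instruction="Answer factual and real-time questions.",
    tools=["google_search"],
)

browser_agent = Agent(
    name="browser_agent",
    instruction="Control the browser via remote tools.",
    tools=["navigate", "click_here", "scroll_here", "drag_here", "screenshot"],
)

# concierge routes intent: browser_agent is a full handoff,
# search_agent's answer comes back into concierge's own context.
concierge = Agent(
    name="concierge",
    instruction="Route intent: browser task, search, or conversation.",
    tools=[AgentTool(agent=search_agent)],
    sub_agents=[browser_agent],
)
```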
The Pointer Problem: How "Here" Works
The most distinctive feature of Wand is pointer-aware actions. When the user says "click here", the agent needs to know where "here" is.
Our solution is a split: the server sends the click_here tool call with no coordinates. The client reads the cursor position locally from the hand tracker at the moment Playwright executes the click. This ensures the action always targets the freshest cursor position — not a cached value that may have drifted over the network.
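A sketch of the client side of this split. The tracker class and handler signature are assumptions; the key point from above is that coordinates are read at execution time, not taken from the server's message:

```python
import threading

class CursorTracker:
    """Thread-safe holder for the latest fingertip position
    (stand-in for the MediaPipe feed, which updates it at ~20Hz)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._pos = (0, 0)

    def update(self, x: int, y: int):
        with self._lock:
            self._pos = (x, y)

    def read(self):
        with self._lock:
            return self._pos

def handle_click_here(msg: dict, tracker: CursorTracker, page) -> dict:
    # The server sent no coordinates: read the cursor at the moment of
    # execution so the click targets the freshest position.
    x, y = tracker.read()
    page.mouse.click(x, y)  # Playwright mouse click in the real client
    return {"type": "tool_result", "id": msg["id"], "result": {"clicked_at": [x, y]}}
```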
For remote_screenshot, the server does read from the cursor cache (updated at 20Hz via WebSocket) — to annotate the screenshot image with a cursor dot before injecting it into Gemini's context.
Hand tracking uses MediaPipe to detect the index fingertip in each webcam frame, maps it to screen coordinates via a 4-point calibration, and streams positions at ~20Hz over WebSocket.
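The calibration math reduces to mapping a normalized webcam coordinate into screen pixels. This sketch makes a simplifying assumption — that the four calibration points form a roughly axis-aligned rectangle, so a linear rescale suffices; a full homography would be needed to correct perspective skew:

```python
def calibrate(corners):
    """corners: normalized (u, v) webcam coords of the four screen
    corners, in any order. Returns the bounding box for the linear map.
    (Assumes the calibration quad is roughly axis-aligned.)"""
    us = [u for u, _ in corners]
    vs = [v for _, v in corners]
    return min(us), min(vs), max(us), max(vs)

def to_screen(u, v, box, screen_w, screen_h):
    """Map a normalized fingertip position into pixel coordinates."""
    u0, v0, u1, v1 = box
    x = (u - u0) / (u1 - u0) * screen_w
    y = (v - v0) / (v1 - v0) * screen_h
    # Clamp so a hand drifting outside the calibration area stays on-screen.
    return (min(max(round(x), 0), screen_w - 1),
            min(max(round(y), 0), screen_h - 1))
```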
Making Gemini Live Stable: The Audio Gate
Running Gemini Live in a multi-agent setup introduced a subtle crash: APIError 1007.
When concierge transfers to browser_agent, there are buffered audio chunks in the pipeline that belong to the old session context. If those chunks arrive at the new agent's session, Gemini rejects them — crashing the session.
The fix is an audio gate: a per-session flag that blocks microphone audio from being sent to the ADK queue during agent handoffs. When transfer_to_agent fires:
- The gate closes (`allow_audio_upload = False`)
- The audio backlog is flushed (`drop_realtime_backlog()`)
- The gate reopens automatically after 1.25 seconds
The same gate queue is also used to inject screenshot JPEGs inline into Gemini's audio stream — so the agent can literally see the screen on demand.
Barge-In Across a Network Boundary
Gemini Live has built-in barge-in: if the user starts speaking, the agent stops. But this assumes audio input and output share the same process. In our split architecture, they don't.
When the server detects an interruption, it sends:
{"type": "interrupt"}
down the WebSocket. The client immediately clears its audio playback buffer — silence within ~43ms (one PortAudio block). This gives us barge-in that feels native even across a cloud/local boundary.
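On the client this amounts to one special case in the message handler: an `interrupt` message empties the playback queue instead of being routed like a tool call. The buffer class here is a stand-in for the real PortAudio-backed output queue:

```python
import json
from collections import deque

class Playback:
    """Client-side audio playback queue; the audio callback drains one
    block per tick, so clearing it silences output within one block."""
    def __init__(self):
        self.blocks = deque()

    def enqueue(self, block: bytes):
        self.blocks.append(block)

    def clear(self):
        self.blocks.clear()

def on_server_message(raw: str, playback: Playback) -> str:
    msg = json.loads(raw)
    if msg.get("type") == "interrupt":
        playback.clear()  # user barged in: stop the agent's voice now
        return "interrupted"
    return msg.get("type")
```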
Auto-Recovery
Sessions crash. APIError 1007, network hiccups, Cloud Run cold starts — all of these disconnect the WebSocket.
The client runtime runs a persistent reconnection loop. On any disconnect, it waits 2 seconds, generates a fresh session ID, and reconnects. The server creates a new ADK session on each connection.
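A minimal sketch of that loop. The `connect` coroutine and its signature are assumptions; the structure matches the post: on disconnect, wait, mint a fresh session ID, try again:

```python
import asyncio
import uuid

async def run_with_reconnect(connect, max_attempts: int = 5, delay: float = 2.0):
    """Keep a session alive: on any disconnect, wait, generate a fresh
    session ID, and reconnect (the server builds a new ADK session)."""
    attempts = 0
    while attempts < max_attempts:
        session_id = f"wand-{uuid.uuid4().hex[:8]}"
        attempts += 1
        try:
            await connect(session_id)
            return attempts  # session ended cleanly
        except ConnectionError:
            await asyncio.sleep(delay)  # 2 seconds in the real client
    raise RuntimeError("gave up reconnecting")
```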
The UI shows “Reconnecting…” and recovers automatically — the user never needs to intervene.
What We Learned
Diagnose before fixing
Two debugging sessions were wasted on wrong hypotheses. Structured logging at key points — audio gate state, cursor updates, and agent transfer events — immediately revealed the real causes.
Prompt boundary clarity improves the whole agent team
Small ambiguities cause consistent misbehavior. Explicitly enumerating what each agent handles — including edge cases — reduced misrouting dramatically.
AEC is still an open problem
The agent's voice leaks back into the microphone and gets re-transcribed as user input.
We experimented with AEC (speexdsp) and RMS-based gating. Both approaches introduce trade-offs with barge-in responsiveness.
For now we rely on headphones as a practical workaround and consider this an open engineering problem.
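For reference, the RMS gating we tried reduces to a few lines. Threshold and frame handling here are illustrative, but the trade-off is visible in the code: anything below the threshold is dropped, including a quiet barge-in from the user:

```python
import math
import struct

def rms(frame: bytes) -> float:
    """RMS level of a little-endian PCM16 frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)

def gate_frame(frame: bytes, threshold: float = 500.0):
    """Drop frames below the threshold — crude echo suppression.
    The cost: a soft-spoken interruption gets dropped too."""
    return frame if rms(frame) >= threshold else None
```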
Ownership boundaries matter
In a split cloud/local architecture, every feature forces an explicit decision:
- Who reads the cursor?
- Who owns the browser?
- Who manages audio?
Getting these boundaries right early prevents a whole class of subtle bugs later.
What's Next
There are several directions we want to explore next:
- MCP integration: replace the custom WebSocket tool bridge with the Model Context Protocol so Wand's local capabilities can be reused by any MCP-compatible agent.
- Eye tracking: complement hand tracking with gaze detection as a more natural pointing modality.
- Session memory: persist conversation history in Firestore so context survives reconnects.
- Low-latency video control: build a dedicated media agent with playback-state awareness to compensate for network latency when executing commands like "pause here".
This post was created for the purposes of entering the Gemini Live Agent Hackathon.