<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: lifuyuan</title>
    <description>The latest articles on DEV Community by lifuyuan (@lifuyuan).</description>
    <link>https://dev.to/lifuyuan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3828210%2Ff8098c8a-383c-4a12-9ffe-a44da25697c6.png</url>
      <title>DEV Community: lifuyuan</title>
      <link>https://dev.to/lifuyuan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lifuyuan"/>
    <language>en</language>
    <item>
      <title>Building Wand: A Voice + Hand Pointer Live Agent with Google ADK and Gemini Live</title>
      <dc:creator>lifuyuan</dc:creator>
      <pubDate>Mon, 16 Mar 2026 23:23:10 +0000</pubDate>
      <link>https://dev.to/lifuyuan/building-wand-a-voice-hand-pointer-live-agent-with-google-adk-and-gemini-live-2fp7</link>
      <guid>https://dev.to/lifuyuan/building-wand-a-voice-hand-pointer-live-agent-with-google-adk-and-gemini-live-2fp7</guid>
      <description>&lt;p&gt;What if you could control your browser the way you'd direct a person — just point at something and say what you want?&lt;/p&gt;

&lt;p&gt;That question led us to build &lt;strong&gt;Wand&lt;/strong&gt;, a live AI agent that lets you browse the web entirely through voice and hand gestures. No keyboard. No mouse. Point your finger at a YouTube thumbnail and say "play this" — it clicks. Point at a map and say "zoom in here" — it scrolls. Say "what is this?" — it takes a screenshot, annotates it with your cursor position, and tells you what you're pointing at.&lt;/p&gt;

&lt;p&gt;Here's how we built it.&lt;/p&gt;




&lt;h2&gt;The Architecture: Cloud Agent, Local Browser&lt;/h2&gt;

&lt;p&gt;The first design decision was where things live.&lt;/p&gt;

&lt;p&gt;The agent — the part that listens, reasons, and decides what to do — runs on &lt;strong&gt;Google Cloud Run&lt;/strong&gt;, powered by &lt;strong&gt;Google ADK&lt;/strong&gt; and &lt;strong&gt;Gemini 2.5 Flash Native Audio&lt;/strong&gt; via the Gemini Live API. This gives us a stable, always-on backend that any client can connect to without needing API keys or local GPU resources.&lt;/p&gt;

&lt;p&gt;The browser, microphone, speaker, and webcam stay on the &lt;strong&gt;local machine&lt;/strong&gt;. This is non-negotiable: Playwright needs access to real screen coordinates to click where your finger is pointing, and MediaPipe needs the webcam feed to track your hand.&lt;/p&gt;

&lt;p&gt;These two halves communicate over a persistent WebSocket. The client streams PCM16 audio up to the server, the server streams audio responses back down, and browser actions are forwarded as JSON &lt;code&gt;tool_call&lt;/code&gt; / &lt;code&gt;tool_result&lt;/code&gt; messages.&lt;/p&gt;
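&lt;p&gt;A minimal sketch of what that &lt;code&gt;tool_call&lt;/code&gt; / &lt;code&gt;tool_result&lt;/code&gt; envelope can look like in stdlib Python; only the two &lt;code&gt;type&lt;/code&gt; values come from the protocol described here, the other field names are illustrative:&lt;/p&gt;

```python
import json

# Illustrative envelopes for the tool channel. Only the two "type"
# values are from the real protocol; "id", "args", "ok", "payload"
# are assumed field names for the sake of the sketch.

def make_tool_call(call_id: str, tool: str, args: dict) -> str:
    """Server -> client: ask the local runtime to run a browser tool."""
    return json.dumps({"type": "tool_call", "id": call_id,
                       "tool": tool, "args": args})

def make_tool_result(call_id: str, ok: bool, payload: dict) -> str:
    """Client -> server: report the outcome of the tool execution."""
    return json.dumps({"type": "tool_result", "id": call_id,
                       "ok": ok, "payload": payload})
```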




&lt;h2&gt;Multi-Agent Design with Google ADK&lt;/h2&gt;

&lt;p&gt;Wand uses three agents, each with a well-defined domain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;concierge&lt;/code&gt;&lt;/strong&gt; (root, Gemini 2.5 Flash Native Audio) — receives the voice stream and routes intent. Browser task? Transfer to &lt;code&gt;browser_agent&lt;/code&gt;. Factual question? Call &lt;code&gt;search_agent&lt;/code&gt;. Pure conversation? Handle it directly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;browser_agent&lt;/code&gt;&lt;/strong&gt; (sub-agent, Gemini 2.5 Flash Native Audio) — controls the browser. Decides which action to take and calls remote tools (&lt;code&gt;navigate&lt;/code&gt;, &lt;code&gt;click_here&lt;/code&gt;, &lt;code&gt;scroll_here&lt;/code&gt;, &lt;code&gt;drag_here&lt;/code&gt;, &lt;code&gt;screenshot&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;search_agent&lt;/code&gt;&lt;/strong&gt; (wrapped as an ADK &lt;code&gt;AgentTool&lt;/code&gt;) — answers factual and real-time questions using the built-in &lt;code&gt;google_search&lt;/code&gt; tool. Returns control to &lt;code&gt;concierge&lt;/code&gt; after answering.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the most valuable things we learned from ADK is that &lt;strong&gt;topology matters&lt;/strong&gt;. There are three patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sub-agent&lt;/strong&gt; (&lt;code&gt;sub_agents=[]&lt;/code&gt;) — full ownership transfer. The new agent takes over the conversation. Use this when the task domain fully switches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AgentTool&lt;/strong&gt; (&lt;code&gt;AgentTool(agent=...)&lt;/code&gt;) — the agent is called like a function and returns a result to the caller. Use this when you need the answer back in the current context.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Direct tool&lt;/strong&gt; — no agent, just a function. Use this for deterministic, side-effectful actions like clicking or navigating.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;browser_agent&lt;/code&gt; is a sub-agent because the user is now in "browser mode" — the agent owns the conversation until the task is done. &lt;code&gt;search_agent&lt;/code&gt; is an AgentTool because &lt;code&gt;concierge&lt;/code&gt; needs the answer to continue the conversation.&lt;/p&gt;
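&lt;p&gt;A toy, stdlib-only model of how control flows under these three patterns. Plain functions stand in for the ADK agents, and the keyword checks stand in for the LLM's routing decision; none of this is the real implementation:&lt;/p&gt;

```python
def click_here():
    """Direct tool: a deterministic, side-effectful action."""
    return {"action": "click", "at": "cursor"}

def search_agent(query):
    """Agent-as-tool: produces an answer that returns to the caller."""
    return f"search results for {query!r}"

def browser_agent(utterance):
    """Sub-agent: takes over the turn entirely and owns the outcome."""
    return {"handled_by": "browser_agent", "result": click_here()}

def concierge(utterance):
    # Toy intent routing; the real concierge is an LLM choosing between
    # transfer_to_agent, an AgentTool call, or answering directly.
    if "click" in utterance:
        return browser_agent(utterance)       # full ownership transfer
    if utterance.startswith("what"):
        answer = search_agent(utterance)      # called like a function,
        return {"handled_by": "concierge",    # answer stays in context
                "result": answer}
    return {"handled_by": "concierge", "result": "(chit-chat reply)"}
```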




&lt;h2&gt;The Pointer Problem: How "Here" Works&lt;/h2&gt;

&lt;p&gt;The most distinctive feature of Wand is pointer-aware actions. When the user says "click here", the agent needs to know where "here" is.&lt;/p&gt;

&lt;p&gt;Our solution splits the responsibility: the server sends the &lt;code&gt;click_here&lt;/code&gt; tool call &lt;strong&gt;with no coordinates&lt;/strong&gt;, and the client reads the cursor position &lt;strong&gt;locally&lt;/strong&gt; from the hand tracker at the moment Playwright executes the click. This ensures the action always targets the freshest cursor position — not a cached value that may have drifted over the network.&lt;/p&gt;
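&lt;p&gt;The client-side handler can be sketched like this in stdlib Python; the tracker and the click callback are stand-ins for MediaPipe and Playwright:&lt;/p&gt;

```python
class HandTracker:
    """Stand-in for the MediaPipe tracker: position() always returns
    the latest fingertip location in screen coordinates."""
    def __init__(self):
        self._pos = (0, 0)

    def update(self, x, y):
        self._pos = (x, y)

    def position(self):
        return self._pos

def handle_click_here(tracker, click):
    """The tool call arrives with no coordinates; read the cursor at
    the last possible moment, right before the click executes."""
    x, y = tracker.position()
    click(x, y)  # stand-in for Playwright's mouse click
    return {"type": "tool_result", "tool": "click_here", "x": x, "y": y}
```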

&lt;p&gt;For &lt;code&gt;remote_screenshot&lt;/code&gt;, the server does read from the cursor cache (updated at 20 Hz over the WebSocket) so it can annotate the screenshot image with a cursor dot before injecting it into Gemini's context.&lt;/p&gt;

&lt;p&gt;Hand tracking uses MediaPipe to detect the index fingertip in each webcam frame, maps it to screen coordinates via a 4-point calibration, and streams positions at ~20Hz over WebSocket.&lt;/p&gt;
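&lt;p&gt;One simple form of that calibration, assuming the four recorded fingertip positions form a roughly axis-aligned rectangle in camera space (a full homography would also handle perspective skew; this is a simplification for illustration):&lt;/p&gt;

```python
def calibrate(corners_cam, screen_w, screen_h):
    """corners_cam: fingertip positions (x, y) recorded while the user
    points at the four screen corners, in order TL, TR, BL, BR.
    Returns a function mapping camera coords to screen pixels."""
    tl, tr, bl, br = corners_cam
    left   = (tl[0] + bl[0]) / 2   # average the paired edges to
    right  = (tr[0] + br[0]) / 2   # smooth out calibration jitter
    top    = (tl[1] + tr[1]) / 2
    bottom = (bl[1] + br[1]) / 2

    def to_screen(x, y):
        u = (x - left) / (right - left)    # 0..1 across the quad
        v = (y - top) / (bottom - top)
        # clamp so jitter at the edges cannot leave the screen
        u = min(max(u, 0.0), 1.0)
        v = min(max(v, 0.0), 1.0)
        return (round(u * (screen_w - 1)), round(v * (screen_h - 1)))

    return to_screen
```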




&lt;h2&gt;Making Gemini Live Stable: The Audio Gate&lt;/h2&gt;

&lt;p&gt;Running Gemini Live in a multi-agent setup introduced a subtle crash: &lt;strong&gt;APIError 1007&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;concierge&lt;/code&gt; transfers to &lt;code&gt;browser_agent&lt;/code&gt;, there are buffered audio chunks in the pipeline that belong to the old session context. If those chunks arrive at the new agent's session, Gemini rejects them — crashing the session.&lt;/p&gt;

&lt;p&gt;The fix is an &lt;strong&gt;audio gate&lt;/strong&gt;: a per-session flag that blocks microphone audio from being sent to the ADK queue during agent handoffs. When &lt;code&gt;transfer_to_agent&lt;/code&gt; fires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The gate closes (&lt;code&gt;allow_audio_upload = False&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The audio backlog is flushed (&lt;code&gt;drop_realtime_backlog()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The gate reopens automatically after 1.25 seconds&lt;/li&gt;
&lt;/ol&gt;
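&lt;p&gt;The handoff sequence above can be sketched with stdlib threading. The flag and method names (&lt;code&gt;allow_audio_upload&lt;/code&gt;, &lt;code&gt;drop_realtime_backlog&lt;/code&gt;) come from the description; everything else is an illustrative simplification of the real session loop:&lt;/p&gt;

```python
import threading

class AudioGate:
    """Per-session gate that drops mic audio during agent handoffs."""
    def __init__(self, reopen_after=1.25):
        self.reopen_after = reopen_after
        self.allow_audio_upload = True
        self._backlog = []              # chunks queued toward the ADK session
        self._lock = threading.Lock()

    def push(self, chunk):
        """Called for every incoming mic chunk; returns True if queued."""
        with self._lock:
            if not self.allow_audio_upload:
                return False            # gate closed: drop the chunk
            self._backlog.append(chunk)
            return True

    def drop_realtime_backlog(self):
        with self._lock:
            self._backlog.clear()

    def on_transfer(self):
        """Fired when transfer_to_agent triggers a handoff."""
        with self._lock:
            self.allow_audio_upload = False          # 1. close the gate
        self.drop_realtime_backlog()                 # 2. flush stale audio
        timer = threading.Timer(self.reopen_after,   # 3. reopen later
                                self._reopen)
        timer.daemon = True
        timer.start()

    def _reopen(self):
        with self._lock:
            self.allow_audio_upload = True
```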

&lt;p&gt;The same gate queue is also used to inject screenshot JPEGs inline into Gemini's audio stream — so the agent can literally see the screen on demand.&lt;/p&gt;




&lt;h2&gt;Barge-In Across a Network Boundary&lt;/h2&gt;

&lt;p&gt;Gemini Live has built-in barge-in: if the user starts speaking, the agent stops. But this assumes audio input and output share the same process. In our split architecture, they don't.&lt;/p&gt;

&lt;p&gt;When the server detects an interruption, it sends:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"interrupt"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;down the WebSocket. The client immediately clears its audio playback buffer — silence within ~43ms (one PortAudio block). This gives us barge-in that feels native even across a cloud/local boundary.&lt;/p&gt;
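&lt;p&gt;On the client side, handling that message amounts to clearing the playback queue. A sketch (in the real client, the buffer feeds a PortAudio callback):&lt;/p&gt;

```python
import json
from collections import deque

class Playback:
    """Client-side playback buffer fed by server audio chunks."""
    def __init__(self):
        self.buffer = deque()

    def enqueue(self, chunk):
        self.buffer.append(chunk)

    def handle_message(self, raw):
        msg = json.loads(raw)
        if msg.get("type") == "interrupt":
            # Drop everything queued; the next audio callback then
            # emits silence, so the agent goes quiet within one block.
            self.buffer.clear()
            return "interrupted"
        return msg.get("type")
```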

&lt;h2&gt;Auto-Recovery&lt;/h2&gt;

&lt;p&gt;Sessions crash. APIError 1007, network hiccups, Cloud Run cold starts — all of these disconnect the WebSocket.&lt;/p&gt;

&lt;p&gt;The client runtime runs a persistent reconnection loop. On any disconnect, it waits 2 seconds, generates a fresh session ID, and reconnects. The server creates a new ADK session on each connection.&lt;/p&gt;

&lt;p&gt;The UI shows &lt;strong&gt;“Reconnecting…”&lt;/strong&gt; and recovers automatically — the user never needs to intervene.&lt;/p&gt;
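&lt;p&gt;The loop is deliberately simple. A sketch, with &lt;code&gt;connect&lt;/code&gt; standing in for one full WebSocket session:&lt;/p&gt;

```python
import time
import uuid

def run_client(connect, max_attempts=None, retry_delay=2.0):
    """Persistent reconnection loop. `connect(session_id)` stands in
    for one full WebSocket session: it returns when the connection
    drops and raises if it cannot be established at all."""
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        session_id = uuid.uuid4().hex   # fresh ADK session per connection
        try:
            connect(session_id)         # blocks for the session lifetime
        except ConnectionError:
            pass                        # treat like any other disconnect
        time.sleep(retry_delay)         # the post-disconnect wait
```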

&lt;h2&gt;What We Learned&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Diagnose before fixing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two debugging sessions were wasted on wrong hypotheses. Structured logging at key points — audio gate state, cursor updates, and agent transfer events — immediately revealed the real causes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt boundary clarity improves the whole agent team&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Small ambiguities cause consistent misbehavior. Explicitly enumerating what each agent handles — including edge cases — reduced misrouting dramatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AEC is still an open problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent's voice leaks back into the microphone and gets re-transcribed as user input. We experimented with AEC (speexdsp) and RMS-based gating; both approaches introduce trade-offs with barge-in responsiveness.&lt;/p&gt;

&lt;p&gt;For now we rely on headphones as a practical workaround and consider this an open engineering problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ownership boundaries matter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a split cloud/local architecture, every feature forces an explicit decision:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who reads the cursor?&lt;/li&gt;
&lt;li&gt;Who owns the browser?&lt;/li&gt;
&lt;li&gt;Who manages audio?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Getting these boundaries right early prevents a whole class of subtle bugs later.&lt;/p&gt;

&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;There are several directions we want to explore next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MCP integration&lt;/strong&gt;&lt;br&gt;
Replace the custom WebSocket tool bridge with the Model Context Protocol so Wand's local capabilities can be reused by any MCP-compatible agent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Eye tracking&lt;/strong&gt;&lt;br&gt;
Complement hand tracking with gaze detection as a more natural pointing modality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Session memory&lt;/strong&gt;&lt;br&gt;
Persist conversation history in Firestore so context survives reconnects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low-latency video control&lt;/strong&gt;&lt;br&gt;
Build a dedicated media agent with playback-state awareness to compensate for network latency when executing commands like "pause here".&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;This post was created for the purposes of entering the Gemini Live Agent Hackathon.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>geminiliveagentchallenge</category>
    </item>
  </channel>
</rss>
