Sherin Joseph Roy

Building a Voice-Controlled Browser Agent with Three Gemini Models

This post was written as part of my submission to the Gemini Live Agent Challenge hackathon on Devpost. #GeminiLiveAgentChallenge

The Problem That Started This

My grandmother owns a smartphone. She has a broadband connection. She cannot book a train ticket online.

This is not a technology access problem. She has the hardware and the connectivity. What she cannot do is navigate a website. She does not understand dropdown menus. She cannot read small text on form labels. She does not know what "Enter OTP" means. When something goes wrong, she sees an error message she cannot parse and hands the phone to someone younger.

She is not alone. According to government data, 85% of India's elderly population cannot independently use digital services. Over 900 million people globally are in the same situation. India moved pensions online, digitized Aadhaar, made train booking web-only, and shifted bill payments to portals. The interfaces got built. The people who need these services the most got left behind.

I wanted to build something that removes the interface from the equation entirely. Not a simpler interface. Not a tutorial. A system where the user speaks what they need and the computer handles the rest.

That project became SAHAY.

What SAHAY Does

SAHAY listens to the user in their language, any of 24 supported languages, including Hindi, Malayalam, Tamil, Telugu, and English. It opens a real Chromium browser, finds the correct website, navigates through the pages, fills forms, clicks buttons, and speaks the results back.

The user says "Amazon par earbuds dikhao 1000 rupaye se kam" and SAHAY opens Amazon with a price-filtered search, reads the results, and reports the top options with prices. In Hindi. Because that is the language the user spoke.

The user says "Download my Aadhaar card" and SAHAY navigates to the UIDAI portal, asks for the Aadhaar number, repeats it back digit by digit for confirmation, enters it, and proceeds through the OTP flow.

Before any login, payment, or form submission, SAHAY stops and asks for permission. The user confirms by voice or by clicking a button. For passwords and CAPTCHAs, the user can take direct control by clicking on the browser screen.

The Three-Agent Architecture

SAHAY runs three separate Gemini agents that coordinate to complete each task.

Agent 1: The Planner

The Planner runs on Gemini 2.5 Flash with Google Search grounding through the GenAI SDK. When the user describes what they want, the Planner searches the internet in real time to find the correct website. It does not use hardcoded URLs. It does not rely on a static list of known portals. It searches, reads the results, and identifies the right destination.

This matters because websites change. The UIDAI download page moved URLs twice in the past year. Government portals restructure their navigation without warning. A hardcoded URL from last month might 404 today. The Planner always researches the current state before creating a plan.

After finding the target, the Planner creates a structured execution plan with step-by-step instructions, visual descriptions of what to look for on each page, and flags for which steps involve sensitive data.

The Planner is a separate ADK agent because Google Search grounding cannot be combined with other tools in the same agent. This constraint shaped the architecture. It turned out to be the right design anyway because it cleanly separates research from execution.

from google import genai
from google.genai import types

# Planner: Gemini 2.5 Flash with Google Search grounding, via Vertex AI
client = genai.Client(vertexai=True)
response = await client.aio.models.generate_content(
    model="gemini-2.5-flash",
    contents=task_prompt,
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
        temperature=0.2,  # keep plans deterministic
    ),
)

Agent 2: The Browser

The Browser Agent runs on Gemini's Computer Use model (gemini-2.5-computer-use-preview-10-2025). It receives the plan from the Planner and executes it step by step.

The execution loop works like this:

  1. Take a screenshot of the current browser state via Playwright
  2. Send the screenshot to the Computer Use model along with the current plan step
  3. The model analyzes the screenshot visually and returns coordinates for where to click or what to type
  4. Playwright executes the action against the real browser
  5. Take a new screenshot
  6. Repeat until the task is complete

The Browser Agent does not read HTML. It does not use CSS selectors for understanding page layout. It does not call any website APIs. It looks at the screenshot the same way a person would look at a screen and decides what to do next. This means it works on any website without site-specific configuration.

The Computer Use model outputs normalized coordinates (0 to 999). SAHAY converts these to actual pixel positions on the 1440x900 viewport:

# Map the model's normalized (0-999) coordinates onto the 1440x900 viewport
actual_x = int(normalized_x / 1000 * 1440)
actual_y = int(normalized_y / 1000 * 900)
await page.mouse.click(actual_x, actual_y)

The Browser Agent is wrapped in ADK's ComputerUseToolset, which manages the screenshot-action loop and handles the coordinate conversion.

Agent 3: The Voice

The Voice Agent runs on Gemini 2.5 Flash Native Audio through the Live API. It handles all communication with the user through bidirectional audio streaming.

The user's microphone audio is captured by the browser using AudioWorklet, converted to PCM 16-bit 16kHz mono, and streamed over a WebSocket to the FastAPI backend. The backend pipes this audio into the Live API session. When Gemini responds with audio, it streams back through the same WebSocket to the browser for playback.
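The PCM format itself is simple: each float sample in [-1, 1] becomes a little-endian signed 16-bit integer. Here is the same conversion the AudioWorklet performs, sketched in Python for illustration (the real conversion runs in the browser in JavaScript):

```python
import struct

def float_to_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit little-endian PCM
    bytes, the format the Live API expects for 16 kHz mono input."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clamp to the valid range
        ints.append(int(s * 32767))  # scale to the int16 range
    return struct.pack(f"<{len(ints)}h", *ints)
```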

The Voice Agent automatically detects the user's language and responds in the same language. If the user starts in English and switches to Hindi mid-sentence, the response comes back in Hindi. No language selection menu. No configuration.

For sensitive inputs like Aadhaar numbers and phone numbers, the Voice Agent repeats back what it heard and waits for explicit confirmation before passing the data to the Browser Agent. This prevents the misheard-digit problem that would otherwise cause the entire flow to fail silently.
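The read-back step itself is mechanical: spell the number digit by digit so a single misheard digit stands out. A minimal sketch (the function name is illustrative, not from the SAHAY codebase):

```python
def spell_digits(number: str) -> str:
    """Render a numeric string digit by digit for voice read-back,
    grouping in fours so a 12-digit Aadhaar number reads naturally."""
    digits = [c for c in number if c.isdigit()]
    groups = [" ".join(digits[i:i + 4]) for i in range(0, len(digits), 4)]
    return ", ".join(groups)
```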

from google.adk.agents import Agent

voice_agent = Agent(
    name="sahay_voice_agent",
    model="gemini-live-2.5-flash-native-audio",
    instruction=VOICE_INSTRUCTION,
    tools=[plan_task, browser_action, stop_task, rollback],
)

Google Cloud Services

SAHAY uses three Google Cloud services in production.

Vertex AI hosts all three Gemini model endpoints. The Voice Agent connects to the Live API through Vertex AI. The Planner Agent calls Gemini Flash with Google Search grounding through Vertex AI. The Computer Use model calls go through the Gemini API directly using an API key.

Cloud Firestore stores task logs, session state, and workflow recordings. Every task gets a document with the task description, each step taken, screenshots at key moments, the final outcome, and timestamps. This serves as an audit trail and also powers the workflow replay feature where repeated tasks execute faster by following a previously recorded path.
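The per-task document can be assembled as a plain dict before the Firestore write. A sketch of the shape (the field names are illustrative, not SAHAY's actual schema; the write itself would go through `google-cloud-firestore`'s `collection(...).document(...).set(...)`):

```python
from datetime import datetime, timezone

def build_task_log(task_id, description, steps, outcome):
    """Assemble the audit-trail document for one task: the description,
    every step taken, and the final outcome, all timestamped."""
    return {
        "task_id": task_id,
        "description": description,
        "steps": [
            {"index": i, "action": s["action"], "screenshot": s.get("screenshot")}
            for i, s in enumerate(steps)
        ],
        "outcome": outcome,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```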

Cloud Run hosts the containerized application. The Dockerfile installs Playwright and Chromium inside the container, so the browser automation works in the cloud environment. The deploy script and Terraform configuration automate the entire deployment process.

What Made This Hard

Google CAPTCHA. The first version of SAHAY used headless Chromium to search Google directly. After a few searches, Google would show a CAPTCHA and the agent would get stuck on the verification page, clicking randomly and wasting steps. Moving search to the Planner Agent via Google Search grounding API eliminated this problem completely. The browser never touches Google Search anymore.

Bot detection. IRCTC, MakeMyTrip, and several banking portals detect Playwright and refuse to load the page. Stealth flags and spoofed user agents helped with some sites but not all. The solution was building a smart browser selection system. SAHAY analyzes the target URL and task description and decides whether to use headless Chromium (fast, works for most sites) or a headed browser window (slower, but bypasses bot detection on protected sites). This decision happens automatically per task.
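The selection logic can be as simple as a keyword heuristic over the target URL and task text. A sketch (the domain list is illustrative, not SAHAY's actual rules):

```python
# Sites known to block headless Playwright; illustrative, not exhaustive
PROTECTED_DOMAINS = ("irctc.co.in", "makemytrip.com")

def choose_browser(url: str, task: str) -> str:
    """Pick 'headed' for bot-protected sites, 'headless' otherwise."""
    target = (url + " " + task).lower()
    if any(domain in target for domain in PROTECTED_DOMAINS):
        return "headed"   # visible window: slower, but evades bot detection
    return "headless"     # fast default that works for most sites
```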

The Computer Use model finishing early. The ADK runner's run_async() generator exits when the model returns a text response without a function call. The model would sometimes describe what it sees on screen instead of clicking on it, which would end the task prematurely after two or three steps. The fix was a continuation loop that detects when the model exits without reporting completion, re-prompts it with "You have not finished the task. Take an action.", and resumes execution. This loop runs up to three times before giving up.

Voice number accuracy. The Live API voice model occasionally mishears digits. "9895" becomes "9985". For an Aadhaar number, a single wrong digit means the download fails and the user does not understand why. The repeat-back-and-confirm pattern solved this. It adds a few seconds to each interaction but prevents silent failures that would destroy user trust.

The Stack

Agent framework: Google ADK
Voice model: Gemini 2.5 Flash Native Audio (Live API)
Browser model: Gemini 2.5 Computer Use
Planner model: Gemini 2.5 Flash + Google Search
Browser automation: Playwright (Chromium)
Backend: FastAPI + WebSocket
Frontend: Vanilla JavaScript
Database: Google Cloud Firestore
Hosting: Google Cloud Run
IaC: Terraform

What I Would Do Differently

The visual-only approach is the right architectural choice for universality but the wrong choice for speed. Every action requires a full screenshot capture, a round trip to the Gemini API, and coordinate parsing. A hybrid approach that uses visual understanding for navigation decisions but DOM selectors for precise form filling would be significantly faster.

The three-agent architecture introduces latency at the boundaries. The Planner takes 5 to 15 seconds to research and produce a plan. During this time, the browser sits idle and the user waits in silence. Pre-fetching the target URL while the Planner is still working would cut perceived latency in half.
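The prefetch idea amounts to starting navigation concurrently with planning, for example with `asyncio.gather` (a sketch; `plan_task` and `prefetch_url` are hypothetical callables, not existing SAHAY functions):

```python
import asyncio

async def plan_and_prefetch(plan_task, prefetch_url, guessed_url):
    """Run planning and a speculative page load concurrently, so the
    browser is already warm when the plan arrives."""
    plan, _ = await asyncio.gather(
        plan_task(),                # 5 to 15 seconds of research
        prefetch_url(guessed_url),  # navigate while the Planner works
    )
    return plan
```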

The continuation loop is a workaround for a fundamental issue with how the Computer Use model signals task completion. A better approach would be fine-tuning the model prompt so it always ends with either a function call or an explicit completion message, never a bare text description.

Source Code

The full source code is available at:
github.com/Sherin-SEF-AI/Sahay-Voice-First-Digital-Navigator

Built by Sherin Joseph Roy, Head of Products at DeepMost AI.

This project was built for the Gemini Live Agent Challenge hackathon, UI Navigator track.

