I wrote this post as my submission to the Gemini Live Agent Challenge hackathon.
It's currently 1:00 AM in Dhaka. My terminal is a wall of green and red logs, my coffee is cold, and I am about to submit my project for the Google Gemini Live Agent Challenge.
Over the last few days, I've been building IAN (Intelligent Accessibility Navigator). It's a multimodal AI Agent designed to browse the internet for you using just your voice.
If you are breaking into tech, or if you are one of the hackathon judges reading this, I want to take you behind the scenes of how I built this, the late-night architecture pivots, and how I managed to stop my headless browsers from crashing my server.
The Broken Web: Why We Need a New Approach to Web Accessibility ♿
If you have ever tried using a traditional screen reader on a modern e-commerce site, you know it's a nightmare.
Traditional screen readers rely entirely on parsing the Document Object Model (DOM). But today's web is incredibly messy: it's filled with dynamically injected `<div>` tags masquerading as buttons, missing ARIA labels, and complex Single Page Applications (SPAs). When a visually impaired user tries to navigate a notoriously clunky real-world website, the screen reader just yells a wall of meaningless code at them.

Web accessibility shouldn't rely on perfect HTML. I realized something: humans don't parse the DOM to use a website. We just look at the screen. With the new multimodal Gemini models, I wondered: could I build an AI that does exactly the same thing?
Enter IAN: A Voice-Controlled AI Agent 🎙️
IAN is a Next-Gen Agent that acts as a digital equalizer. Instead of navigating with a keyboard or relying on HTML tags, you use a high-contrast, Neo-Brutalist React dashboard. You hold down a button, speak naturally ("Hey, go to Amazon and search for running shoes"), and the AI physically takes over the browser for you.
To build this at "startup speed," I leaned heavily on the Google Agent Development Kit (ADK) and some rapid "vibe coding" for the frontend using Google's experimental agentic IDEs.
The Dual-Model Architecture (And Bypassing the DOM) 🧠
Building AI Agents that run in real-time is tricky. If the agent is busy taking a screenshot, it can't listen to your voice. To fix this, I split IAN's brain into two distinct pieces.

1. Voice Intent (Gemini 2.5 Native Audio)
When you speak into the React frontend, the raw audio is streamed via WebSockets to my backend. Using the ADK's InMemorySessionService, I pass this audio directly into gemini-2.5-flash-native-audio.
This model is incredibly fast. It acts as my "Audio Orchestrator." Its only job is to perform Voice Activity Detection (VAD), figure out what the user wants, and output a strict, silent tag like: [NAVIGATE: search amazon for shoes].
2. Visual Action (Playwright Automation)
Once my backend intercepts that [NAVIGATE] tag, it spins up a background thread running Playwright automation.
A headless Chromium browser opens the website, takes a screenshot, and sends it to gemini-2.5-flash (the vision model). The AI looks at the pixels, calculates the exact (X, Y) coordinates of the search bar, and tells Playwright to physically click and type. It completely ignores the messy DOM!
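Gemini's vision models typically report positions on a normalized 0–1000 grid, so the glue between "the model says the search bar is here" and "Playwright clicks here" is a small coordinate translation. A minimal sketch, assuming the 0–1000 convention and a known viewport size (the function name is mine, not IAN's):

```python
def to_pixels(norm_x: float, norm_y: float,
              viewport_w: int, viewport_h: int) -> tuple[int, int]:
    """Map a point on a 0-1000 normalized grid to viewport pixel coordinates."""
    return round(norm_x / 1000 * viewport_w), round(norm_y / 1000 * viewport_h)

# A 1280x720 screenshot: the vision model places the search bar at (500, 80).
x, y = to_pixels(500, 80, 1280, 720)
print(x, y)  # → 640 58

# Playwright then clicks those raw pixels, DOM-free:
#   await page.mouse.click(x, y)
```

Because the click targets pixels rather than DOM nodes, it works identically on a semantic `<button>` and on a styled `<div>` pretending to be one.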
Surviving the Hackathon: Beating Rate Limits & Bugs 🐛
Of course, it wasn't all smooth sailing. Here are the two massive roadblocks I hit and how I solved them:
Taming the "Concurrency Explosion" on Google Cloud Run ☁️
I deployed my FastAPI backend to Google Cloud Run to ensure it could scale. However, headless Playwright browsers consume a lot of memory.
During testing, if I gave the agent two commands too quickly, it would spawn multiple browsers, eat all my RAM, and instantly hit Gemini's 429 RESOURCE_EXHAUSTED API rate limit.
The Fix: I implemented a strict asyncio.Lock() in Python to prevent ghost-browsers from spawning. I also built a 3.5-second "pacemaker" into the visual reasoning loop, throttling the agent to one action every 3.5 seconds (roughly 17 per minute) and keeping me safely within the free-tier API quotas without dropping the WebSocket connection!
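A minimal version of that lock-plus-pacemaker pattern might look like the following. The class and interval are illustrative; IAN's real loop also streams results back over the WebSocket:

```python
import asyncio
import time

ACTION_INTERVAL = 3.5  # seconds between visual-reasoning calls

class Pacemaker:
    """Serialize browser actions and space them out to respect API quotas."""

    def __init__(self, interval: float = ACTION_INTERVAL):
        self._lock = asyncio.Lock()   # one browser action at a time
        self._interval = interval
        self._last = 0.0

    async def run(self, action):
        async with self._lock:        # prevents ghost-browser spawns
            wait = self._interval - (time.monotonic() - self._last)
            if wait > 0:
                await asyncio.sleep(wait)  # pace calls under the rate limit
            try:
                return await action()
            finally:
                self._last = time.monotonic()
```

Wrapping every screenshot-and-click cycle in `Pacemaker.run(...)` guarantees at most one Gemini call per interval, no matter how fast voice commands arrive.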
The Float32 Audio Nightmare 🎧
Browsers capture microphone audio in Float32 format at 44.1kHz. But the Gemini Native Audio model strictly requires 16kHz, 16-bit PCM audio. For hours, the model heard only static and hallucinated accordingly.
I had to dive deep into the Web Audio API and write a custom JavaScript processor to manually downsample and clamp the audio chunks on the fly before sending them over the WebSocket.
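The conversion itself is conceptually simple: resample the 44.1 kHz Float32 stream down to 16 kHz, then clamp each sample to [-1, 1] and scale it to a signed 16-bit integer. Here is the same math sketched in Python with naive nearest-sample decimation (the browser version lives in a Web Audio API processor in JavaScript; the function name and decimation strategy here are my own simplifications):

```python
import struct

def downsample_to_pcm16(samples: list[float],
                        src_rate: int = 44100,
                        dst_rate: int = 16000) -> bytes:
    """Resample Float32 audio and encode it as 16-bit little-endian PCM."""
    ratio = src_rate / dst_rate
    out = []
    i = 0.0
    while i < len(samples):
        s = samples[int(i)]
        s = max(-1.0, min(1.0, s))  # clamp: out-of-range floats become noise
        out.append(int(s * 32767))  # scale to the int16 range
        i += ratio                  # naive nearest-sample decimation
    return struct.pack(f"<{len(out)}h", *out)
```

A production resampler would low-pass filter before decimating to avoid aliasing, but even this crude version turns the model's "static" back into intelligible speech data.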
Advice for Developers Breaking Into Tech
If you are just starting out and looking at complex architectures like this, don't be intimidated.
A week ago, I had never used the Google ADK, and my WebSockets kept crashing. Hackathons are the ultimate forcing function. You learn by breaking things, reading the docs, and fixing them at 2:00 AM. If you want to break into tech, stop doing tutorials and go enter a hackathon. The pressure makes you a better developer.
The Final Demo & What's Next 🔮
This hackathon was an incredible experience, but it's just a proof of concept. The next step is migrating this logic from a Cloud Run proxy into a local Chrome Extension, allowing IAN to drive your actual local browser securely.
The era of struggling with inaccessible HTML is over. If you can see it on the screen, AI can click it for you.
📺 Watch the live demo here: YouTube Demo Link
💻 Check out the Open-Source Code: GitHub Repository
Wish me luck with the judges!