I Built an AI That Sees Your Screen and Speaks Your Answers — Here's How
This post is my entry for the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
The Problem With Typing
Every day we spend hours switching between tabs, typing search queries, copying text, and manually reading through pages trying to find answers. What if you could just look at your screen and ask a question out loud — and get an answer spoken back to you instantly?
That's exactly what I built.
Voice UI Navigator is an AI agent that:
- 👁️ Sees your browser screen using Gemini multimodal vision
- 🎙️ Listens to your voice via the Gemini Live API
- 🔍 Searches Google in real time to research answers
- 🔊 Speaks results back to you naturally
No typing. No DOM access. No browser extensions. Just pure visual AI understanding — the same way a human would look at a screen.
Live demo: https://voice-navigator-913580598688.us-central1.run.app
GitHub: https://github.com/Kamaumbugua-dev/GEMINI_CODING_CHALLENGE
How It Works
The agent has three core capabilities wired together:
```
User uploads screenshot + speaks query
                  ↓
      ADK Web Server (Cloud Run)
                  ↓
 root_agent [gemini-2.0-flash-live-001]
        ↓                      ↓
analyze_screenshot()     google_search()
        ↓                      ↓
  Gemini Vision         Google Search API
  (reads pixels)       (real-time results)
        ↓                      ↓
  Voice response spoken back to user
```
1. Screen Vision (No DOM Required)
The user takes a screenshot of their browser and attaches it in the chat. The agent calls analyze_screenshot(), which sends the image to gemini-2.0-flash with a structured prompt asking it to identify:
- Page type and title
- Visible UI elements (buttons, links, inputs)
- Main content summary
- Suggested next actions
The key insight: Gemini doesn't need DOM access to understand a UI. It reads pixels the way a human does — and it's surprisingly accurate.
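Because the tool ends by parsing the model's reply as JSON, it helps to pin down the exact shape the prompt asks for. Here is a minimal sketch of that output contract; the field names are illustrative assumptions, not the exact schema from the repo:

```python
import json

# Illustrative shape of the structured analysis the vision prompt requests.
# Field names here are assumptions for the sake of example.
example_analysis = {
    "page_type": "e-commerce product page",
    "title": "Noise-Cancelling Headphones | Acme Store",
    "ui_elements": [
        {"kind": "button", "label": "Add to Cart"},
        {"kind": "link", "label": "Reviews (1,204)"},
        {"kind": "input", "label": "Quantity"},
    ],
    "summary": "A product page showing headphones priced at $199.",
    "next_actions": ["Add the item to the cart", "Open the reviews section"],
}

# Asking the model to reply with exactly this JSON structure makes the
# round-trip through json.loads(response.text) trivial:
serialized = json.dumps(example_analysis)
parsed = json.loads(serialized)
```

Keeping the schema flat and explicit also gives the live model something predictable to narrate back to the user.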
2. Real-Time Voice (Gemini Live API)
The agent runs on gemini-2.0-flash-live-001, which supports bidirectional audio streaming. Google's ADK handles the /run_live WebSocket endpoint automatically — users just click the microphone button and start talking. The agent can be interrupted mid-sentence, just like a real conversation.
3. Google Search Grounding
When the user asks about something that needs current information, the ADK google_search tool kicks in — pulling real-time web results and weaving them into the spoken response.
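In ADK, the grounding itself happens inside the model, but the "weaving" step is easy to picture as a small function. This toy sketch uses an invented `{"title", "snippet"}` result shape purely for illustration; it is not how the real `google_search` tool returns data:

```python
def weave_spoken_answer(question: str, results: list[dict]) -> str:
    """Fold a top search snippet into one conversational sentence.

    The result structure here is invented for illustration; the real ADK
    google_search tool grounds the model's own response internally.
    """
    if not results:
        return f"I couldn't find anything current about {question}."
    top = results[0]
    return (f"Here's what I found about {question}: {top['snippet']} "
            f"That's according to {top['title']}.")

answer = weave_spoken_answer(
    "the SpaceX launch today",
    [{"title": "Space News", "snippet": "The launch is scheduled for 6 pm ET."}],
)
```

The point of phrasing results as full sentences rather than lists is that the output is spoken aloud, not rendered.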
Tech Stack
| Component | Technology |
|---|---|
| Agent Framework | Google ADK v1.25.1 |
| Live Voice Model | gemini-2.0-flash-live-001 |
| Vision Model | gemini-2.0-flash |
| Search | ADK google_search tool |
| Hosting | Google Cloud Run |
| Container Registry | Google Artifact Registry |
| CI/CD | Google Cloud Build |
| Language | Python 3.11 |
Building It: The Code
The entire agent lives in two main components.
The Agent (app/agent.py)
```python
from google.adk.agents import Agent
from google.adk.tools import google_search
from google.genai import types

root_agent = Agent(
    name="voice_ui_navigator",
    model="gemini-2.0-flash-live-001",
    description="Voice-powered agent that sees your screen and searches the web.",
    instruction="""You are a Voice UI Navigator.
When the user shares a screenshot, call analyze_screenshot.
Use google_search for research questions.
Always respond conversationally — you are speaking to the user.
Never access the DOM. Read screens visually only.""",
    tools=[analyze_screenshot, google_search],
    generate_content_config=types.GenerateContentConfig(
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name="Puck"
                )
            )
        )
    ),
)
```
The Vision Tool (analyze_screenshot)
```python
import json
import os

from google.adk.tools import ToolContext
from google.genai import Client, types

async def analyze_screenshot(tool_context: ToolContext) -> dict:
    # Load the screenshot the user attached in the chat
    screenshot_part = await tool_context.load_artifact("screenshot.png")

    # Fallback: find the most recent image artifact in the session
    if screenshot_part is None:
        artifact_names = await tool_context.list_artifacts()
        image_artifacts = [n for n in artifact_names
                           if n.lower().endswith((".png", ".jpg"))]
        if image_artifacts:
            screenshot_part = await tool_context.load_artifact(image_artifacts[-1])

    if screenshot_part is None:
        return {"error": "No screenshot found. Ask the user to attach one."}

    # Send image + structured prompt to Gemini vision.
    # analysis_prompt is the structured prompt string defined earlier in agent.py.
    client = Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[types.Content(role="user", parts=[
            screenshot_part,
            types.Part.from_text(text=analysis_prompt),
        ])]
    )
    return json.loads(response.text)
```
The trick here is ADK's artifact system. When a user attaches a file in the ADK web UI, it's automatically stored as a session artifact. The tool retrieves it with tool_context.load_artifact() — no custom file upload endpoint needed.
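The fallback logic in the tool reduces to a small pure function, which makes it easy to unit-test on its own. The helper name below is mine, not from the repo:

```python
from typing import Optional

def pick_latest_image(artifact_names: list) -> Optional[str]:
    """Choose which session artifact to analyze.

    Mirrors the fallback in analyze_screenshot: prefer the canonical
    "screenshot.png" name, otherwise take the most recent image-like artifact.
    """
    if "screenshot.png" in artifact_names:
        return "screenshot.png"
    images = [n for n in artifact_names
              if n.lower().endswith((".png", ".jpg"))]
    return images[-1] if images else None
```

Taking the last match works because ADK lists artifacts in the order they were stored, so the most recent attachment wins.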
Deploying to Google Cloud Run
The entire deployment is containerized with Docker and deployed to Cloud Run.
Dockerfile
```dockerfile
FROM python:3.11-slim
WORKDIR /workspace
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ ./app/
EXPOSE 8080
CMD ["adk", "web", ".", "--host", "0.0.0.0", "--port", "8080"]
```
Important: Run adk web . from the parent directory of your agent folder — not from inside it. ADK scans for agent packages one level down from where the command runs.
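Concretely, the layout ADK expects looks like this (`workspace/` is whichever directory you run the command from; the `app/` name matches this project):

```
workspace/               # run `adk web .` here
└── app/                 # the agent package ADK discovers
    ├── __init__.py
    ├── agent.py         # defines root_agent (tools live here too)
    └── .env             # GEMINI_API_KEY, GOOGLE_GENAI_USE_VERTEXAI
```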
Build and Deploy
```shell
# Build and push to Artifact Registry
gcloud builds submit \
  --tag us-central1-docker.pkg.dev/YOUR_PROJECT/voice-navigator-repo/voice-navigator

# Deploy to Cloud Run
gcloud run deploy voice-navigator \
  --image us-central1-docker.pkg.dev/YOUR_PROJECT/voice-navigator-repo/voice-navigator \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --port 8080 \
  --set-env-vars "GEMINI_API_KEY=your_key,GOOGLE_GENAI_USE_VERTEXAI=False"
```
For anything beyond a demo, prefer passing the API key via Secret Manager (`--set-secrets`) instead of a plain environment variable.
Lessons Learned (The Hard Way)
1. ADK directory structure is strict
ADK's web loader scans ALL subdirectories of the agents folder looking for agent packages. I had a tools/ subfolder inside my app/ agent directory — ADK tried to load it as a separate agent and threw:
```
No root_agent found for 'tools'.
```
Fix: Move all tools into the main agent.py file, removing any subdirectories inside the agent package.
2. Not all Gemini models support Live API
I wasted time with gemini-live-2.5-flash-native-audio (doesn't exist), gemini-1.5-flash (no live support), and gemini-2.0-flash (no live support). Only gemini-2.0-flash-live-001 works with ADK's /run_live WebSocket for real-time audio.
3. Cloud Build uses the Compute Engine service account
When gcloud builds submit fails with permission denied on Artifact Registry, granting roles to the legacy Cloud Build service account may not help: on newer projects, builds run as the Compute Engine default service account, so that is the account that needs write access:
```shell
gcloud artifacts repositories add-iam-policy-binding voice-navigator-repo \
  --location=us-central1 \
  --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role="roles/artifactregistry.writer"
```
4. gcr.io is deprecated
Google Container Registry (gcr.io) has been replaced by Artifact Registry (pkg.dev). Use Artifact Registry for new projects — on newer GCP projects, pushes to gcr.io may simply fail.
5. Separate the vision call from the live audio session
Initially I tried to have the live model handle both audio streaming AND vision analysis simultaneously. This caused instability. The cleaner pattern: make a separate synchronous gemini-2.0-flash call inside the tool for image analysis, while the live session stays focused on audio I/O.
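One way to keep a blocking vision request from stalling the audio loop is to push it onto a worker thread with stdlib asyncio. This is a generic sketch of the pattern, not ADK code; `slow_vision_call` is a stand-in for the synchronous `generate_content` request:

```python
import asyncio
import time

def slow_vision_call(image_name: str) -> dict:
    """Stand-in for the synchronous gemini-2.0-flash vision request."""
    time.sleep(0.2)  # simulate network latency
    return {"image": image_name, "summary": "a settings page"}

async def analyze_without_blocking(image_name: str) -> dict:
    # asyncio.to_thread runs the blocking call on a worker thread, so the
    # event loop (and the live audio stream it drives) keeps servicing I/O.
    return await asyncio.to_thread(slow_vision_call, image_name)

result = asyncio.run(analyze_without_blocking("screenshot.png"))
```

In ADK, async tools already run inside the agent's event loop, so keeping slow synchronous work off that loop is what keeps the voice session responsive.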
What Surprised Me About Gemini Vision
I expected to need DOM access or accessibility APIs to understand UI elements. I was wrong.
Given just a raw screenshot, gemini-2.0-flash correctly identified:
- Button labels and their positions on screen
- Navigation menus and their items
- Form fields and their purposes
- The page's primary content and intent
- Actionable next steps the user could take
This opens up a genuinely powerful use case: an AI that works on ANY screen — web apps, desktop software, mobile screenshots — without needing any special integration or API access.
What's Next
- Browser extension — automatically capture screenshots without manual attachment
- Action execution — integrate with Playwright to actually perform the suggested navigation steps
- Multi-turn screen memory — remember previous screenshots to understand navigation flow over time
- Mobile support — accept screenshots from phone cameras for on-device assistance
Try It Yourself
```shell
git clone https://github.com/Kamaumbugua-dev/GEMINI_CODING_CHALLENGE.git
cd "GEMINI_CODING_CHALLENGE/ADK-STREAMING"
pip install -r requirements.txt

# Add your Gemini API key to app/.env
echo "GEMINI_API_KEY=your_key_here" > app/.env
echo "GOOGLE_GENAI_USE_VERTEXAI=False" >> app/.env

adk web . --no-reload
```
Open http://localhost:8000, attach a screenshot, and ask the agent what it sees.
Resources
- Deploy script: https://github.com/Kamaumbugua-dev/GEMINI_CODING_CHALLENGE/blob/master/ADK-STREAMING/deploy.sh
- Cloud Build config: https://github.com/Kamaumbugua-dev/GEMINI_CODING_CHALLENGE/blob/master/ADK-STREAMING/cloudbuild.yaml
Built for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge
If you found this useful, drop a ❤️ and follow for more AI agent builds!