Kamaumbugua-dev
I Built an AI That Sees Your Screen and Speaks Your Answers — Here's How

This post was created for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge


The Problem With Typing

Every day we spend hours switching between tabs, typing search queries, copying text, and manually reading through pages trying to find answers. What if you could just look at your screen and ask a question out loud — and get an answer spoken back to you instantly?

That's exactly what I built.

Voice UI Navigator is an AI agent that:

  • 👁️ Sees your browser screen using Gemini multimodal vision
  • 🎙️ Listens to your voice via the Gemini Live API
  • 🔍 Searches Google in real time to research answers
  • 🔊 Speaks results back to you naturally

No typing. No DOM access. No browser extensions. Just pure visual AI understanding — the same way a human would look at a screen.

Live demo: https://voice-navigator-913580598688.us-central1.run.app
GitHub: https://github.com/Kamaumbugua-dev/GEMINI_CODING_CHALLENGE


How It Works

The agent has three core capabilities wired together:

User uploads screenshot + speaks query
              ↓
       ADK Web Server (Cloud Run)
              ↓
    root_agent [gemini-2.0-flash-live-001]
         ↓                    ↓
analyze_screenshot()     google_search()
         ↓                    ↓
  Gemini Vision         Google Search API
  (reads pixels)        (real-time results)
         ↓                    ↓
     Voice response spoken back to user

1. Screen Vision (No DOM Required)

The user takes a screenshot of their browser and attaches it in the chat. The agent calls analyze_screenshot(), which sends the image to gemini-2.0-flash with a structured prompt asking it to identify:

  • Page type and title
  • Visible UI elements (buttons, links, inputs)
  • Main content summary
  • Suggested next actions

The key insight: Gemini doesn't need DOM access to understand a UI. It reads pixels the way a human does — and it's surprisingly accurate.
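To make that concrete, here is a hypothetical example of the kind of structured JSON the tool can return and how the agent might act on it. The field names follow the bullet list above, but they are illustrative; the exact schema is whatever the analysis prompt requests.

```python
import json

# Hypothetical vision-analysis result (field names are illustrative)
raw = """{
  "page_type": "e-commerce product page",
  "title": "Wireless Headphones - Shop",
  "ui_elements": [
    {"type": "button", "label": "Add to Cart"},
    {"type": "link", "label": "Reviews"}
  ],
  "content_summary": "Product page showing wireless headphones with price and reviews.",
  "suggested_actions": ["Click 'Add to Cart'", "Open the Reviews link"]
}"""

analysis = json.loads(raw)
# Pull out just the clickable buttons the user could be guided toward
labels = [e["label"] for e in analysis["ui_elements"] if e["type"] == "button"]
print(labels)  # → ['Add to Cart']
```

Because the result is plain JSON, the live agent can read any of these fields aloud or feed `suggested_actions` straight into its spoken answer.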

2. Real-Time Voice (Gemini Live API)

The agent runs on gemini-2.0-flash-live-001, which supports bidirectional audio streaming. Google's ADK handles the /run_live WebSocket endpoint automatically — users just click the microphone button and start talking. The agent can be interrupted mid-sentence, just like a real conversation.

3. Google Search Grounding

When the user asks about something that needs current information, the ADK google_search tool kicks in — pulling real-time web results and weaving them into the spoken response.


Tech Stack

Component            Technology
-------------------  -------------------------
Agent Framework      Google ADK v1.25.1
Live Voice Model     gemini-2.0-flash-live-001
Vision Model         gemini-2.0-flash
Search               ADK google_search tool
Hosting              Google Cloud Run
Container Registry   Google Artifact Registry
CI/CD                Google Cloud Build
Language             Python 3.11

Building It: The Code

The entire agent lives in two main components.

The Agent (app/agent.py)

from google.adk.agents import Agent
from google.adk.tools import google_search
from google.genai import types

root_agent = Agent(
    name="voice_ui_navigator",
    model="gemini-2.0-flash-live-001",
    description="Voice-powered agent that sees your screen and searches the web.",
    instruction="""You are a Voice UI Navigator.
    When the user shares a screenshot, call analyze_screenshot.
    Use google_search for research questions.
    Always respond conversationally — you are speaking to the user.
    Never access the DOM. Read screens visually only.""",
    tools=[analyze_screenshot, google_search],  # analyze_screenshot is defined later in this file
    generate_content_config=types.GenerateContentConfig(
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name="Puck"
                )
            )
        )
    ),
)

The Vision Tool (analyze_screenshot)

import json
import os

from google.adk.tools import ToolContext
from google.genai import Client, types

# Example structured prompt (illustrative; the real one lives in agent.py).
# It asks Gemini to describe the screen as JSON the agent can reason over.
analysis_prompt = (
    "Describe this screenshot as JSON with keys: page_type, title, "
    "ui_elements, content_summary, suggested_actions."
)

async def analyze_screenshot(tool_context: ToolContext) -> dict:
    # Load the screenshot the user attached in the chat
    screenshot_part = await tool_context.load_artifact("screenshot.png")

    # Fallback: find the most recent image artifact in the session
    if screenshot_part is None:
        artifact_names = await tool_context.list_artifacts()
        image_artifacts = [n for n in artifact_names
                           if n.lower().endswith((".png", ".jpg", ".jpeg"))]
        if image_artifacts:
            screenshot_part = await tool_context.load_artifact(image_artifacts[-1])

    # Nothing to analyze: tell the agent so it can ask the user for a screenshot
    if screenshot_part is None:
        return {"error": "No screenshot found. Ask the user to attach one."}

    # Send image + structured prompt to Gemini vision
    client = Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[types.Content(role="user", parts=[
            screenshot_part,
            types.Part.from_text(text=analysis_prompt),
        ])],
    )
    return json.loads(response.text)

The trick here is ADK's artifact system. When a user attaches a file in the ADK web UI, it's automatically stored as a session artifact. The tool retrieves it with tool_context.load_artifact() — no custom file upload endpoint needed.
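One defensive detail worth adding (an assumption based on common Gemini behavior, not something the original code handles): the model sometimes wraps its JSON output in markdown fences, which would make a bare `json.loads(response.text)` throw. A small helper guards against that:

```python
import json

def parse_model_json(text: str) -> dict:
    # Strip optional ```json ... ``` fences before parsing
    cleaned = text.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1]      # drop the opening fence line
        cleaned = cleaned.rsplit("```", 1)[0]    # drop the closing fence
    return json.loads(cleaned)

print(parse_model_json('```json\n{"page_type": "login"}\n```'))  # → {'page_type': 'login'}
```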


Deploying to Google Cloud Run

The entire deployment is containerized with Docker and deployed to Cloud Run.

Dockerfile

FROM python:3.11-slim
WORKDIR /workspace
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app/ ./app/
EXPOSE 8080
CMD ["adk", "web", ".", "--host", "0.0.0.0", "--port", "8080"]

Important: Run adk web . from the parent directory of your agent folder — not from inside it. ADK scans for agent packages one level down from where the command runs.
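For reference, a layout like this satisfies that rule (file names beyond app/agent.py and app/.env are assumptions; check the repo for the exact structure):

```
ADK-STREAMING/              # run `adk web .` from here
├── app/                    # the agent package ADK discovers
│   ├── __init__.py
│   ├── agent.py            # root_agent + all tools (no subfolders!)
│   └── .env                # GEMINI_API_KEY, GOOGLE_GENAI_USE_VERTEXAI
├── requirements.txt
└── Dockerfile
```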

Build and Deploy

# Build and push to Artifact Registry
gcloud builds submit \
  --tag us-central1-docker.pkg.dev/YOUR_PROJECT/voice-navigator-repo/voice-navigator

# Deploy to Cloud Run
gcloud run deploy voice-navigator \
  --image us-central1-docker.pkg.dev/YOUR_PROJECT/voice-navigator-repo/voice-navigator \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --port 8080 \
  --set-env-vars "GEMINI_API_KEY=your_key,GOOGLE_GENAI_USE_VERTEXAI=False"

Lessons Learned (The Hard Way)

1. ADK directory structure is strict

ADK's web loader scans ALL subdirectories of the agents folder looking for agent packages. I had a tools/ subfolder inside my app/ agent directory — ADK tried to load it as a separate agent and threw:

No root_agent found for 'tools'.

Fix: Move all tools into the main agent.py file, removing any subdirectories inside the agent package.

2. Not all Gemini models support Live API

I wasted time with gemini-live-2.5-flash-native-audio (doesn't exist), gemini-1.5-flash (no live support), and gemini-2.0-flash (no live support). Only gemini-2.0-flash-live-001 works with ADK's /run_live WebSocket for real-time audio.

3. Cloud Build uses the Compute Engine service account

When gcloud builds submit fails with permission denied on Artifact Registry, the fix is NOT granting the Cloud Build service account — it's granting the Compute Engine default service account:

gcloud artifacts repositories add-iam-policy-binding voice-navigator-repo \
  --location=us-central1 \
  --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role="roles/artifactregistry.writer"

4. gcr.io is deprecated

Google Container Registry (gcr.io) is being replaced by Artifact Registry (pkg.dev). Use Artifact Registry for new projects — gcr.io pushes will fail silently on newer GCP projects.
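Concretely, the image path changes shape, and an Artifact Registry repository has to be created once before the first push (the repo name here matches the deploy commands above):

```
# Old (Container Registry, deprecated):
#   gcr.io/PROJECT_ID/voice-navigator
#
# New (Artifact Registry):
#   us-central1-docker.pkg.dev/PROJECT_ID/voice-navigator-repo/voice-navigator

# One-time repository creation:
gcloud artifacts repositories create voice-navigator-repo \
  --repository-format=docker \
  --location=us-central1
```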

5. Separate the vision call from the live audio session

Initially I tried to have the live model handle both audio streaming AND vision analysis simultaneously. This caused instability. The cleaner pattern: make a separate synchronous gemini-2.0-flash call inside the tool for image analysis, while the live session stays focused on audio I/O.


What Surprised Me About Gemini Vision

I expected to need DOM access or accessibility APIs to understand UI elements. I was wrong.

Given just a raw screenshot, gemini-2.0-flash correctly identified:

  • Button labels and their positions on screen
  • Navigation menus and their items
  • Form fields and their purposes
  • The page's primary content and intent
  • Actionable next steps the user could take

This opens up a genuinely powerful use case: an AI that works on ANY screen — web apps, desktop software, mobile screenshots — without needing any special integration or API access.


What's Next

  • Browser extension — automatically capture screenshots without manual attachment
  • Action execution — integrate with Playwright to actually perform the suggested navigation steps
  • Multi-turn screen memory — remember previous screenshots to understand navigation flow over time
  • Mobile support — accept screenshots from phone cameras for on-device assistance

Try It Yourself

git clone https://github.com/Kamaumbugua-dev/GEMINI_CODING_CHALLENGE.git
cd "GEMINI_CODING_CHALLENGE/ADK-STREAMING"
pip install -r requirements.txt

# Add your Gemini API key to app/.env
echo "GEMINI_API_KEY=your_key_here" > app/.env
echo "GOOGLE_GENAI_USE_VERTEXAI=False" >> app/.env

adk web . --no-reload

Open http://localhost:8000, attach a screenshot, and ask the agent what it sees.

Deployment script: https://github.com/Kamaumbugua-dev/GEMINI_CODING_CHALLENGE/blob/master/ADK-STREAMING/deploy.sh

Cloud Build config: https://github.com/Kamaumbugua-dev/GEMINI_CODING_CHALLENGE/blob/master/ADK-STREAMING/cloudbuild.yaml


Built for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge

If you found this useful, drop a ❤️ and follow for more AI agent builds!
