Yeshwanth TS
Building Visio — A Real-Time AI Accessibility Agent with Gemini and Google ADK

This blog post was created for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge


The Problem

An estimated 285 million people worldwide live with visual impairment. Navigating everyday environments — stairs, parked vehicles, approaching people, unmarked curbs — requires constant assistance. Existing solutions are either passive (a white cane detects obstacles at arm's length) or delayed (photo-based apps require stopping and waiting for a response).

I wanted to build something that works in real-time — like having a friend walking beside you, continuously watching and speaking.

What I Built

Visio is a real-time AI accessibility agent. Point your phone's rear camera forward while walking, and Visio continuously narrates your surroundings through your headphones:

  • "Motorcycle ahead on your right, move left to pass"
  • "You're past it. Pole ahead on your left, step around right"
  • "Person in a blue jacket walking toward you"
  • "Two steps down ahead, slow down"

It has three modes — Navigation (hazard detection), Reading (text/signs), and Exploration (scene descriptions) — plus Emergency SOS with GPS, spatial audio, and haptic feedback.

Live demo: visio-agent-kiofaqcoyq-uc.a.run.app

The Tech Stack

Google ADK + Gemini 2.5 Flash

The core of Visio is Google ADK (Agent Development Kit) running Gemini 2.5 Flash with native bidirectional audio streaming. This was the key decision — traditional request/response AI has 2-5 second latency. With BIDI streaming, Visio sees and speaks simultaneously with sub-second response times.

from google.adk.agents.run_config import RunConfig, StreamingMode
from google.genai import types

run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,   # bidirectional audio streaming
    response_modalities=["AUDIO"],       # speak, don't type
    proactivity=types.ProactivityConfig(proactive_audio=True),
    context_window_compression=types.ContextWindowCompressionConfig(
        trigger_tokens=100000,
        sliding_window=types.SlidingWindow(target_tokens=80000),
    ),
)

The proactive_audio=True flag is what makes Visio speak without being asked — essential for a navigation agent where the user can't see the screen to trigger responses.

Server-Side Intelligence

The model alone can't maintain reliable proactivity. I built several server-side systems to bridge the gaps:

Obstacle Memory — The server tracks what hazards the model has reported. When the model says "clear" too soon after detecting obstacles, the server injects a [SCAN AHEAD] prompt forcing it to check what's next.

Silence Monitor — If the model goes quiet for 7+ seconds while the user is walking, the server nudges it with a [HEARTBEAT] prompt.

Turn Re-scan — Gyroscope data from the phone detects when the user changes direction. The server immediately injects a [DIRECTION CHANGE] prompt so the model re-scans the new field of view.

Walking Updates — Every 5 seconds while the user is moving, the server prompts the model to scan for new obstacles since the last report.

Adaptive Frame Rate

Sending camera frames at a constant rate wastes tokens and money. I used the phone's accelerometer for step detection and speed estimation:

  • Stationary: 0.5 FPS (save tokens)
  • Slow walk: 1.3 FPS
  • Normal walk: 2 FPS
  • Running: 2.5 FPS

Combined with frame-diff analysis that skips unchanged frames, this cut token usage by roughly 60% with no reduction in safety.
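A sketch of both mechanisms, assuming the tiers above; the speed thresholds and the frame-diff cutoff are my guesses, not Visio's actual values:

```python
import numpy as np

def target_fps(speed_mps: float) -> float:
    """Map estimated walking speed (m/s, from accelerometer step
    detection) to a camera frame rate tier."""
    if speed_mps < 0.2:   # effectively stationary
        return 0.5
    if speed_mps < 1.0:   # slow walk
        return 1.3
    if speed_mps < 2.0:   # normal walk
        return 2.0
    return 2.5            # running

def frame_changed(prev: np.ndarray, curr: np.ndarray,
                  threshold: float = 8.0) -> bool:
    """Skip unchanged frames: send only if the mean absolute pixel
    difference between downscaled grayscale frames exceeds a threshold."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return float(diff.mean()) > threshold
```

A frame is sent only when `frame_changed` is true and the time since the last send satisfies the current `target_fps` interval.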

Client-Side Features

The browser client (vanilla JS) does more than just capture and send:

  • Proximity detection — Edge analysis on the bottom quarter of each frame detects near-ground obstacles
  • Spatial audio — A StereoPannerNode pans the model's voice based on directional keywords ("on your left" plays from the left headphone)
  • Haptic feedback — Different vibration patterns for critical, warning, and info alerts
  • Auto-focus — Switches to near-range focus in reading mode for close-up text
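The proximity detection runs in JavaScript on canvas pixel data, but the idea is easy to show in Python: measure edge density in the bottom quarter of a grayscale frame, where a dense band of gradients suggests something close to the ground. The function name and the gradient threshold here are illustrative assumptions:

```python
import numpy as np

def near_obstacle_score(gray: np.ndarray) -> float:
    """Fraction of strong-gradient pixels in the bottom quarter
    of a grayscale frame (0.0 = flat, higher = likely obstacle)."""
    h = gray.shape[0]
    band = gray[3 * h // 4:].astype(np.float32)   # bottom quarter only
    gx = np.abs(np.diff(band, axis=1))            # horizontal gradient
    gy = np.abs(np.diff(band, axis=0))            # vertical gradient
    # Count pixels whose gradient exceeds an arbitrary cutoff of 30.
    return ((gx > 30).mean() + (gy > 30).mean()) / 2.0
```

Running this locally on every frame lets the client raise an immediate haptic alert without waiting for a model round-trip.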

Google Cloud Services

Service            Purpose
Gemini 2.5 Flash   Real-time multimodal AI with native audio
Google ADK         Agent framework with BIDI streaming
Google Search      Grounding for landmarks and brands
Cloud Run          Serverless container hosting
Cloud Build        Automated Docker builds
Cloud Logging      Structured session logs
Firestore          Session analytics

Deployment is a single command: ./deploy.sh PROJECT_ID handles Cloud Build, Container Registry, and Cloud Run deployment automatically.

The Biggest Challenge: Obstacle Chaining

The hardest problem was what I call "obstacle amnesia." The model would warn about a parked bike, the user would pass it, and then... silence. The post 3 meters ahead? Not mentioned.

The solution was two-fold:

  1. System prompt architecture — A dedicated "Obstacle Chaining" section that explicitly instructs: after clearing ANY obstacle, immediately scan for the NEXT one. Never go silent after "you're past it."

  2. Server-side scan-ahead prompts — When the obstacle memory detects the model said "clear" or "you're past it," it injects: [SCAN AHEAD] You just cleared {obstacle}. Scan for the NEXT obstacle. What's ahead NOW?

This simple pattern — prompt engineering + server-side reinforcement — made the difference between a demo that works sometimes and an agent that reliably chains obstacle-to-obstacle without gaps.

What I Learned

  1. Proactive audio needs scaffolding. Gemini's proactive audio is powerful but designed for conversation. For continuous narration, server-side prompt injection is essential.

  2. LLMs don't persist state. Obstacle memory, silence monitoring, turn detection — all of this must live on the server because the model can't remember across turns.

  3. System prompt architecture > model parameters. The 15-module system prompt (obstacle chaining, people awareness, priority tiers, surface hazards) determines quality more than any configuration.

  4. Build for the user who can't see the screen. Every design decision — spatial audio, haptic patterns, voice commands — had to work without visual feedback.

Try It

Open the live URL on your phone, grant camera and microphone access, put on headphones, and walk around. Visio will start talking.


Built for the Gemini Live Agent Challenge — Live Agents category. #GeminiLiveAgentChallenge
