This blog post was created for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
The Problem
285 million people worldwide are visually impaired. Navigating everyday environments — stairs, parked vehicles, approaching people, unmarked curbs — requires constant assistance. Existing solutions are either passive (a white cane detects obstacles at arm's length) or delayed (photo-based apps require stopping and waiting for a response).
I wanted to build something that works in real-time — like having a friend walking beside you, continuously watching and speaking.
What I Built
Visio is a real-time AI accessibility agent. Point your phone's rear camera forward while walking, and Visio continuously narrates your surroundings through your headphones:
- "Motorcycle ahead on your right, move left to pass"
- "You're past it. Pole ahead on your left, step around right"
- "Person in a blue jacket walking toward you"
- "Two steps down ahead, slow down"
It has three modes — Navigation (hazard detection), Reading (text/signs), and Exploration (scene descriptions) — plus Emergency SOS with GPS, spatial audio, and haptic feedback.
Live demo: visio-agent-kiofaqcoyq-uc.a.run.app
The Tech Stack
Google ADK + Gemini 2.5 Flash
The core of Visio is Google ADK (Agent Development Kit) running Gemini 2.5 Flash with native bidirectional audio streaming. This was the key decision — traditional request/response AI has 2-5 second latency. With BIDI streaming, Visio sees and speaks simultaneously with sub-second response times.
```python
# Imports per recent ADK releases; exact module paths may vary by version.
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.genai import types

run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,  # bidirectional streaming: frames in, audio out
    response_modalities=["AUDIO"],      # speak, don't type
    # Let the model speak without being prompted
    proactivity=types.ProactivityConfig(proactive_audio=True),
    # Compress old context so long walking sessions don't exhaust the window
    context_window_compression=types.ContextWindowCompressionConfig(
        trigger_tokens=100000,
        sliding_window=types.SlidingWindow(target_tokens=80000),
    ),
)
```
The `proactive_audio=True` flag is what makes Visio speak without being asked — essential for a navigation agent where the user can't see the screen to trigger responses.
Server-Side Intelligence
The model alone can't maintain reliable proactivity. I built several server-side systems to bridge the gaps:
Obstacle Memory — The server tracks what hazards the model has reported. When the model says "clear" too soon after detecting obstacles, the server injects a [SCAN AHEAD] prompt forcing it to check what's next.
Silence Monitor — If the model goes quiet for 7+ seconds while the user is walking, the server nudges it with a [HEARTBEAT] prompt.
Turn Re-scan — Gyroscope data from the phone detects when the user changes direction. The server immediately injects a [DIRECTION CHANGE] prompt so the model re-scans the new field of view.
Walking Updates — Every 5 seconds while the user is moving, the server prompts the model to scan for new obstacles since the last report.
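The four mechanisms above share one shape: the server watches timing and model output, and injects a prompt when a gap appears. Here is a minimal sketch of that logic — class name, method names, phrase matching, and the 7s/5s thresholds are illustrative, not Visio's actual code:

```python
import time

HEARTBEAT_GAP_S = 7.0   # nudge after this much model silence while walking
WALK_UPDATE_S = 5.0     # periodic re-scan interval while moving

class ProactivityMonitor:
    """Server-side scaffolding: decides when to inject a prompt."""

    def __init__(self, now=time.monotonic):
        self.now = now
        self.last_model_speech = now()
        self.last_walk_update = now()
        self.active_obstacles = []  # hazards the model has reported

    def on_obstacle_reported(self, obstacle):
        self.active_obstacles.append(obstacle)

    def on_model_speech(self, text):
        """Returns a [SCAN AHEAD] prompt if the model just cleared an obstacle."""
        self.last_model_speech = self.now()
        lower = text.lower()
        if "clear" in lower or "past it" in lower:
            cleared = self.active_obstacles.pop() if self.active_obstacles else "the obstacle"
            return f"[SCAN AHEAD] You just cleared {cleared}. Scan for the NEXT obstacle. What's ahead NOW?"
        return None

    def on_direction_change(self):
        """Triggered by gyroscope data from the phone."""
        return "[DIRECTION CHANGE] The user turned. Re-scan the new field of view."

    def tick(self, user_is_walking):
        """Called periodically; returns a prompt to inject, or None."""
        t = self.now()
        if user_is_walking and t - self.last_model_speech >= HEARTBEAT_GAP_S:
            self.last_model_speech = t
            return "[HEARTBEAT] You've been quiet while the user is walking. Report what's ahead."
        if user_is_walking and t - self.last_walk_update >= WALK_UPDATE_S:
            self.last_walk_update = t
            return "[WALKING UPDATE] Scan for new obstacles since your last report."
        return None
```

The key design point is that all of this state lives outside the model: the LLM only ever sees the injected prompt, never the bookkeeping.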
Adaptive Frame Rate
Sending camera frames at a constant rate wastes tokens and money. I used the phone's accelerometer for step detection and speed estimation:
- Stationary: 0.5 FPS (save tokens)
- Slow walk: 1.3 FPS
- Normal walk: 2 FPS
- Running: 2.5 FPS
Combined with frame-diff analysis that skips unchanged frames, this cut token usage by roughly 60% with no reduction in safety.
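The two pieces — speed-to-FPS mapping and frame-diff skipping — can be sketched as below. The speed thresholds and the pixel-diff parameters are illustrative assumptions, not the app's tuned values:

```python
def target_fps(speed_mps: float) -> float:
    """Map estimated walking speed (m/s, derived from accelerometer
    step detection) to a camera capture rate."""
    if speed_mps < 0.2:   # effectively stationary
        return 0.5
    if speed_mps < 0.9:   # slow walk
        return 1.3
    if speed_mps < 2.0:   # normal walk
        return 2.0
    return 2.5            # running

def should_send(frame, prev_frame, changed_ratio=0.04, pixel_delta=12):
    """Frame-diff gate: skip frames that barely changed.
    Frames here are flat sequences of grayscale values (0-255)."""
    if prev_frame is None:
        return True
    changed = sum(1 for a, b in zip(frame, prev_frame) if abs(a - b) > pixel_delta)
    return changed / len(frame) >= changed_ratio
```

Tokens are saved twice: fewer frames captured when the user is still, and fewer frames actually sent when the scene hasn't changed.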
Client-Side Features
The browser client (vanilla JS) does more than just capture and send:
- Proximity detection — Edge analysis on the bottom quarter of each frame detects near-ground obstacles
- Spatial audio — A `StereoPannerNode` pans the model's voice based on directional keywords ("on your left" plays from the left headphone)
- Haptic feedback — Different vibration patterns for critical, warning, and info alerts
- Auto-focus — Switches to near-range focus in reading mode for close-up text
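The spatial-audio logic is simple keyword matching on the model's transcript. The real client does this in browser JavaScript, feeding the result to a `StereoPannerNode`'s `pan` parameter; the sketch below shows the same mapping in Python, with an illustrative keyword table and pan amounts:

```python
# Directional phrases mapped to stereo pan values (-1 = full left, +1 = full right).
# More specific phrases come first so "ahead on your left" pans left, not center.
PAN_KEYWORDS = [
    ("on your left", -0.8),
    ("to your left", -0.8),
    ("on your right", 0.8),
    ("to your right", 0.8),
    ("ahead", 0.0),
]

def pan_for(text: str) -> float:
    """Return the stereo pan value for a line of narration."""
    lower = text.lower()
    for phrase, pan in PAN_KEYWORDS:
        if phrase in lower:
            return pan
    return 0.0  # no directional cue: centered
```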
Google Cloud Services
| Service | Purpose |
|---|---|
| Gemini 2.5 Flash | Real-time multimodal AI with native audio |
| Google ADK | Agent framework with BIDI streaming |
| Google Search | Grounding for landmarks and brands |
| Cloud Run | Serverless container hosting |
| Cloud Build | Automated Docker builds |
| Cloud Logging | Structured session logs |
| Firestore | Session analytics |
Deployment is a single command: `./deploy.sh PROJECT_ID` handles Cloud Build, Container Registry, and Cloud Run deployment automatically.
The Biggest Challenge: Obstacle Chaining
The hardest problem was what I call "obstacle amnesia." The model would warn about a parked bike, the user would pass it, and then... silence. The post 3 meters ahead? Not mentioned.
The solution was two-fold:
System prompt architecture — A dedicated "Obstacle Chaining" section that explicitly instructs: after clearing ANY obstacle, immediately scan for the NEXT one. Never go silent after "you're past it."
Server-side scan-ahead prompts — When the obstacle memory detects the model said "clear" or "you're past it," it injects:
[SCAN AHEAD] You just cleared {obstacle}. Scan for the NEXT obstacle. What's ahead NOW?
This simple pattern — prompt engineering + server-side reinforcement — made the difference between a demo that works sometimes and an agent that reliably chains obstacle-to-obstacle without gaps.
What I Learned
Proactive audio needs scaffolding. Gemini's proactive audio is powerful but designed for conversation. For continuous narration, server-side prompt injection is essential.
LLMs don't persist state. Obstacle memory, silence monitoring, turn detection — all of this must live on the server because the model can't remember across turns.
System prompt architecture > model parameters. The 15-module system prompt (obstacle chaining, people awareness, priority tiers, surface hazards) determines quality more than any configuration.
Build for the user who can't see the screen. Every design decision — spatial audio, haptic patterns, voice commands — had to work without visual feedback.
Try It
- Live: visio-agent-kiofaqcoyq-uc.a.run.app
- Demo: YouTube
Open the live URL on your phone, grant camera and microphone access, put on headphones, and walk around. Visio will start talking.
Built for the Gemini Live Agent Challenge — Live Agents category. #GeminiLiveAgentChallenge