Vikas Goel

Posted on Mar 16

How I Built a Spatial Intelligence Agent That Sees, Thinks, and Speaks — Using Gemini Live API

#googlecloud #ai #gemini #a11y

Created for the Gemini Live Agent Challenge #GeminiLiveAgentChallenge

What if your phone could be a skilled human guide — one that sees the world through your camera, understands what matters, and tells you only what you need to hear?

That's Drishti (दृष्टि — Sanskrit for "Vision"). It's a spatial intelligence agent built on Google's Gemini Live API that transforms any smartphone into a real-time navigation companion for visually impaired users. No special hardware. No wearable devices. Just a phone on a chest lanyard and a voice that understands your world.

In this post, I'll share how I built it across 17 production revisions, the architectural decisions that made it work, and the hardest problem I solved — one that has nothing to do with AI models and everything to do with time.

The Problem: Chatbots Can't Navigate the Physical World

1.3 billion people worldwide live with visual impairment. Current assistive technology falls into two categories: expensive specialized hardware (thousands of dollars) or AI chatbots that describe what they see frame-by-frame.

Neither works for real-time navigation. Here's why:

A blind person doesn't need to hear "I see a door, a wall, two chairs, a houseplant, and a rug." They need to hear "Door ahead, 2 steps" and then silence until the next thing that matters.

The difference between a chatbot and a guide is editorial judgment: knowing what to say, when to say it, and — most importantly — when to say nothing. A skilled human guide speaks about 8 times in a 5-minute walk. My first prototype spoke 30 times. Getting from 30 to 8 required more engineering than getting from 0 to 30.

The Architecture: Three Geminis, One Conductor

I call it the Conductor Model because the Python backend doesn't see or speak. It conducts three Google AI services (two Gemini models plus Cloud Vision), each doing what it's best at:

Gemini Live API — The Voice

Bidirectional audio streaming with interruption handling. The user can interrupt mid-sentence, switch to Hindi, ask questions. Gemini handles all of this natively with proactive_audio and affective_dialog enabled. The voice feels natural, not robotic.

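# Gemini Live session config (google-genai SDK types)
from google.genai.types import (
    AutomaticActivityDetection, Content, LiveConnectConfig, Part,
    PrebuiltVoiceConfig, RealtimeInputConfig, SpeechConfig, VoiceConfig,
)

# system_prompt is the system instruction string, defined elsewhere
config = LiveConnectConfig(
    response_modalities=["AUDIO"],  # reply with speech, not text
    speech_config=SpeechConfig(
        voice_config=VoiceConfig(prebuilt_voice_config=PrebuiltVoiceConfig(voice_name="Kore"))
    ),
    system_instruction=Content(parts=[Part(text=system_prompt)]),
    # Server-side voice activity detection stays on, so the user can interrupt
    realtime_input_config=RealtimeInputConfig(
        automatic_activity_detection=AutomaticActivityDetection(disabled=False)
    ),
)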

Gemini 2.5 Flash — The Eyes

Event-driven scene analysis via generateContent. Not called every frame — called only when something changes (body turn, stair entry, significant visual change). Returns structured JSON:

{
  "environment": {"type": "indoor_stairs", "description": "concrete staircase going down"},
  "path_ahead": {"clear": false, "blocked_by": "coffee table"},
  "objects": [
    {"what": "dog", "where": "ahead, 10 o'clock", "distance": "5 steps",
     "navigation_relevant": true}
  ],
  "suggested_speech": "Dog 5 steps ahead at 10 o'clock."
}

The navigation_relevant field is the key innovation here. In a crowded market, Gemini aggregates 20 people as "not relevant" but flags one person pushing a cart. This single boolean solved the attention budgeting problem that our entire Python tracker couldn't handle.
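As a rough sketch of how a backend might consume this result, the snippet below keeps only the objects flagged navigation_relevant. The function name relevant_objects and the second object in the sample payload are illustrative assumptions, not Drishti's actual code.

```python
import json


def relevant_objects(scene_json: str) -> list[dict]:
    """Return only the objects the model flagged as navigation-relevant."""
    scene = json.loads(scene_json)
    # A missing flag is treated as "not relevant", so unflagged objects stay silent.
    return [o for o in scene.get("objects", []) if o.get("navigation_relevant", False)]


sample = json.dumps({
    "objects": [
        {"what": "dog", "where": "ahead, 10 o'clock", "distance": "5 steps",
         "navigation_relevant": True},
        # Hypothetical second object, added only to show the filter at work.
        {"what": "crowd", "where": "right", "distance": "8 steps",
         "navigation_relevant": False},
    ],
})
names = [o["what"] for o in relevant_objects(sample)]
```

Defaulting a missing flag to False matches the "default is silence" stance: an object has to earn its place in the user's ear.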

Critical config: thinking_budget=0. Gemini 2.5 Flash is a thinking model: by default it spends up to 8,192 tokens reasoning before answering. For scene perception, this thinking adds latency without improving quality. Setting thinking_budget=0 cut response time from 7 seconds to 2.4 seconds, a 66% reduction.
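A minimal sketch of that setting, assuming the google-genai Python SDK, which accepts a plain dict as the generateContent config. The helper names and the JSON response_mime_type are assumptions made for illustration; only thinking_budget=0 and the two latency figures come from the text above.

```python
def build_perception_config(thinking_budget: int = 0) -> dict:
    """Build a generateContent config dict with thinking disabled by default."""
    return {
        # A budget of 0 turns thinking off on Gemini 2.5 Flash.
        "thinking_config": {"thinking_budget": thinking_budget},
        # Assumption: request JSON so the scene result can be parsed directly.
        "response_mime_type": "application/json",
    }


def latency_reduction(before_s: float, after_s: float) -> float:
    """Fractional latency reduction between two measurements."""
    return (before_s - after_s) / before_s


config = build_perception_config()
saving = latency_reduction(7.0, 2.4)  # the 7 s and 2.4 s figures quoted above
```

The dict would be passed as the config argument of client.models.generate_content(...), alongside the model name and the frame.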

Cloud Vision API — The Safety Tripwire

At 200ms latency, Cloud Vision detects vehicles in the walking path before the 2.4-second cognitive analysis can. It doesn't understand scenes — it just says "bicycle, 15% of frame, center." But for a cyclist approaching at 15 km/h, those 2 seconds matter.

The World Model: A Brain That Knows When to Shut Up

The World Model is the conductor. It receives results from all three services and makes one decision: speak or stay silent.

It maintains four behavioral dimensions inspired by cognitive science models of human spatial reasoning:

Alertness — spikes near stairs and crossings, decays in stable environments
Urgency — responds to obstacle proximity, drives cooldown bypass
Spatial Confidence — tracks how fresh our perception is. THIS is the self-correcting mechanism.
Verbosity — responds to user commands ("be quiet" / "describe everything") and environment complexity

These dimensions drive a 9-priority editorial decision engine:

P1: Vehicle emergency (CV, 200ms)
P2: Fast obstacle (CV + temporal validation)
P3: Safety alert (cognitive)
P4: Environment transition (cognitive)
P5: Goal match (speech + cognitive)
P6: Path blocked (cognitive)
P7: New navigation object (cognitive)
P8: Proactive info (cognitive suggested_speech)
P9: Memory augmentation (stored landmarks)
Default: SILENCE

Every priority level is gated by temporal validation (more on that below). The default is silence. This is what makes Drishti a guide, not a narrator.
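The selection rule can be sketched in a few lines. This is an illustrative reduction, not the real engine: the Candidate type and the decide signature are assumptions, and cooldowns and the behavioral dimensions are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    priority: int  # 1 (vehicle emergency) through 9 (memory augmentation)
    text: str      # what would be spoken
    status: str    # temporal validation result: "VALID", "IMMINENT" or "STALE"


def decide(candidates: list[Candidate]) -> Optional[str]:
    """Return the utterance to speak, or None for silence (the default)."""
    live = [c for c in candidates if c.status != "STALE"]
    if not live:
        return None
    # Lowest number wins: P1 outranks P9.
    return min(live, key=lambda c: c.priority).text


pending = [
    Candidate(7, "Houseplant 2 steps ahead", "VALID"),
    Candidate(2, "Stop! Door ahead", "STALE"),
    Candidate(6, "Path blocked by coffee table", "VALID"),
]
utterance = decide(pending)
```

Here the P2 alert is dropped because it is stale, so the P6 message wins over the P7 one. With no live candidates the function returns None, which is the silent path.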

The Hardest Problem: Time

Here's what nobody tells you about real-time spatial AI: by the time you process a frame, the user has moved.

Gemini 2.5 Flash takes 2.4 seconds to analyze a frame. At walking speed (1.2 m/s), the user moves 2.9 meters in that time. A "door 1 meter ahead" warning arrives after the user has already walked through the door.
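The arithmetic is simple enough to check directly. The sketch below uses only figures quoted in this post (1.2 m/s walking speed, 2.4 s cognitive latency, 200 ms Cloud Vision latency, a 15 km/h cyclist); the function name is illustrative.

```python
def distance_during_latency(speed_mps: float, latency_s: float) -> float:
    """Distance covered while a perception result is still in flight."""
    return speed_mps * latency_s


# A walking user during one cognitive call.
walker_lag_m = distance_during_latency(1.2, 2.4)

# Extra ground a 15 km/h cyclist covers if only the slower path is used:
# the gap between the 2.4 s cognitive call and the 0.2 s Cloud Vision call.
cyclist_gap_m = distance_during_latency(15 / 3.6, 2.4 - 0.2)
```

That works out to roughly 2.9 m for the walker and about 9 m of closing distance for the cyclist, which is why the fast path exists at all.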

My first attempt at fixing this was to use Cloud Vision (200ms) as a fast obstacle detector. It created worse problems:

Rev 15 test results:
  7 "Stop!" alerts fired
  Only 2 were correct
  5 were false positives — doors, houseplants, clothing

  User feedback (Hindi, translated):
  "Why report the box so late?"
  "The houseplant info is old"

CV has no spatial understanding. "Door, 18% of frame, center" — is the user walking through it or into it? CV doesn't know. Adding more filters created a cascade of new problems.

The Real Fix: Temporal Validation

Every camera frame gets stamped with the phone's sensor state at capture time: speed, compass heading, step count. When a perception result arrives seconds later, the system computes:

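STRIDE_M = 0.7  # meters per step

def validate(current_step_count, snapshot_step_count,
             current_heading, snapshot_heading, obstacle_distance):
    """Classify a perception result that arrives after the user has moved."""
    distance_moved = (current_step_count - snapshot_step_count) * STRIDE_M
    # Shortest angular difference, so 350° vs 10° counts as 20°, not 340°.
    # A large turn also marks the result stale (that check is not shown here).
    heading_change = abs((current_heading - snapshot_heading + 180) % 360 - 180)
    remaining = obstacle_distance - distance_moved

    if remaining < -0.5:    # User passed it
        return "STALE"       # → Drop silently. User already walked through.

    if remaining < 1.5:     # About to hit it
        return "IMMINENT"    # → Bypass cooldown. Warn NOW.

    return "VALID"           # → Normal alert with corrected distance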

This doesn't need GPS (too imprecise indoors — ±10-20m). Step count from the accelerometer and compass heading work everywhere, including inside buildings.

Results from Rev 16 session:

Status	Count	Example
Stale (dropped)	6	"user turned 105°", "user moved 4.5m past obstacle"
Imminent (urgent)	1	couch at 0.8m → "Stop! Couch 1 step ahead!"
Valid (normal)	1	houseplant at 1.6m → "houseplant about 2 steps ahead"

Six false positives silently killed. One genuine obstacle correctly urgent. Zero false alerts reached the user.
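Two supporting pieces are worth sketching: the heading check behind drops like "user turned 105°", and the conversion from a corrected distance to the step counts the user hears. The snapshot type, the helper names, and the 90° turn threshold are all assumptions made for illustration; only the 0.7 m stride comes from the formula above.

```python
from dataclasses import dataclass

STRIDE_M = 0.7       # meters per step, as in the formula above
MAX_TURN_DEG = 90.0  # assumption: a larger turn invalidates the frame


@dataclass(frozen=True)
class SensorSnapshot:
    """Sensor state stamped onto a camera frame at capture time."""
    step_count: int
    heading_deg: float


def heading_delta(a_deg: float, b_deg: float) -> float:
    """Shortest angular difference in degrees, safe across the 0/360 wrap."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def turned_away(snapshot: SensorSnapshot, current: SensorSnapshot,
                max_turn_deg: float = MAX_TURN_DEG) -> bool:
    """True when the user has turned too far for the frame to still apply."""
    return heading_delta(current.heading_deg, snapshot.heading_deg) > max_turn_deg


def remaining_steps(obstacle_m: float, snapshot: SensorSnapshot,
                    current: SensorSnapshot) -> int:
    """Corrected distance to the obstacle, in whole steps (at least 1)."""
    moved_m = (current.step_count - snapshot.step_count) * STRIDE_M
    return max(1, round((obstacle_m - moved_m) / STRIDE_M))
```

With these assumptions, a 105° turn is rejected, a heading change from 350° to 10° counts as only 20°, and 1.6 m and 0.8 m come out as 2 steps and 1 step, consistent with the phrasing in the table.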

The Brain's Self-Correcting Loop

The spatial confidence dimension creates an automatic feedback loop:

Cognitive runs → confidence HIGH (we know what's here)
User walks 3 meters → confidence DECAYS (we're in unknown territory)
Confidence drops below 0.2 → vigilance SPIKES → triggers new cognitive call
New cognitive runs on fresh frame → confidence RECOVERS

The system literally knows when it doesn't know, and actively seeks to fix that. When I showed the brain panel to testers, the consistent reaction was "it's thinking" — exactly the response I wanted.
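A toy version of the loop, to make the feedback concrete. The 0.2 threshold comes from the steps above; the linear decay and its rate of 0.3 per meter are assumptions picked so that about 3 meters of walking triggers a refresh, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpatialConfidence:
    value: float = 1.0
    decay_per_meter: float = 0.3  # assumption: linear decay with distance walked
    threshold: float = 0.2        # below this, request a fresh cognitive call

    def on_cognitive_result(self) -> None:
        """A fresh scene analysis restores full confidence."""
        self.value = 1.0

    def on_movement(self, meters: float) -> bool:
        """Decay confidence; return True when a new cognitive call is due."""
        self.value = max(0.0, self.value - self.decay_per_meter * meters)
        return self.value < self.threshold


confidence = SpatialConfidence()
after_one_meter = confidence.on_movement(1.0)     # still confident
after_three_meters = confidence.on_movement(2.0)  # now below the threshold
confidence.on_cognitive_result()                  # fresh frame analyzed
```

Tying decay to distance walked rather than elapsed time means a user standing still does not trigger new cognitive calls.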

What I Discovered About Gemini

Function calling from Live API has ~16% success rate

I ran 6 controlled experiments sending identical tool declarations through the native audio model. Function calls were mechanically dropped — not a prompting issue, but a platform limitation. This forced the complete architectural separation: Live for voice, generateContent for perception.

In hindsight, this made the system better. Each Gemini instance does what it's best at.

Gemini's visual understanding beats custom CV pipelines

My v3.1 Python pipeline used SORT tracking (Kalman Filter + Hungarian Algorithm) to maintain object identity across frames. It produced 12 false "houseplant approaching" alerts in a 5-minute test. Gemini said "Potted plants line both sides of the path" — accurate, contextual, mentioned once. I deleted 400 lines of tracker code.

Silence requires more engineering than speech

Getting the system to speak was trivial — inject text, Gemini talks. Getting it to stay silent for 2 full minutes during stable walking while remaining ready to warn about obstacles required the entire editorial decision engine, behavioral dimensions, cooldown management, temporal validation, and the self-correcting confidence loop.

Phone Sensors: The Underrated Superpower

The accelerometer and compass in your phone are reliable enough for step counting and heading tracking, and mobile browsers expose them over HTTPS via the DeviceMotion and DeviceOrientation APIs:

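let stepCount = 0;
let lastStepTime = 0;
let heading = null;

// Step detection from accelerometer
window.addEventListener('devicemotion', (e) => {
    const acc = e.acceleration;  // can be null if the device lacks the sensor
    if (acc && Math.abs(acc.y) > 3.5) {
        const now = Date.now();
        if (now - lastStepTime > 250) {  // debounce: at most one step per 250 ms
            stepCount++;
            lastStepTime = now;
        }
    }
});

// Compass heading
window.addEventListener('deviceorientation', (e) => {
    if (typeof e.webkitCompassHeading === 'number') {
        heading = e.webkitCompassHeading;  // iOS Safari: degrees clockwise from North
    } else if (e.alpha !== null) {
        // alpha is 0-360 but only North-referenced when e.absolute is true;
        // a relative value is still enough to measure heading change
        heading = e.alpha;
    }
});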

iOS caveat: Since iOS 13, DeviceMotionEvent.requestPermission() must be called inside a user gesture handler (click/tap), before any other async calls. Our sensors silently failed in 3 production revisions before we figured this out.

With step count and compass heading, we can compute:

Distance moved: steps × 0.7m stride length
Direction changed: heading delta since frame capture
Movement state: walking / stationary / stairs (from gravity-axis deviation)

This is all the temporal validator needs. No GPS required.

The Goal System: Implicit + Explicit, One Priority List

Goals emerge from two sources and coexist seamlessly:

Implicit goals emerge from the environment. When the user enters a staircase, stair_navigation activates at priority 0.9 — automatically, from sensors detecting vertical acceleration + cognitive confirming stairs. When the user reaches level ground, it expires automatically. No user input needed.

Explicit goals come from user speech. When the user says "Machli kahan milegi?" (Hindi: "Where can I find fish?"), the ConversationInterpreter extracts a goal from Gemini's response (not from the user's garbled speech — Gemini's response is cleaner and more reliable). The goal persists until achieved or expired.

Both goal types sit in ONE priority-sorted list. The World Model's decide() method doesn't know or care where a goal came from. When cognitive detects a "FRESH FISH" sign 2 minutes later, the goal matches and fires. The user hears: "Fresh Fish sign on your right!"
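A compact sketch of that single list, under stated assumptions: the Goal and GoalList types, the keyword matching, and the 0.6 priority for the fish goal are all illustrative. Only the stair_navigation priority of 0.9 comes from the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Goal:
    name: str
    priority: float
    source: str                      # "implicit" or "explicit"; never used for matching
    keywords: tuple[str, ...] = ()


@dataclass
class GoalList:
    goals: list[Goal] = field(default_factory=list)

    def add(self, goal: Goal) -> None:
        """Insert a goal and keep the list sorted, highest priority first."""
        self.goals.append(goal)
        self.goals.sort(key=lambda g: g.priority, reverse=True)

    def match(self, scene_text: str) -> Optional[Goal]:
        """Return the highest-priority goal whose keyword appears in the scene."""
        text = scene_text.lower()
        for goal in self.goals:
            if any(k in text for k in goal.keywords):
                return goal
        return None


goals = GoalList()
goals.add(Goal("find_fish", 0.6, "explicit", ("fish",)))  # priority is an assumption
goals.add(Goal("stair_navigation", 0.9, "implicit", ("stairs",)))
hit = goals.match("FRESH FISH sign on the right")
```

The match step never looks at source, which is the point: implicit and explicit goals compete on priority alone.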

The Evolution: 17 Revisions

Rev	What happened	What I learned
6	v3.1 deployed, 30 alerts in 5 min	Python perception pipeline produces too many false positives
7	v4 first deploy, 89/91 cognitive calls failed (429)	Free tier has 20 RPD limit
12	"Coffee table 2 steps ahead. Clear to your left."	The conductor model WORKS
13	Too silent — 4 utterances in 209 seconds	Verbosity gate was blocking suggested_speech
14	Cognitive 7s latency, warnings too late	thinking_budget=0 cuts to 2.4s
15	5/7 "Stop!" were false positives	CV can't make navigation judgments
16	6 stale dropped, 1 imminent correct, 0 false	Temporal validation filters stale alerts before they reach the user

Each revision was tested in production — real walking, real obstacles, real dog. Structured logs from every session drove the next architectural decision.

Try It Yourself

Drishti is live at drishti-whn43ovjpq-uc.a.run.app

Open on a phone (iOS or Android), grant camera + mic + sensor permissions, and walk around your house. You'll hear Drishti describe obstacles — and more importantly, you'll hear it stay quiet when there's nothing to say.

Source code: github.com/vikasjoel/dristi

Tech Stack

Backend: Python 3.11, FastAPI, WebSockets on Google Cloud Run
AI: Gemini Live API (voice), Gemini 2.5 Flash (perception), Cloud Vision API (safety)
Frontend: HTML5 PWA with Web Sensors API, Geolocation API
Key configs: thinking_budget=0, proactive_audio, affective_dialog

What's Next

Stationary camera modes — baby monitoring, elderly care, security (plugin architecture already supports them)
Topological mapping from scene transitions — building spatial maps without SLAM
Maps API integration for navigation beyond camera view
User calibration — learning individual stride length and verbosity preferences

Built with genuine passion for accessibility. If you're building with the Gemini Live API, I'd love to hear what you're creating. The future of AI isn't text boxes — it's spatial intelligence.

#GeminiLiveAgentChallenge

Created for the purposes of entering the Gemini Live Agent Challenge hackathon.

DEV Community