Aditya Mishra
Building Iris: A Real-Time Spatial Awareness Agent with the Gemini Live API

Created for the Gemini Live Agent Challenge #GeminiLiveAgentChallenge

What is Iris?

Iris is a real-time spatial awareness agent that sees through your camera and talks to you. Point your device at anything — a room, a street, a workspace — and Iris describes what it sees, warns you about obstacles, reads signs, and identifies people and their gestures. All through voice, hands-free.

It's not just an accessibility tool. Iris is built for anyone who needs an extra pair of eyes — a warehouse worker navigating a crowded floor, a cyclist wanting awareness of their blind spot, a remote worker showing their setup to a colleague, or a visually impaired person walking through an unfamiliar building. The camera becomes a conversation partner.

Why This Matters

We interact with AI through text boxes. We type, we wait, we read. But spatial awareness doesn't work in turns — the real world moves continuously, and the information you need is often urgent. "There's a step ahead of you" is useless five seconds late.

Iris breaks the text-box paradigm. It watches continuously and speaks up when something changes. No wake words, no buttons, no screens to look at. The interaction model is closer to having a co-pilot than using a chatbot.

For visually impaired users, this could mean navigating a grocery store independently. For a delivery driver, it could mean hands-free package verification. For a security professional, it could mean real-time monitoring narration. The same core capability serves fundamentally different use cases.

How We Built It

Iris runs on the Gemini Live API using the @google/genai SDK over WebSocket. The frontend is React with TypeScript. Here's the pipeline:

Camera → Canvas → Base64 JPEG → Gemini Live API → Audio Response → Speaker

We capture webcam frames at 3 fps, downscale them to 50% resolution on a hidden canvas, encode them as JPEG, and stream them via sendRealtimeInput(). Audio comes back as raw PCM and plays through the Web Audio API.
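The capture side of that pipeline can be sketched as below. This is a minimal illustration, not the actual Iris source: `session` stands in for a connected Gemini Live session from @google/genai, and the exact `sendRealtimeInput` payload shape is an assumption based on the post's description. The two helpers are pure: `toDataURL()` returns a `data:` URL whose base64 payload must be stripped out, and the PCM samples coming back need conversion to the Float32 range Web Audio expects.

```typescript
const FPS = 3;
const SCALE = 0.5; // 50% downscale, as described above

// toDataURL() returns "data:image/jpeg;base64,<payload>"; the API
// wants only the raw base64 payload after the comma.
export function stripDataUrlPrefix(dataUrl: string): string {
  const comma = dataUrl.indexOf(",");
  return comma >= 0 ? dataUrl.slice(comma + 1) : dataUrl;
}

// Convert 16-bit PCM samples to the Float32 range [-1, 1] that
// Web Audio AudioBuffer channel data expects.
export function pcmToFloat32(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) out[i] = pcm[i] / 32768;
  return out;
}

// Draw the live <video> element onto a hidden canvas at reduced
// resolution and stream each frame; returns the interval id.
export function startStreaming(
  video: HTMLVideoElement,
  session: { sendRealtimeInput: (input: object) => void },
): number {
  const canvas = document.createElement("canvas"); // never attached to DOM
  return window.setInterval(() => {
    canvas.width = video.videoWidth * SCALE;
    canvas.height = video.videoHeight * SCALE;
    const ctx = canvas.getContext("2d")!;
    ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
    const b64 = stripDataUrlPrefix(canvas.toDataURL("image/jpeg", 0.7));
    session.sendRealtimeInput({
      media: { data: b64, mimeType: "image/jpeg" },
    });
  }, 1000 / FPS);
}
```

Keeping the canvas off-DOM avoids any layout cost; the only per-frame work is one draw and one JPEG encode.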

The interesting part is scene change detection. The Gemini Live API is reactive — the model only speaks when you speak to it. Video frames are passive context. So we built a client-side pixel-diff algorithm: a second hidden canvas at 80x45 pixels compares grayscale values between consecutive frames. When the mean absolute difference crosses a threshold, we send a text nudge to the model: "The scene just changed. Describe what you see."
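The diff check itself is simple enough to sketch in full. This is an illustrative version, assuming both frames are grayscale byte arrays read from the small 80x45 canvas; the luma weights are the standard BT.601 coefficients, not necessarily what Iris uses.

```typescript
// Grayscale conversion from RGBA canvas pixel data (BT.601 luma).
export function toGrayscale(rgba: Uint8ClampedArray): Uint8ClampedArray {
  const gray = new Uint8ClampedArray(rgba.length / 4);
  for (let i = 0; i < gray.length; i++) {
    const o = i * 4;
    gray[i] = 0.299 * rgba[o] + 0.587 * rgba[o + 1] + 0.114 * rgba[o + 2];
  }
  return gray;
}

// Mean absolute difference between consecutive grayscale frames
// (values 0-255). Returns true when the scene is considered changed;
// threshold 20 matches the value the post settles on.
export function sceneChanged(
  prev: Uint8ClampedArray,
  curr: Uint8ClampedArray,
  threshold = 20,
): boolean {
  let sum = 0;
  for (let i = 0; i < curr.length; i++) {
    sum += Math.abs(curr[i] - prev[i]);
  }
  return sum / curr.length > threshold;
}
```

At 80x45 that is only 3,600 comparisons per frame, cheap enough to run on every capture tick.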

This was harder than it sounds.

The Challenges

Hallucination was our biggest enemy. Early versions would confidently describe objects that weren't there. We discovered that how you send data matters enormously — text messages sent via sendClientContent() go through a different channel than video frames sent via sendRealtimeInput(). The model would receive a text question without visual context and just guess. Our fix: capture the current frame and attach it inline with every text message.
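The fix amounts to always bundling a frame with the text. A sketch of the message construction, assuming the Gemini API's usual `{ role, parts }` content shape (the `textWithFrame` helper and the wiring comment are illustrative, not the actual Iris code):

```typescript
export interface Part {
  text?: string;
  inlineData?: { mimeType: string; data: string };
}

export interface Turn {
  role: "user";
  parts: Part[];
}

// Pair every text nudge with the current camera frame so the model
// answers against fresh visual context instead of guessing.
export function textWithFrame(text: string, frameBase64: string): Turn {
  return {
    role: "user",
    parts: [
      { text },
      { inlineData: { mimeType: "image/jpeg", data: frameBase64 } },
    ],
  };
}

// Hypothetical wiring with a connected Live session:
// session.sendClientContent({
//   turns: [textWithFrame("The scene just changed. Describe what you see.", b64)],
//   turnComplete: true,
// });
```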

The scene detection threshold is a balancing act. Too sensitive (threshold 12) and the model gets triggered by camera noise, leading to constant hallucinated descriptions. Too conservative (threshold 30) and it misses real changes. We settled on 20 with an 8-second cooldown — not perfect, but stable.
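The cooldown half of that trade-off can be expressed as a small gate that fires at most once per window. A minimal sketch (the class name and shape are ours, not from the Iris source; 8000 ms matches the cooldown described above):

```typescript
// Fires at most once per cooldown window, even if the pixel diff
// keeps crossing the threshold on consecutive frames.
export class ChangeGate {
  private lastFired = -Infinity;

  constructor(private cooldownMs = 8000) {}

  // `now` is injected for testability; real callers pass Date.now().
  shouldFire(changed: boolean, now: number): boolean {
    if (!changed || now - this.lastFired < this.cooldownMs) return false;
    this.lastFired = now;
    return true;
  }
}
```

Separating the gate from the diff makes each tunable independently: the threshold controls *what* counts as a change, the cooldown controls *how often* the model is asked about it.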

"Say nothing if nothing changed" doesn't work with audio output. We tried periodic polling, instructing the model to stay silent when the scene was static. But with responseModalities: [AUDIO], the model must produce audio for every turn. It would fabricate descriptions just to have something to say. The pixel-diff approach solved this by only asking when something genuinely changed.

What's Next

Iris is functional but far from finished. The vision model still makes mistakes — it might call a bookshelf a window or miss a person standing still. These are limitations of the underlying model, not the architecture. As Gemini's vision improves, Iris improves automatically.

We're also exploring multilingual support — Iris already supports 10 languages through a simple dropdown that adjusts the system prompt. No translation API needed; Gemini handles it natively.
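The dropdown wiring is just prompt composition. A sketch under that assumption (the prompt wording here is illustrative, not the actual Iris system prompt):

```typescript
// Fold the selected language into the system prompt instead of
// routing responses through a translation API.
const BASE_PROMPT =
  "You are Iris, a real-time spatial awareness assistant. " +
  "Describe scenes concisely and warn about obstacles immediately.";

export function systemPromptFor(language: string): string {
  if (language === "English") return BASE_PROMPT;
  return `${BASE_PROMPT} Always respond in ${language}.`;
}
```

Reconnecting the Live session with the updated system prompt is all a language switch requires.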

The code is open source on GitHub.


Built with the Gemini Live API, Google GenAI SDK, and React. #GeminiLiveAgentChallenge
