Sarthak Rawat

Building OmniSight: A Real-Time AI Visual Companion Powered by Gemini Live and Google Cloud

This post details the project I built for the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge


The Problem: The World Doesn't Fit in a Text Box

Most AI assistants are built around a simple loop: you type something, the AI responds. That works great for writing emails or answering trivia. But the real world doesn't fit in a text box.

What if you're holding up a medical bill and want to know if you're being overcharged? What if you're in a foreign country and can't read the menu? What if you're visually impaired and need someone to describe what's in front of you? What if you're signing a lease and want to know if clause 14 is a red flag?

These are real problems. And they all share one thing: they require an AI that can see.

Google's Project Astra gave us a glimpse of what this could look like: a real-time visual agent that sees, understands, and remembers. But Astra was a prototype. We wanted to build the practical version: an agent equipped with specialized tools for real-world, high-stakes use cases. That's OmniSight.

OmniSight is a real-time AI visual companion. Point your camera at anything (a product, a document, a medication bottle, a street sign in Japanese) and have a natural voice conversation about what you see. It remembers what it's seen across sessions, watches for things you care about, and grounds every answer in specialized APIs rather than guessing.


The Architecture: How It All Fits Together

At a high level, OmniSight has three layers:

  1. React Client: The user's viewport. Publishes camera and microphone tracks to a LiveKit room. Built with React 19, Vite, TypeScript, and @livekit/components-react.
  2. FastAPI Server: Handles auth (JWT + bcrypt), session management, and LiveKit room creation. Persists everything to MongoDB via Motor (the async driver).
  3. LiveKit Agent Worker: The brain. A Python agent that joins the LiveKit room, receives the real-time audio/video stream, and runs the Gemini Live API for natural conversation backed by a toolkit of specialized tools.

The data flow looks like this:

  1. User logs in → selects a mode (Healthcare, Shopping, Travel, etc.) → clicks "Connect"
  2. Client requests a token from FastAPI (POST /api/token)
  3. FastAPI creates a LiveKit room, persists a session placeholder to MongoDB, returns a participant token
  4. Client connects to the LiveKit room, publishing camera + mic tracks
  5. The LiveKit agent worker joins the room, loads the user's memory and watchlist from MongoDB
  6. Audio streams to the Gemini Live API in real time for natural, interruptible conversation
  7. When the agent needs to "see," it captures a frame from the WebRTC video stream and routes it to the appropriate GCP service
  8. On session end, all findings, transcript, and tool logs are persisted to MongoDB. A signed GCS URL is generated for the session recording.
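Step 3 above, minting a participant token, is just HS256 JWT signing under the hood. Here's a stdlib-only sketch; the claim names mirror LiveKit's video-grant shape from memory and should be treated as assumptions (the real server uses LiveKit's server SDK rather than hand-rolling this):

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def mint_token(api_key: str, api_secret: str, identity: str, room: str) -> str:
    """Mint an HS256 JWT granting one participant access to one room."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,                                  # API key as issuer
        "sub": identity,                                 # participant identity
        "exp": int(time.time()) + 3600,                  # 1-hour validity
        "video": {"roomJoin": True, "room": room},       # simplified grant shape (assumption)
    }
    signing_input = f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(claims).encode())}"
    sig = hmac.new(api_secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"


token = mint_token("devkey", "secret", "user-42", "omnisight-session")
```

The client passes this token to the LiveKit room connect call; the server never exposes the API secret.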

We chose LiveKit because it's the most production-ready WebRTC infrastructure for AI agents. Its Python SDK integrates cleanly with the agent worker pattern, and it handles the complexity of real-time audio/video routing so we can focus on the agent logic.

We chose the Gemini Live API (gemini-2.5-flash-native-audio-preview) because it's the only production-ready real-time multimodal API that handles natural interruptions. Users can interrupt mid-sentence and the agent recovers seamlessly; this was non-negotiable for a voice-first experience. Turn-based systems feel robotic; Gemini Live feels like a real conversation.


The Grounding Strategy: Why We Don't Just Use Gemini Vision Alone

This is the most important architectural decision we made, and it's what separates OmniSight from a simple "camera + chat" demo.

Early in development, we routed every image directly to Gemini vision. It was fast. But hallucinations were a real problem: the agent would confidently misidentify products, miss critical contract clauses, or return imprecise OCR results. For a healthcare or legal use case, that's not acceptable.

The solution: ground Gemini's reasoning in specialized APIs.

Here's how the grounding pipeline works:

For visual identification (capture_and_analyze):

```python
gemini_result, vision_result = await asyncio.gather(
    gemini_analyze(image_bytes, query_hint),
    cloud_vision.full_analysis(image_bytes),
    return_exceptions=True,
)
```

Gemini vision and Cloud Vision run in parallel. Gemini gives us rich natural language understanding. Cloud Vision gives us structured, high-confidence detections (landmarks with GPS coordinates, logos with confidence scores, labels with probabilities). The agent combines both for a more reliable answer.
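With `return_exceptions=True`, a failure on either branch arrives as an exception object instead of crashing the gather, so the merge step has to tolerate one side being down. A minimal sketch with stubbed analyzers (the result shapes are illustrative assumptions, not the project's actual schemas):

```python
import asyncio


async def gemini_analyze(image: bytes, hint: str) -> dict:
    """Stub for the Gemini vision call."""
    return {"description": "a bottle of ibuprofen, 200 mg tablets"}


async def full_analysis(image: bytes) -> dict:
    """Stub for the Cloud Vision structured analysis."""
    return {"labels": [{"name": "Medicine", "score": 0.97}], "logos": []}


def merge(gemini_result, vision_result) -> dict:
    """Combine both analyses, tolerating a failure on either side."""
    merged: dict = {"sources": []}
    if not isinstance(gemini_result, Exception):
        merged["description"] = gemini_result["description"]
        merged["sources"].append("gemini")
    if not isinstance(vision_result, Exception):
        merged["detections"] = vision_result
        merged["sources"].append("cloud_vision")
    if not merged["sources"]:
        merged["error"] = "all analyzers failed"
    return merged


async def capture_and_analyze(image: bytes, hint: str = "") -> dict:
    g, v = await asyncio.gather(
        gemini_analyze(image, hint),
        full_analysis(image),
        return_exceptions=True,
    )
    return merge(g, v)


result = asyncio.run(capture_and_analyze(b"...", "what is this?"))
```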

For document extraction (scan_document):

```python
doc_result, gemini_result = await asyncio.gather(
    document_ai.process_document(image_bytes, "image/jpeg"),
    _gemini_document_analyze(image_bytes, doc_type),
    return_exceptions=True,
)
```

Google Cloud Document AI handles structured extraction, tables, form fields, key-value pairs, with accuracy that generic vision models can't match. Gemini then provides the plain-language summary and risk assessment on top of that structured data.

For entity extraction (extract_entities):
After Document AI extracts the text, we run it through Cloud Natural Language API to pull out people, organizations, dates, and monetary amounts with entity-level sentiment. This is critical for contracts and medical bills.
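The entity list that comes back can then be filtered by type and salience before it reaches the agent. A sketch of that post-filtering step; the field names echo the API's response shape from memory and the thresholds are illustrative assumptions:

```python
def filter_entities(
    entities: list[dict],
    keep_types: tuple = ("PERSON", "ORGANIZATION", "DATE", "PRICE"),
    min_salience: float = 0.01,
) -> list[dict]:
    """Keep only the entity types relevant to contracts and bills."""
    return [
        e for e in entities
        if e["type"] in keep_types and e["salience"] >= min_salience
    ]


raw = [
    {"name": "Acme Corp", "type": "ORGANIZATION", "salience": 0.42},
    {"name": "$1,250.00", "type": "PRICE", "salience": 0.18},
    {"name": "the", "type": "OTHER", "salience": 0.001},
]
relevant = filter_entities(raw)
```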

The confidence module:
We built a grounding module (tools/grounding.py) that checks confidence scores from every analysis and flags low-confidence results:

```python
def check_visual_confidence(result: dict, thresholds: dict | None = None) -> dict:
    # Checks landmark, logo, and label confidence scores
    # Returns is_confident: bool + specific warnings
    ...
```

When confidence is low, the agent proactively asks the user for more context rather than guessing. This builds trust: users know OmniSight won't make things up.


The Tools: A Specialized Toolkit for the Real World

The agent is equipped with specialized tools across eight categories:

Vision & Reality

  • capture_and_analyze: Gemini + Cloud Vision in parallel. The go-to for any identification task.
  • multi_frame_capture: Captures 2-5 frames over a set interval. Used for multi-page documents, different angles, or when one frame isn't enough. Includes a consistency check across frames.
  • detect_text: Precise OCR via Cloud Vision. 200+ languages.
  • detect_landmark: Identifies monuments and notable buildings with GPS coordinates.
  • detect_logo: Brand identification from logos.
  • visual_search: Reverse image search via Cloud Vision web detection.

Document Intelligence

  • scan_document: Document AI + Gemini in parallel. Handles contracts, receipts, bills, forms.
  • extract_entities: Natural Language API entity extraction on scanned text.
  • analyze_clauses: Gemini-powered clause analysis for contracts and legal documents.

Search & Pricing

  • web_search: Real-time web search via Tavily.
  • compare_prices: Product price comparison across retailers.
  • check_pricing: Fair pricing check for medical bills, invoices, repair quotes.

Health & Safety

  • read_medication: Scans prescription bottles. Returns name, dosage, instructions, warnings, expiry.
  • scan_food: Food label analysis with allergen detection against the user's profile.
  • analyze_ingredients: Deep health assessment of food ingredients.

Location (Google Maps)

  • find_nearby: Google Places API. Find restaurants, pharmacies, ATMs within a radius.
  • get_directions: Google Routes API. Walking, driving, transit, cycling.
  • geocode: Address to coordinates conversion.

Memory (Cross-Session)

  • remember / recall: Persistent MongoDB memory across sessions. The agent remembers products, prices, documents, and preferences from previous sessions.

Smart Alerts / Watchlist

  • add_alert / remove_alert / list_alerts / check_alerts: Users set conditions ("alert me if you see Nike shoes under $50", "watch for peanut ingredients"). After every analysis, the agent checks observations against the watchlist and proactively notifies the user on a match.
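Watchlist matching can be reduced to keyword containment plus an optional price ceiling. A sketch, assuming a simple alert schema of `keywords` and `max_price` (the project's actual schema may differ):

```python
def check_alerts(observation: dict, alerts: list[dict]) -> list[dict]:
    """Return the alerts triggered by one analysis result."""
    matched = []
    text = observation.get("description", "").lower()
    price = observation.get("price")
    for alert in alerts:
        # Every keyword must appear in the observation text
        if all(kw in text for kw in alert["keywords"]):
            cap = alert.get("max_price")
            # Price condition applies only when both sides are known
            if cap is None or (price is not None and price <= cap):
                matched.append(alert)
    return matched


alerts = [
    {"keywords": ["nike"], "max_price": 50.0},
    {"keywords": ["peanut"]},
]
obs = {"description": "Nike Air sneakers on a shelf", "price": 44.99}
triggered = check_alerts(obs, alerts)
```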

Export

  • generate_session_report: Aggregates all session findings into a structured HTML report, saved to MongoDB and accessible from the History page.
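Report generation is essentially a fold over the session's findings into HTML. A minimal sketch; the session schema shown here is an assumption:

```python
import html


def generate_session_report(session: dict) -> str:
    """Render the session's findings as a minimal HTML table."""
    rows = "".join(
        f"<tr><td>{html.escape(f['tool'])}</td><td>{html.escape(f['summary'])}</td></tr>"
        for f in session["findings"]
    )
    return (
        f"<h1>Session {html.escape(session['session_id'])}</h1>"
        f"<table><tr><th>Tool</th><th>Finding</th></tr>{rows}</table>"
    )


report = generate_session_report({
    "session_id": "abc123",
    "findings": [
        {"tool": "scan_document", "summary": "Lease; clause 14 flags auto-renewal"},
        {"tool": "check_pricing", "summary": "Bill is roughly 30% above typical"},
    ],
})
```

Escaping every user-derived string matters here, since finding summaries can contain text OCR'd from the real world.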

Key Technical Decisions

Singleton Clients

Every GCP client and Gemini client is initialized once as a module-level singleton:

```python
_genai_client: genai.Client | None = None

def _get_genai_client() -> genai.Client:
    global _genai_client
    if _genai_client is None:
        _genai_client = genai.Client(api_key=GOOGLE_API_KEY)
    return _genai_client
```

This avoids recreating clients on every tool call, reducing connection overhead and latency significantly. The same pattern is applied to Cloud Vision, Document AI, Natural Language, and Google Maps clients.

Async-First with run_in_executor

All GCP client libraries are synchronous. We wrap every blocking call in asyncio.get_running_loop().run_in_executor() to avoid blocking the event loop:

```python
async def full_analysis(image_bytes: bytes) -> dict:
    return await asyncio.get_running_loop().run_in_executor(
        None, partial(_sync_full_analysis, image_bytes)
    )
```

This keeps the agent responsive during long-running API calls.

Parallel API Calls

Wherever possible, we run multiple API calls in parallel using asyncio.gather:

```python
gemini_result, vision_result = await asyncio.gather(
    gemini_analyze(image_bytes, query_hint),
    cloud_vision.full_analysis(image_bytes),
    return_exceptions=True,
)
```

This cuts latency roughly in half for the most common tool calls.

Centralized Configuration

All environment variables are managed in a single config.py module. No scattered os.environ calls across the codebase. This makes deployment and environment management clean and predictable.
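A centralized config module along these lines keeps env access in one place; the variable names below are assumptions, not the project's actual settings:

```python
import os
from dataclasses import dataclass


def _require(name: str) -> str:
    """Fail fast at startup if a required env var is missing."""
    try:
        return os.environ[name]
    except KeyError:
        raise RuntimeError(f"Missing required env var: {name}") from None


@dataclass(frozen=True)
class Settings:
    google_api_key: str
    mongo_uri: str
    livekit_url: str

    @classmethod
    def from_env(cls) -> "Settings":
        return cls(
            google_api_key=_require("GOOGLE_API_KEY"),
            # Sensible local default; overridden in deployment
            mongo_uri=os.environ.get("MONGO_URI", "mongodb://localhost:27017"),
            livekit_url=_require("LIVEKIT_URL"),
        )
```

Modules then import one `Settings` instance instead of reading `os.environ` themselves, so a missing variable surfaces at boot rather than mid-session.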

Graceful Error Handling

Every tool returns a structured error dict rather than raising exceptions. The agent handles low-confidence results, API failures, and invalid inputs gracefully, always giving the user a useful response rather than crashing.
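One way to get this behavior uniformly is a decorator that converts exceptions into error dicts; a sketch, with the error-dict shape being an assumption:

```python
import asyncio
import functools


def tool_errors(fn):
    """Wrap an async tool so failures become structured dicts instead of raised exceptions."""
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        try:
            return await fn(*args, **kwargs)
        except Exception as exc:
            return {"ok": False, "error": type(exc).__name__, "detail": str(exc)}
    return wrapper


@tool_errors
async def flaky_tool():
    raise ValueError("upstream API timed out")


result = asyncio.run(flaky_tool())
```

The agent can then narrate the failure ("I couldn't reach the pricing service, want me to try again?") instead of going silent.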


Google Cloud Services: The Full Stack

OmniSight is deeply integrated with Google Cloud. Here's every service we use and why:

| Service | Role | Why Google |
| --- | --- | --- |
| Gemini Live API (gemini-2.5-flash-native-audio-preview) | Real-time voice + vision | Only production-ready API with natural interruption handling |
| Gemini Vision (gemini-2.5-flash) | Image analysis, document understanding, clause analysis | Rich natural language output, multimodal reasoning |
| Cloud Vision API | OCR, landmark detection, logo detection, web detection | Structured, high-confidence detections with confidence scores |
| Document AI | Structured document extraction (forms, tables, key-value pairs) | Industry-leading accuracy for contracts, receipts, bills |
| Natural Language API | Entity extraction, sentiment analysis | Reliable entity detection with salience scores |
| Google Cloud Storage | Session recording archives | Seamless LiveKit egress integration, signed URL generation |
| Google Maps Places API | Nearby place search | Comprehensive POI database, real-time open/closed status |
| Google Maps Routes API | Turn-by-turn directions | Multi-modal routing (walking, driving, transit, cycling) |
| Google Maps Geocoding API | Address to coordinates | Accurate global geocoding |

The integration between these services is what makes OmniSight reliable. We're not asking Gemini to guess what a document says; we're using Document AI for structured extraction and Gemini for reasoning on top of that structure. We're not asking Gemini to identify a landmark on its own; we're using Cloud Vision for detection and Gemini for context. Every tool uses the right service for the right job.


The 7 Modes: Context-Aware Behavior

One of OmniSight's most practical features is its mode system. Before starting a session, users select a context:

  • General: All-purpose visual companion
  • Accessibility: Spatial descriptions, safety warnings, proactive hazard detection for visually impaired users
  • Education: Patient tutor mode, step-by-step reasoning, checking questions
  • Healthcare: Medication safety, allergen detection, medical document analysis
  • Shopping: Price comparison, deal spotting, product identification
  • Travel: Translation, landmark context, navigation, cultural tips
  • Professional: Contract analysis, invoice extraction, compliance flags

Each mode injects a specialized behavioral prompt into the agent's system instructions. The same underlying tools are available in every mode, but the agent's priorities, communication style, and proactive behavior adapt to the context.

For example, in Healthcare mode, the agent always includes a "consult a doctor" disclaimer and proactively flags expired medications. In Accessibility mode, it uses spatial language ("3 feet ahead, on your left") and prioritizes safety hazards. In Professional mode, it's concise and flags legal/financial concerns immediately.
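Mode injection can be as simple as a prompt lookup merged into the base instructions. A sketch with illustrative prompt text (not the project's actual prompts):

```python
BASE_INSTRUCTIONS = (
    "You are OmniSight, a real-time visual companion. "
    "Ground every answer in tool results."
)

# Behavioral add-ons per mode (wording is illustrative)
MODE_PROMPTS = {
    "healthcare": "Always include a 'consult a doctor' disclaimer. Proactively flag expired medications.",
    "accessibility": "Use spatial language ('3 feet ahead, on your left'). Prioritize safety hazards.",
    "professional": "Be concise. Flag legal and financial concerns immediately.",
}


def build_system_prompt(mode: str) -> str:
    """Append the mode's behavioral prompt to the base system instructions."""
    return "\n\n".join(p for p in (BASE_INSTRUCTIONS, MODE_PROMPTS.get(mode, "")) if p)
```

Because only the prompt changes, every mode keeps the full toolkit while the agent's priorities shift.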


Challenges We Faced

Real-time latency: Capturing a frame, running it through Cloud Vision and Gemini in parallel, and returning a response while maintaining a live voice conversation requires careful async management. The run_in_executor pattern and parallel asyncio.gather calls were essential here.

Hallucination prevention: Early versions relied too heavily on Gemini vision alone. Adding Document AI and Cloud Vision as structured validators dramatically improved reliability for high-stakes use cases. The confidence module was a late addition that made a big difference: the agent now knows when it doesn't know.

Multi-frame consistency: When capturing multiple frames for document scanning or multi-angle analysis, frames can be inconsistent (user moves the camera, lighting changes). We built a consistency checker that compares entity sets across frames using Jaccard similarity and warns the agent when frames diverge.
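The consistency check described above can be sketched with plain set arithmetic; the 0.5 threshold and the compare-to-first-frame strategy are illustrative assumptions:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two entity sets (1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def frames_consistent(entity_sets: list[set], threshold: float = 0.5) -> bool:
    """Compare each later frame's entities against the first frame's."""
    first = entity_sets[0]
    return all(jaccard(first, s) >= threshold for s in entity_sets[1:])


frames = [
    {"acme corp", "$1,250", "2024-06-01"},
    {"acme corp", "$1,250"},
    {"acme corp", "$1,250", "2024-06-01", "net 30"},
]
ok = frames_consistent(frames)
```

When the check fails, the agent can tell the user a frame looked different and ask them to hold the camera steady, rather than silently merging divergent captures.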

Cross-session memory: Designing a memory system that's useful without being noisy was tricky. We settled on a category-based approach (product, document, price, landmark, translation, preference) with keyword search, loading the 20 most recent memories at session start as context.


Results and Learnings

Building OmniSight taught us a few things about real-time multimodal agents:

Grounding matters more than model quality. A well-grounded Gemini Flash response is more reliable than an ungrounded Gemini Pro response. Specialized APIs exist for a reason, use them.

Proactive behavior is hard to tune. The agent needs to speak up when it notices something relevant, but not be annoying. The system prompt does a lot of work here, and it took many iterations to get the balance right.

The voice experience changes everything. When users can interrupt naturally and the agent recovers seamlessly, the interaction feels fundamentally different from a chat interface. Gemini Live API makes this possible in a way that turn-based systems simply can't replicate.

Cross-session memory is underrated. Users love when the agent says "Last time we looked at this TV, the best price was $749 at Amazon." It transforms a tool into a companion.


What's Next

OmniSight is one of three projects our team built for the Gemini Live Agent Challenge. We also built:

  • A computer control agent (UI Navigator category): an agent that observes your screen and executes actions from natural language instructions
  • A 3D interactive history explorer (Creative Storyteller category): an explorable globe where clicking locations yields rich, interleaved historical narratives with generated imagery and narration

Each project targets a different category, but they all share the same philosophy: use the right tool for the right job, ground AI reasoning in structured data, and build experiences that feel genuinely useful rather than impressive demos.


Try It Yourself

Built with Gemini Live API, Gemini Vision, Cloud Vision, Document AI, Natural Language API, Google Cloud Storage, Google Maps API, LiveKit, FastAPI, React 19, and MongoDB.


Created for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge
