
Lalit Mishra

Seeing is Believing: Multimodal AI Agents with Google Gemini Live API

Introduction – Why Voice Alone Is Not Enough

For the past year, the engineering community has aggressively optimized voice-only AI agents, driving glass-to-glass latency down to the sub-500 millisecond threshold. However, as we deploy these systems into complex real-world environments, we repeatedly encounter the fundamental bandwidth limitation of human speech. Voice is a remarkably low-bandwidth and highly ambiguous channel for conveying spatial, visual, or structural information. Attempting to explain a flashing diagnostic LED sequence on a server rack, or the exact geometry of a cracked mechanical gasket, is an inherently lossy process.

To bridge this gap, real-time communications architecture is shifting from voice-only interactions to fully multimodal streaming. By feeding an AI agent a synchronized stream of audio and video frames, we unlock "Look and Talk" capabilities. The agent can perceive the user's environment, eliminating the need for the user to meticulously describe their surroundings. Implementing this at scale, however, is a formidable distributed systems challenge. Processing continuous video streams against a Large Language Model introduces massive context windows, severe bandwidth constraints, and the constant threat of bufferbloat. This article details how to architect a production-grade, low-latency multimodal AI agent using the Google Gemini Live API, navigating the complexities of bidirectional WebSockets, Python proxy orchestration, and dynamic backpressure management.


We are wrapping up our WebRTC series, and we would love your feedback, along with requests for topics you want the next series to cover.

But beyond the architecture diagrams and latency graphs lies something more meaningful. We are not just optimizing packets—we are reducing friction between human intent and machine understanding. Every dropped frame we prevent, every millisecond we shave off latency, brings us closer to interactions that feel less like operating software and more like collaborating with intelligence. Multimodal systems are not simply a technical upgrade; they are a step toward AI that genuinely sees and listens alongside us.


Gemini Live vs OpenAI Realtime: Architecture and Capability Differences

When designing a streaming AI pipeline, architects are currently presented with two primary paradigms: the WebRTC-centric model utilized by OpenAI’s Realtime API, and the WebSocket-centric bidirectional streaming model employed by the Google Gemini Live API.

OpenAI’s architecture leans heavily on WebRTC, pushing for a direct-to-edge pattern in which the client browser establishes a peer-to-peer UDP connection directly with OpenAI’s media servers. This is exceptionally efficient for minimizing audio latency and handling NAT traversal, but it makes intercepting the media stream for server-side inspection, recording, or moderation exceedingly difficult. WebRTC's complexity also makes it harder to inject custom video frame extraction into the stream, since the pipeline relies on standard media track encoding.

Conversely, the Gemini Live API utilizes a persistent, bidirectional WebSocket connection (TCP). While TCP introduces the theoretical risk of head-of-line blocking under heavy packet loss, it provides a significantly more controllable architectural boundary. WebSockets allow backend engineers to easily deploy a Python middlebox (proxy) that terminates the client connection, authenticates the user, and orchestrates the media relay to Google's endpoints. Furthermore, Gemini’s streaming API accepts explicit JSON envelopes containing base64-encoded or raw binary chunks of interleaved audio and image data. This explicit framing gives the frontend developer absolute deterministic control over the frame sampling rate, resolution, and compression of the video feed before it ever hits the network, a crucial capability for managing the massive token consumption of video LLMs.

Figure: comparison between the OpenAI Realtime WebRTC flow and the Gemini WebSocket streaming flow.


Designing a Multimodal Streaming Pipeline (Audio + Video over WebSockets)

The client-side streaming pipeline must continuously capture microphone audio and camera video, encode them into the highly specific formats expected by the Gemini API, and multiplex them over a single WebSocket connection. Gemini expects audio as raw PCM16 (16-bit, 16kHz or 24kHz, mono) and video as a sequence of independent JPEG or WebP images.

Capturing audio for this pipeline requires bypassing the standard MediaRecorder API, which outputs compressed chunks (like WebM or AAC) that Gemini cannot natively ingest in its real-time socket. Instead, we must utilize the Web Audio API with an AudioWorklet to extract raw PCM audio buffers, downsample them to 16kHz, and convert them to Int16 arrays.
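
The worklet file itself is rarely shown, so here is a minimal sketch of what `pcm-processor.js` could contain. It assumes the `AudioContext` was created with `sampleRate: 16000`, so the browser has already resampled the microphone input and the worklet only needs to convert Float32 samples to Int16; the processor name `pcm-processor` matches the registration in the capture code.

```javascript
// pcm-processor.js — minimal sketch of the AudioWorklet processor.
// Assumes the AudioContext was created with sampleRate: 16000, so no
// downsampling is needed inside the worklet itself.

// Convert a Float32 buffer in [-1, 1] to 16-bit signed PCM.
function floatTo16BitPCM(float32) {
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp to valid range
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return int16;
}

// Register the processor only inside a real AudioWorklet scope.
if (typeof AudioWorkletProcessor !== 'undefined') {
  class PCMProcessor extends AudioWorkletProcessor {
    process(inputs) {
      const channel = inputs[0] && inputs[0][0]; // mono: first channel of first input
      if (channel) {
        this.port.postMessage(floatTo16BitPCM(channel));
      }
      return true; // keep the processor alive
    }
  }
  registerProcessor('pcm-processor', PCMProcessor);
}
```

Each `process()` callback hands the main thread a small Int16 chunk (128 frames per render quantum), which the capture code then base64-encodes and ships over the WebSocket.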

Simultaneously, we must extract frames from the video track. We cannot stream 30 frames per second; doing so would exhaust both the client's upload bandwidth and the LLM's context window within seconds. We implement an adaptive sampling loop—typically capturing 1 frame per second for static scenes, bursting to 3 frames per second when the user moves the camera. We achieve this using the modern MediaStreamTrackProcessor API (part of WebCodecs) or a hidden Canvas element to draw the video frame, downscale it to 512x512 resolution to limit token consumption, and compress it to a tightly optimized JPEG.

// Conceptual Multimodal Capture and WebSocket Transmission
// (assumes a module context, so top-level await is available)
const ws = new WebSocket('wss://your-python-proxy.internal/stream');

// Helper: base64-encode an ArrayBuffer for the JSON envelope
function arrayBufferToBase64(buffer) {
    let binary = '';
    const bytes = new Uint8Array(buffer);
    for (let i = 0; i < bytes.length; i++) {
        binary += String.fromCharCode(bytes[i]);
    }
    return btoa(binary);
}

// 1. Audio Setup (PCM16 via AudioWorklet)
const audioContext = new AudioContext({ sampleRate: 16000 });
const stream = await navigator.mediaDevices.getUserMedia({
    audio: true,
    video: { facingMode: "environment" }
});
const source = audioContext.createMediaStreamSource(stream);
await audioContext.audioWorklet.addModule('pcm-processor.js');
const pcmNode = new AudioWorkletNode(audioContext, 'pcm-processor');
source.connect(pcmNode);

pcmNode.port.onmessage = (event) => {
    // event.data is an Int16Array of PCM samples posted by the worklet
    const base64Audio = arrayBufferToBase64(event.data.buffer);
    ws.send(JSON.stringify({
        realtime_input: {
            media_chunks: [{
                mime_type: "audio/pcm;rate=16000",
                data: base64Audio
            }]
        }
    }));
};

// 2. Video Frame Extraction Loop (Canvas Fallback Method)
const video = document.createElement('video');
video.srcObject = stream;
await video.play();
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
canvas.width = 512;
canvas.height = 512;

setInterval(() => {
    ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
    // Strip the "data:image/jpeg;base64," prefix; 0.6 quality keeps frames ~30KB
    const base64Image = canvas.toDataURL('image/jpeg', 0.6).split(',')[1];
    ws.send(JSON.stringify({
        realtime_input: {
            media_chunks: [{
                mime_type: "image/jpeg",
                data: base64Image
            }]
        }
    }));
}, 1000); // Sample at 1 FPS


Figure: multimodal streaming pipeline: Browser → Audio Capture → Video Frame Extraction → Frame Encoding → WebSocket → Python Proxy → Gemini Live → Streaming Response.


Building the Python Proxy Layer (Authentication, Relay, Security Controls)

A fundamental security constraint of the Gemini Live API is that it requires a Google Cloud IAM Service Account or an API Key for authentication. Embedding this credential in the frontend JavaScript is a catastrophic security vulnerability. Therefore, we must introduce a Python proxy layer. This backend service is responsible for receiving the untrusted client WebSocket connection, validating an ephemeral session token (like a JWT), and then opening a secure, server-to-server WebSocket connection to generativelanguage.googleapis.com.

This proxy must operate as an asynchronous bidirectional relay. We utilize Python frameworks like FastAPI or Quart, leveraging asyncio to simultaneously read from the client and write to Gemini, while concurrently reading responses from Gemini and writing them back to the client. This is also the architectural boundary where we implement security controls: payload size limits, request rate limiting, and session duration enforcement.

# Conceptual FastAPI Proxy for Gemini Live API
import asyncio
import os
import json
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
import websockets

app = FastAPI()
GEMINI_WS_URL = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContent"
    f"?key={os.getenv('GEMINI_API_KEY')}"
)

@app.websocket("/stream")
async def gemini_proxy(client_ws: WebSocket, token: str):
    # 1. Validate the ephemeral auth token (arrives as a ?token= query parameter).
    # validate_jwt() is your own verification helper (implementation omitted).
    if not validate_jwt(token):
        await client_ws.close(code=1008)  # 1008 = policy violation
        return

    await client_ws.accept()

    # 2. Connect to Gemini Live
    async with websockets.connect(GEMINI_WS_URL) as gemini_ws:

        # Initial setup message required by Gemini before any media flows
        setup_msg = {"setup": {"model": "models/gemini-1.5-pro"}}
        await gemini_ws.send(json.dumps(setup_msg))

        async def client_to_gemini():
            try:
                while True:
                    data = await client_ws.receive_text()
                    # Optional: inspect or log the client payload here
                    await gemini_ws.send(data)
            except WebSocketDisconnect:
                pass

        async def gemini_to_client():
            try:
                while True:
                    response = await gemini_ws.recv()
                    # Optional: intercept tool calls here
                    await client_ws.send_text(response)
            except websockets.exceptions.ConnectionClosed:
                pass

        # 3. Run the bidirectional relay; when either direction ends,
        # cancel the other so both sockets can be cleaned up.
        tasks = [
            asyncio.create_task(client_to_gemini()),
            asyncio.create_task(gemini_to_client()),
        ]
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()


Figure: secure backend proxy architecture: Client → Auth Layer → Ephemeral Session Token → Proxy Relay → Gemini API → Tool Execution → Backend Services.


Tool Calling and Long Context Handling in Gemini

The true power of an AI agent emerges when it can take action based on what it sees. In the Gemini streaming protocol, tool orchestration requires precise state machine management within our Python proxy. When the multimodal model identifies a component in the video stream—for instance, recognizing a specific barcode on a shipping label—and decides to query an internal database, it halts its audio output and emits a functionCall payload over the WebSocket.

The Python proxy must intercept this payload rather than blindly forwarding it to the client. The proxy pauses the upstream media relay, executes the requested Python function against your internal backend services, formats the result into a functionResponse envelope, and transmits it back to the Gemini WebSocket. Once Gemini ingests the function response, it resumes streaming synthesized audio to the client, explaining the results of the database query.
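
As a sketch of that interception step, the helper below parses a Gemini message and, if it carries a tool call, runs the matching local function and builds the response envelope. The field names (`toolCall`, `functionCalls`, `toolResponse`, `functionResponses`) and the `lookup_inventory` tool are illustrative assumptions; verify the exact envelope shape against the current Live API reference before relying on it.

```python
import json

# Hypothetical registry of backend tools the proxy can execute locally.
TOOL_REGISTRY = {
    "lookup_inventory": lambda args: {"status": "found", "bin": "A7"},
}

def maybe_build_tool_response(raw_message: str, registry: dict):
    """If the Gemini message carries a tool call, execute it and return the
    functionResponse envelope to send back upstream; otherwise return None
    so the relay forwards the message to the client unchanged.
    The envelope field names here are assumptions for illustration."""
    msg = json.loads(raw_message)
    tool_call = msg.get("toolCall")
    if not tool_call:
        return None
    responses = []
    for call in tool_call.get("functionCalls", []):
        fn = registry.get(call["name"])
        result = fn(call.get("args", {})) if fn else {"error": "unknown tool"}
        responses.append({
            "id": call.get("id"),
            "name": call["name"],
            "response": result,
        })
    return {"toolResponse": {"functionResponses": responses}}
```

Inside `gemini_to_client`, the relay would call this helper on each message: if it returns an envelope, send it to `gemini_ws` instead of forwarding to the client, which is exactly the pause-execute-resume cycle described above.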

Managing the context window is another critical engineering challenge. While Gemini 1.5 Pro boasts a massive context window capable of holding millions of tokens, streaming 1 frame per second alongside continuous audio will inevitably cause token accumulation. As the context window grows, inference latency increases, and API costs scale linearly. Architects must implement a session lifecycle policy. For prolonged sessions, the proxy should periodically invoke a summarization tool to compress the conversation history, close the active WebSocket, and transparently open a new connection seeded with the summarized context and the most recent video frames.
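
One way to implement such a lifecycle policy is a token-budget tracker inside the proxy. The per-image and per-second costs below are rough illustrative assumptions, not official figures; the point is the rotation trigger, not the constants.

```python
# Illustrative session token-budget tracker for the proxy.
# tokens_per_image and tokens_per_audio_sec are assumed constants,
# not official Gemini accounting figures.
class SessionBudget:
    def __init__(self, max_tokens=200_000,
                 tokens_per_image=258, tokens_per_audio_sec=32):
        self.max_tokens = max_tokens
        self.tokens_per_image = tokens_per_image
        self.tokens_per_audio_sec = tokens_per_audio_sec
        self.used = 0

    def add_image(self, count=1):
        # Each sampled video frame consumes a fixed token cost
        self.used += self.tokens_per_image * count

    def add_audio(self, seconds):
        # Continuous audio accrues tokens proportional to duration
        self.used += int(self.tokens_per_audio_sec * seconds)

    def should_rotate(self):
        # True once the budget is exhausted: summarize, close the
        # WebSocket, and reopen a fresh session seeded with the summary
        return self.used >= self.max_tokens
```

When `should_rotate()` returns True, the proxy would invoke its summarization tool, tear down the Gemini socket, and transparently reconnect with the summary and the most recent frames as seed context.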


Latency, Bandwidth, and Backpressure Engineering

Streaming continuous video to a cloud LLM is entirely different from streaming to a media server like Janus or Mediasoup. An LLM cannot skip packets or gracefully degrade via simulcast layers; it requires complete, uncorrupted image frames. If you attempt to stream raw high-resolution video, you will immediately encounter the physics of network bottlenecks.

If a client attempts to upload 1080p video at 30 frames per second over a WebSocket, the TCP send buffer will rapidly fill. This phenomenon, known as bufferbloat, means that the critical audio packets containing the user's speech are trapped in a queue behind megabytes of video data. A question spoken by the user might take four seconds to physically leave the device, destroying the conversational illusion.

To solve this, we must engineer strict backpressure mechanisms. In the JavaScript client, before executing ws.send() for a video frame, we must inspect the ws.bufferedAmount property. If the buffered amount exceeds a specific threshold—for example, 64 kilobytes—we explicitly drop the video frame.

// Backpressure Video Drop Logic
const MAX_BUFFER_SIZE = 64 * 1024; // 64KB

function sendVideoFrame(base64Image) {
    if (ws.bufferedAmount > MAX_BUFFER_SIZE) {
        console.warn("Network congested. Dropping video frame to preserve audio latency.");
        return; // Drop frame
    }

    ws.send(JSON.stringify({
        realtime_input: {
            media_chunks: [{
                mime_type: "image/jpeg",
                data: base64Image
            }]
        }
    }));
}


This enforces a strict prioritization: audio (the conversational turn) is sacrosanct and must always traverse the network immediately, while video is treated as an opportunistic enhancement. By keeping frames compressed to roughly 30KB (512x512 JPEG) and sampling at 1 FPS, the total video payload requires less than 300 kbps of sustained upload bandwidth, well within the capacity of modern cellular networks.
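
The back-of-envelope arithmetic behind those bandwidth figures, including the raw audio channel for comparison (base64 framing adds roughly a 4/3 overhead on top of both):

```python
# Sustained upload for 1 FPS of ~30 KB JPEG frames
frame_bytes = 30 * 1024                      # ~30 KB per 512x512 JPEG
fps = 1
video_kbps = frame_bytes * 8 * fps / 1000    # ~245.8 kbps, under the 300 kbps budget

# Raw PCM16 audio at 16 kHz mono, before base64 framing overhead
audio_kbps = 16000 * 2 * 8 / 1000            # 256.0 kbps
```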


Production Use Cases: Visual Technical Support and Inspection Workflows

The architectural complexity of multimodal streaming pays immediate dividends in specialized production environments. Consider a visual technical support workflow for telecommunications field technicians. When a technician opens the application, the Gemini agent begins ingesting the camera feed. The technician points their phone at a dense fiber optic routing chassis and says, "Which port is throwing the optical fault?"

Because the agent has continuous visual context, it processes the image frames, reads the specific serial numbers off the chassis via OCR, cross-references them by invoking an internal inventory database tool via the Python proxy, and responds in real-time audio: "The fault is on the third transceiver from the left, serial ending in 402. I'm pulling up the replacement procedure now." This workflow entirely eliminates the tedious diagnostic Q&A process that plagues traditional voice-only support lines. Similar workflows are actively being deployed for remote medical triaging, real estate property inspections, and complex mechanical QA processes.


Failure Modes and Observability

In a stateful, bidirectional streaming architecture, failure handling must be deterministic. The most common failure mode is a transient network partition dropping the WebSocket connection. The client must implement exponential backoff reconnection logic. However, because the LLM context is tied to the lifecycle of the WebSocket on Google's edge, a dropped connection means lost context. The Python proxy must therefore maintain an asynchronous shadow log of the conversation transcript. Upon reconnection, the proxy rapidly rehydrates the Gemini session by injecting the shadow log before bridging the client media stream, ensuring the user does not have to repeat themselves.
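
A minimal sketch of the client-side reconnect schedule, assuming a capped exponential curve with jitter (the base, cap, and jitter constants are illustrative):

```javascript
// Capped exponential backoff: attempt 0 → 500ms, 1 → 1s, 2 → 2s, ... max 30s
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Reconnect loop; on reopen, the proxy (not the client) rehydrates
// the Gemini session from its shadow transcript log.
function connectWithRetry(url, attempt = 0) {
  const ws = new WebSocket(url);
  ws.onopen = () => {
    attempt = 0; // reset the schedule once a connection succeeds
  };
  ws.onclose = () => {
    const jitter = Math.random() * 250; // spread out reconnect storms
    setTimeout(
      () => connectWithRetry(url, attempt + 1),
      backoffDelayMs(attempt) + jitter
    );
  };
  return ws;
}
```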

Observability must focus on the "Time to First Audio Byte" (TTFAB) metric. This requires instrumenting the Python proxy to record the timestamp when the user's speech completes (often signaled by a local VAD or an end-of-utterance marker) and measuring the delta until the first byte of synthesized audio is returned from the Gemini API. Tracking the drift between TTFAB and video frame upload rate is the primary indicator of impending network congestion.
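
A hypothetical proxy-side helper for that metric might look like this; the relay loops supply the timestamps (end-of-utterance marker in, first synthesized audio chunk out), and the tracker records one TTFAB sample per conversational turn.

```python
import time

class TTFABTracker:
    """Tracks Time to First Audio Byte per conversational turn.
    Illustrative sketch: timestamps come from the proxy's relay loops."""

    def __init__(self):
        self._utterance_end = None
        self.samples = []  # one TTFAB measurement per completed turn

    def mark_utterance_end(self, ts=None):
        # Called when VAD or an end-of-utterance marker fires
        self._utterance_end = ts if ts is not None else time.monotonic()

    def mark_first_audio_byte(self, ts=None):
        # Called when the first synthesized audio chunk returns from Gemini
        if self._utterance_end is None:
            return None  # audio without a pending turn; ignore
        ts = ts if ts is not None else time.monotonic()
        delta = ts - self._utterance_end
        self.samples.append(delta)
        self._utterance_end = None  # arm for the next turn
        return delta
```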


Reference Architecture Blueprint

To deploy this system securely and reliably, adhere to the following architectural blueprint:

  1. Browser Capture Layer: Utilizes AudioWorklet for PCM16 extraction and MediaStreamTrackProcessor (or hidden Canvas) for extracting video frames at an adaptive 1 to 3 FPS rate.
  2. Frame Preprocessing: Downscales video frames to 512x512 and applies aggressive JPEG compression to strictly bound the byte size per frame.
  3. WebSocket Transport: Multiplexes JSON envelopes containing base64 audio and video chunks over a single secure WebSocket, applying local backpressure based on bufferedAmount.
  4. Python Proxy Orchestration: A highly concurrent ASGI application (FastAPI/Quart) terminates the client connection, validates identity, and maintains the server-to-server persistent connection to the Gemini API.
  5. Tool Execution Loop: Intercepts functionCall payloads from the LLM, halts upstream media relay, executes internal microservices, and injects the functionResponse back into the stream.
  6. Failure Isolation Boundaries: Decouples the frontend connection lifecycle from the backend-to-LLM lifecycle, allowing the proxy to gracefully handle transient client disconnects without destroying the costly LLM context window.

Conclusion – The Future of “Look and Talk” Systems

The integration of concurrent video and audio streaming into Large Language Models represents a fundamental evolution in human-computer interaction. We are moving past the era of the blind voice assistant. By architecting robust multimodal streaming pipelines using the Gemini Live API, backend engineers can provide AI agents with visual grounding, allowing them to solve complex spatial and environmental problems in real-time. While the challenges of bandwidth optimization, backpressure management, and stateful proxy design are substantial, mastering these patterns is essential for building the next generation of truly context-aware AI infrastructure.

Yet the true significance of “Look and Talk” systems is not merely technical sophistication—it is empathy at scale. When an AI can see what we see and hear what we hear, it reduces the cognitive burden on the user. It removes the need to translate lived reality into imperfect words. In doing so, it shifts AI from being a tool we command to a partner that understands context naturally.

As we close this WebRTC series, we invite you to reflect on where real-time systems are headed next. What challenges are you facing in production? What emerging protocols, architectures, or AI patterns should we explore in the next deep dive? The future of interactive AI is being written in real time—and we would love to build the next chapter together.
