Today, we’re launching Gemini 3.1 Flash Live via the Gemini Live API in Google AI Studio. Gemini 3.1 Flash Live enables developers to build real-time voice and vision agents that not only process the world around them, but also respond at the speed of conversation.
This model delivers a step change in latency, reliability and natural-sounding dialogue, with the quality needed for the next generation of voice-first AI.
Experience enhanced latency, reliability and quality
For real-time interactions, every millisecond of added latency chips away at the natural conversational flow users expect. The new model better understands tone, emphasis and intent, bringing key improvements to agents:
- Higher task completion rates in noisy, real-world environments: We’ve significantly improved the model’s ability to trigger external tools and deliver information during live conversations. By better discerning relevant speech from environmental sounds like traffic or television, the model more effectively filters out background noise to remain reliable and responsive to instructions.
- Better instruction-following: Adherence to complex system instructions has been boosted significantly. Your agent will stay within its operational guardrails, even when conversations take unexpected turns.
- More natural and low-latency dialogue: The latest model improves on latency and is even more effective at recognizing acoustic nuances like pitch and pace compared to 2.5 Flash Native Audio, making real-time conversations feel a lot more fluid and natural.
- Multi-lingual capabilities: The model supports more than 90 languages for real-time multi-modal conversations.
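Guardrails and language behavior like those above are typically expressed through the session configuration passed when connecting. A rough sketch, assuming the Live API's dictionary-style config (the system instruction text and language choice are illustrative, not required values):

```python
# Illustrative Live API session config. The system_instruction wording and
# the language_code are example values, not requirements.
config = {
    "response_modalities": ["AUDIO"],
    # Guardrails: complex system instructions the agent should stay within
    "system_instruction": (
        "You are a customer-support voice agent. "
        "Only discuss order status; politely decline other topics."
    ),
    # One of the 90+ supported languages for spoken output
    "speech_config": {"language_code": "es-ES"},
}
```

The same dictionary is then passed as the `config` argument to `client.aio.live.connect`, as shown in the snippet at the end of this post.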
Build with an expanding ecosystem of integrations
The Live API is built for production environments, but real-world systems require handling of diverse inputs, from live video streams to on-demand phone calls.
For systems that require WebRTC scaling or global edge routing, we recommend exploring our partner integrations to streamline the development of real-time voice and video agents.
- LiveKit — Use the Gemini Live API with LiveKit Agents.
- Pipecat by Daily — Create a real-time AI chatbot using Gemini Live and Pipecat.
- Fishjam by Software Mansion — Create live video and audio streaming applications with Fishjam.
- Vision Agents by Stream — Build real-time voice and video AI applications with Vision Agents.
- Voximplant — Connect inbound and outbound calls to Live API with Voximplant.
- Firebase AI SDK — Get started with the Gemini Live API using Firebase AI Logic.
Get started with the Live API
Gemini 3.1 Flash Live is available starting today via the Gemini API and in Google AI Studio. Developers can use the Gemini Live API to integrate the model into their application.
Explore our developer documentation to learn how you can build real-time agents:
- Gemini Live API documentation: Explore features like multilingual support, tool use and function calling, session management (for managing long-running conversations) and ephemeral tokens.
- Gemini Live API examples: Get inspiration for the kind of voice experiences you can build today with the model.
- Gemini Live API Skill: For coding agents to learn and build with the Live API.
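To give a flavor of the tool use and function calling covered in the docs above: tools are declared as schemas and attached to the session config, so the model can trigger them mid-conversation. A minimal sketch, where the `get_weather` function, its description and its parameter schema are made up for illustration:

```python
# Hypothetical function declaration for tool use; the name, description
# and parameter schema are illustrative examples.
get_weather = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Tool declarations ride along in the same config used to open a session.
config = {
    "response_modalities": ["AUDIO"],
    "tools": [{"function_declarations": [get_weather]}],
}
```

During a live conversation, the model can then emit a call to the declared function, and your application returns the result into the session.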
Get started with the Google GenAI SDK:
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

model = "gemini-3.1-flash-live-preview"
config = {"response_modalities": ["AUDIO"]}

async def main():
    # Open a bidirectional, real-time session with the Live API
    async with client.aio.live.connect(model=model, config=config) as session:
        print("Session started")
        # Send content...

if __name__ == "__main__":
    asyncio.run(main())
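Real-time audio input to the Live API is sent as a stream of raw 16-bit PCM, so applications usually slice captured audio into small fixed-size frames before sending each one over the session. A minimal, dependency-free sketch of that chunking step (the frame size here is an arbitrary example, not a required value):

```python
def chunk_pcm(audio: bytes, frame_bytes: int = 1024) -> list[bytes]:
    """Split raw 16-bit PCM audio into fixed-size frames for streaming.

    The final frame may be shorter than frame_bytes; a real sender would
    forward each frame to the live session as soon as it is captured.
    """
    return [audio[i:i + frame_bytes] for i in range(0, len(audio), frame_bytes)]

# Example: 2.5 frames' worth of silence (zero-valued samples)
frames = chunk_pcm(b"\x00" * 2560, frame_bytes=1024)
```

In a production agent, this loop would run continuously against microphone input, which is what makes the sub-conversational latency of the model matter.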