Real-time communication with AI models opens up exciting possibilities for creating interactive applications. Google's Gemini models now support live, multimodal interactions. In this article, I'll walk through how to use this powerful capability to build responsive, voice-enabled applications with the Google_GenerativeAI SDK for C# and .NET.
Introducing the Multimodal Live API
The Google_GenerativeAI SDK has recently expanded with the Google_GenerativeAI.Live package, which provides a seamless interface to the Google Multimodal Live API. This implementation leverages WebSockets to enable real-time, bidirectional communication with Gemini models, allowing for dynamic exchange of both text and audio data.
What makes this API particularly valuable is its ability to handle multiple modalities simultaneously, making it perfect for applications that require natural conversations through text, voice, or a combination of both.
Key Features at a Glance
The Google_GenerativeAI.Live package comes with a robust set of features designed to support sophisticated application development:
- Real-time, bidirectional communication for truly interactive experiences
- Support for multiple response modalities (text and audio)
- Fully asynchronous operations to maintain application responsiveness
- Event-driven architecture for handling various stages of interaction
- Built-in audio streaming capabilities with configurable parameters
- Custom tool integration to extend functionality
- Comprehensive error handling and reconnection support
- Flexible configuration options for generation settings, safety parameters, and system instructions
Getting Started
Let's walk through the process of setting up and using the Multimodal Live API in your C# application.
Installation
First, install the required NuGet packages (NAudio is used in the examples below for microphone capture and audio playback):
Install-Package Google_GenerativeAI.Live
Install-Package NAudio
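Or, if you prefer the .NET CLI:
dotnet add package Google_GenerativeAI.Live
dotnet add package NAudio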
Basic Setup and Usage
Here's how to create a basic implementation that handles both text and audio:
// Create the client with the desired configuration
var client = new MultiModalLiveClient(
platformAdapter: new GoogleAIPlatformAdapter(),
modelName: "gemini-1.5-flash-exp",
generationConfig: new GenerationConfig
{
ResponseModalities = { Modality.TEXT, Modality.AUDIO }
},
safetySettings: null,
systemInstruction: "You are a helpful assistant."
);
// Subscribe to events
client.Connected += (s, e) => Console.WriteLine("Connected to Gemini!");
client.TextChunkReceived += (s, e) => Console.WriteLine($"Gemini: {e.TextChunk}");
client.AudioChunkReceived += (s, e) => Console.WriteLine($"Audio received: {e.Buffer.Length} bytes");
// Connect and send messages
await client.ConnectAsync();
await client.SentTextAsync("Hello, Gemini!");
Streaming Audio in Real-Time
One of the most powerful features of the Multimodal Live API is the ability to stream audio directly to Gemini as it's being captured, without waiting for a complete recording:
// Set up real-time audio streaming using NAudio
_waveIn = new WaveInEvent { WaveFormat = new WaveFormat(16000, 1) };
_waveIn.DataAvailable += async (s, e) => {
// Send audio chunks as they arrive
if (_isListening)
{
await _client.SendAudioAsync(
audioData: e.Buffer,
audioContentType: "audio/pcm; rate=16000"
);
}
};
// Start listening and streaming
public void StartRealTimeListening()
{
_isListening = true;
_waveIn.StartRecording();
Console.WriteLine("Streaming audio to Gemini in real-time...");
}
// Stop listening
public void StopListening()
{
_isListening = false;
_waveIn.StopRecording();
}
This approach creates a much more responsive experience, as the model can begin processing speech before the user has finished talking. It's particularly useful for creating natural-feeling voice assistants that can provide feedback with minimal latency.
Real-Time Audio Playback
To complement real-time audio streaming, here's how to set up immediate playback of Gemini's audio responses:
private BufferedWaveProvider _bufferedWaveProvider;
private WaveOutEvent _waveOut;
// Setup audio playback system
private void InitializeAudioPlayback()
{
var waveFormat = new WaveFormat(16000, 16, 1);
_bufferedWaveProvider = new BufferedWaveProvider(waveFormat);
_waveOut = new WaveOutEvent();
_waveOut.Init(_bufferedWaveProvider);
}
// Handle incoming audio chunks for immediate playback
client.AudioChunkReceived += (s, e) => {
// Add received audio to buffer
_bufferedWaveProvider.AddSamples(e.Buffer, 0, e.Buffer.Length);
// Start playback if not already playing
if (_waveOut.PlaybackState != PlaybackState.Playing)
{
_waveOut.Play();
}
};
// Handle playback completion
client.AudioReceiveCompleted += (s, e) => {
Console.WriteLine("Audio response complete");
// Optional: wait for buffered audio to finish playing
};
This approach creates a seamless conversational experience by playing audio responses as soon as they arrive, rather than waiting for the complete response. The BufferedWaveProvider handles the streaming nature of the audio data, ensuring smooth playback even as new chunks are being received.
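If long responses overflow the playback buffer, the provider can be given more headroom when it's created. Here's a quick sketch using NAudio's BufferDuration and DiscardOnBufferOverflow properties; the values themselves are purely illustrative:
// Give the playback buffer extra headroom for long responses (example values)
_bufferedWaveProvider = new BufferedWaveProvider(new WaveFormat(16000, 16, 1))
{
    BufferDuration = TimeSpan.FromSeconds(30),   // NAudio's default is 5 seconds
    DiscardOnBufferOverflow = true               // silently drop samples that don't fit instead of throwing
};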
Event-Driven Architecture
The MultiModalLiveClient exposes several key events:
// Core connection events
client.Connected += (s, e) => Console.WriteLine("Connected!");
client.Disconnected += (s, e) => Console.WriteLine("Disconnected");
client.ErrorOccurred += (s, e) => Console.WriteLine($"Error: {e.Exception.Message}");
// Data events
client.TextChunkReceived += (s, e) => ProcessTextChunk(e.TextChunk);
client.AudioChunkReceived += (s, e) => StoreAudioChunk(e.Buffer);
client.AudioReceiveCompleted += (s, e) => PlayStoredAudio();
This event-based approach makes it easy to build responsive applications that can handle the asynchronous nature of real-time communication.
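These same connection events can also drive the reconnection support mentioned earlier. Here's a minimal sketch that re-establishes the session when it drops; the Disconnected event and ConnectAsync come from the client shown above, but the three-attempt backoff policy is purely illustrative:
// Try to re-establish the session when the connection drops
// (the retry/backoff policy below is an illustrative example, not part of the SDK)
client.Disconnected += async (s, e) =>
{
    for (var attempt = 1; attempt <= 3; attempt++)
    {
        try
        {
            await Task.Delay(TimeSpan.FromSeconds(attempt * 2));
            await client.ConnectAsync();
            Console.WriteLine("Reconnected to Gemini.");
            return;
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Reconnect attempt {attempt} failed: {ex.Message}");
        }
    }
    Console.WriteLine("Giving up after 3 reconnect attempts.");
};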
Advanced Features and Configurations
Customize how Gemini generates responses:
// Generation config
var generationConfig = new GenerationConfig
{
Temperature = 0.7f,
TopP = 0.95f,
ResponseModalities = { Modality.TEXT, Modality.AUDIO }
};
// Safety settings
var safetySettings = new List<SafetySetting>
{
new SafetySetting
{
Category = HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
Threshold = HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
}
};
// System instructions
var systemInstruction = "You are a helpful technical support assistant.";
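These options plug into the MultiModalLiveClient constructor shown earlier, for example:
var client = new MultiModalLiveClient(
    platformAdapter: new GoogleAIPlatformAdapter(),
    modelName: "gemini-1.5-flash-exp",
    generationConfig: generationConfig,
    safetySettings: safetySettings,
    systemInstruction: systemInstruction
);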
Building a Simple Voice Assistant
Here's a quick example bringing it all together in a voice assistant application:
public class SimpleVoiceAssistant
{
private MultiModalLiveClient _client;
private WaveInEvent _waveIn;
private BufferedWaveProvider _bufferedWaveProvider;
private WaveOutEvent _waveOut;
private bool _isListening = false;
public async Task InitializeAsync()
{
// Initialize client
_client = new MultiModalLiveClient(
platformAdapter: new GoogleAIPlatformAdapter(),
modelName: "gemini-1.5-flash-exp",
generationConfig: new GenerationConfig
{
ResponseModalities = { Modality.TEXT, Modality.AUDIO }
},
safetySettings: null,
systemInstruction: "You are a helpful voice assistant."
);
// Set up events
SetupEvents();
// Initialize audio
SetupAudio();
// Connect
await _client.ConnectAsync();
}
private void SetupEvents()
{
_client.TextChunkReceived += (s, e) => Console.WriteLine($"Gemini: {e.TextChunk}");
_client.AudioChunkReceived += HandleAudioChunk;
}
private void SetupAudio()
{
// Input setup
_waveIn = new WaveInEvent { WaveFormat = new WaveFormat(16000, 1) };
_waveIn.DataAvailable += HandleMicrophoneData;
// Output setup
var waveFormat = new WaveFormat(16000, 16, 1);
_bufferedWaveProvider = new BufferedWaveProvider(waveFormat);
_waveOut = new WaveOutEvent();
_waveOut.Init(_bufferedWaveProvider);
}
private async void HandleMicrophoneData(object sender, WaveInEventArgs e)
{
if (_isListening)
{
await _client.SendAudioAsync(e.Buffer, "audio/pcm; rate=16000");
}
}
private void HandleAudioChunk(object sender, AudioChunkEventArgs e)
{
_bufferedWaveProvider.AddSamples(e.Buffer, 0, e.Buffer.Length);
if (_waveOut.PlaybackState != PlaybackState.Playing)
{
_waveOut.Play();
}
}
public void StartListening()
{
_isListening = true;
_waveIn.StartRecording();
}
public void StopListening()
{
_isListening = false;
_waveIn.StopRecording();
}
}
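To try the class from a console application, a minimal driver might look like this (the push-to-talk loop is just an example):
public static class Program
{
    public static async Task Main()
    {
        var assistant = new SimpleVoiceAssistant();
        await assistant.InitializeAsync();
        while (true)
        {
            Console.WriteLine("Press Enter to talk, or type 'q' and Enter to quit.");
            if (string.Equals(Console.ReadLine()?.Trim(), "q", StringComparison.OrdinalIgnoreCase))
                break;
            assistant.StartListening();
            Console.WriteLine("Listening... press Enter to stop.");
            Console.ReadLine();
            assistant.StopListening();
        }
    }
}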
Check out the complete WPF Sample with Source Code
Conclusion
The Google_GenerativeAI.Live package brings powerful real-time, multimodal capabilities to C# developers. By leveraging WebSockets and an event-driven architecture, it enables the creation of highly interactive applications that can seamlessly blend text and voice interactions with Gemini's advanced AI capabilities.
Whether you're building a voice assistant, an interactive chat application, or a multimodal customer service solution, this API provides the foundation you need to create responsive, engaging experiences.
Have you built something interesting with the Multimodal Live API? I'd love to hear about your experiences in the comments!