Real-time communication with AI models opens up exciting possibilities for creating interactive applications. Google's Gemini models now support live, multimodal interactions. In this article, I'll walk through how to use this powerful capability to build responsive, voice-enabled applications with the Google_GenerativeAI SDK for C# and .NET.
Introducing the Multimodal Live API
The Google_GenerativeAI SDK has recently expanded with the Google_GenerativeAI.Live package, which provides a seamless interface to the Google Multimodal Live API. This implementation leverages WebSockets to enable real-time, bidirectional communication with Gemini models, allowing for dynamic exchange of both text and audio data.
What makes this API particularly valuable is its ability to handle multiple modalities simultaneously, making it perfect for applications that require natural conversations through text, voice, or a combination of both.
Key Features at a Glance
The Google_GenerativeAI.Live package comes with a robust set of features designed to support sophisticated application development:
- Real-time, bidirectional communication for truly interactive experiences
- Support for multiple response modalities (text and audio)
- Fully asynchronous operations to maintain application responsiveness
- Event-driven architecture for handling various stages of interaction
- Built-in audio streaming capabilities with configurable parameters
- Custom tool integration to extend functionality
- Comprehensive error handling and reconnection support
- Flexible configuration options for generation settings, safety parameters, and system instructions
Getting Started
Let's walk through the process of setting up and using the Multimodal Live API in your C# application.
Installation
First, install the required NuGet packages (NAudio is used in the examples below for microphone capture and audio playback):
Install-Package Google_GenerativeAI.Live
Install-Package NAudio
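Or, if you prefer the .NET CLI:
dotnet add package Google_GenerativeAI.Live
dotnet add package NAudio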
Basic Setup and Usage
Here's how to create a basic implementation that handles both text and audio:
// Create the client with the desired configuration
var client = new MultiModalLiveClient(
platformAdapter: new GoogleAIPlatformAdapter(),
modelName: "gemini-1.5-flash-exp",
generationConfig: new GenerationConfig
{
ResponseModalities = { Modality.TEXT, Modality.AUDIO }
},
safetySettings: null,
systemInstruction: "You are a helpful assistant."
);
// Subscribe to events
client.Connected += (s, e) => Console.WriteLine("Connected to Gemini!");
client.TextChunkReceived += (s, e) => Console.WriteLine($"Gemini: {e.TextChunk}");
client.AudioChunkReceived += (s, e) => Console.WriteLine($"Audio received: {e.Buffer.Length} bytes");
// Connect and send messages
await client.ConnectAsync();
await client.SentTextAsync("Hello, Gemini!");
Streaming Audio in Real-Time
One of the most powerful features of the Multimodal Live API is the ability to stream audio directly to Gemini as it's being captured, without waiting for a complete recording:
// Set up real-time audio streaming using NAudio
_waveIn = new WaveInEvent { WaveFormat = new WaveFormat(16000, 1) };
_waveIn.DataAvailable += async (s, e) => {
// Send audio chunks as they arrive
if (_isListening)
{
await _client.SendAudioAsync(
audioData: e.Buffer,
audioContentType: "audio/pcm; rate=16000"
);
}
};
// Start listening and streaming
public void StartRealTimeListening()
{
_isListening = true;
_waveIn.StartRecording();
Console.WriteLine("Streaming audio to Gemini in real-time...");
}
// Stop listening
public void StopListening()
{
_isListening = false;
_waveIn.StopRecording();
}
This approach creates a much more responsive experience, as the model can begin processing speech before the user has finished talking. It's particularly useful for creating natural-feeling voice assistants that can provide feedback with minimal latency.
Real-Time Audio Playback
To complement real-time audio streaming, here's how to set up immediate playback of Gemini's audio responses:
private BufferedWaveProvider _bufferedWaveProvider;
private WaveOutEvent _waveOut;
// Setup audio playback system
private void InitializeAudioPlayback()
{
var waveFormat = new WaveFormat(16000, 16, 1);
_bufferedWaveProvider = new BufferedWaveProvider(waveFormat);
_waveOut = new WaveOutEvent();
_waveOut.Init(_bufferedWaveProvider);
}
// Handle incoming audio chunks for immediate playback
client.AudioChunkReceived += (s, e) => {
// Add received audio to buffer
_bufferedWaveProvider.AddSamples(e.Buffer, 0, e.Buffer.Length);
// Start playback if not already playing
if (_waveOut.PlaybackState != PlaybackState.Playing)
{
_waveOut.Play();
}
};
// Handle playback completion
client.AudioReceiveCompleted += (s, e) => {
Console.WriteLine("Audio response complete");
// Optional: wait for buffered audio to finish playing
};
This approach creates a seamless conversational experience by playing audio responses as soon as they arrive, rather than waiting for the complete response. The BufferedWaveProvider handles the streaming nature of the audio data, ensuring smooth playback even as new chunks are being received.
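If long responses overflow the playback buffer, the provider can be given more headroom when it's created. Here's a quick sketch using NAudio's BufferDuration and DiscardOnBufferOverflow properties; the values themselves are purely illustrative:
// Give the playback buffer extra headroom for long responses (example values)
_bufferedWaveProvider = new BufferedWaveProvider(new WaveFormat(16000, 16, 1))
{
    BufferDuration = TimeSpan.FromSeconds(30),   // NAudio's default is 5 seconds
    DiscardOnBufferOverflow = true               // silently drop samples that don't fit instead of throwing
};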
Event-Driven Architecture
The MultiModalLiveClient exposes several key events:
// Core connection events
client.Connected += (s, e) => Console.WriteLine("Connected!");
client.Disconnected += (s, e) => Console.WriteLine("Disconnected");
client.ErrorOccurred += (s, e) => Console.WriteLine($"Error: {e.Exception.Message}");
// Data events
client.TextChunkReceived += (s, e) => ProcessTextChunk(e.TextChunk);
client.AudioChunkReceived += (s, e) => StoreAudioChunk(e.Buffer);
client.AudioReceiveCompleted += (s, e) => PlayStoredAudio();
This event-based approach makes it easy to build responsive applications that can handle the asynchronous nature of real-time communication.
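These same connection events can also drive the reconnection support mentioned earlier. Here's a minimal sketch that re-establishes the session when it drops; the Disconnected event and ConnectAsync come from the client shown above, but the three-attempt backoff policy is purely illustrative:
// Try to re-establish the session when the connection drops
// (the retry/backoff policy below is an illustrative example, not part of the SDK)
client.Disconnected += async (s, e) =>
{
    for (var attempt = 1; attempt <= 3; attempt++)
    {
        try
        {
            await Task.Delay(TimeSpan.FromSeconds(attempt * 2));
            await client.ConnectAsync();
            Console.WriteLine("Reconnected to Gemini.");
            return;
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Reconnect attempt {attempt} failed: {ex.Message}");
        }
    }
    Console.WriteLine("Giving up after 3 reconnect attempts.");
};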
Advanced Features and Configurations
Customize how Gemini generates responses:
// Generation config
var generationConfig = new GenerationConfig
{
Temperature = 0.7f,
TopP = 0.95f,
ResponseModalities = { Modality.TEXT, Modality.AUDIO }
};
// Safety settings
var safetySettings = new List<SafetySetting>
{
new SafetySetting
{
Category = HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
Threshold = HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
}
};
// System instructions
var systemInstruction = "You are a helpful technical support assistant.";
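These options plug into the MultiModalLiveClient constructor shown earlier, for example:
var client = new MultiModalLiveClient(
    platformAdapter: new GoogleAIPlatformAdapter(),
    modelName: "gemini-1.5-flash-exp",
    generationConfig: generationConfig,
    safetySettings: safetySettings,
    systemInstruction: systemInstruction
);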
Building a Simple Voice Assistant
Here's a quick example bringing it all together in a voice assistant application:
public class SimpleVoiceAssistant
{
private MultiModalLiveClient _client;
private WaveInEvent _waveIn;
private BufferedWaveProvider _bufferedWaveProvider;
private WaveOutEvent _waveOut;
private bool _isListening = false;
public async Task InitializeAsync()
{
// Initialize client
_client = new MultiModalLiveClient(
platformAdapter: new GoogleAIPlatformAdapter(),
modelName: "gemini-1.5-flash-exp",
generationConfig: new GenerationConfig
{
ResponseModalities = { Modality.TEXT, Modality.AUDIO }
},
safetySettings: null,
systemInstruction: "You are a helpful voice assistant."
);
// Set up events
SetupEvents();
// Initialize audio
SetupAudio();
// Connect
await _client.ConnectAsync();
}
private void SetupEvents()
{
_client.TextChunkReceived += (s, e) => Console.WriteLine($"Gemini: {e.TextChunk}");
_client.AudioChunkReceived += HandleAudioChunk;
}
private void SetupAudio()
{
// Input setup
_waveIn = new WaveInEvent { WaveFormat = new WaveFormat(16000, 1) };
_waveIn.DataAvailable += HandleMicrophoneData;
// Output setup
var waveFormat = new WaveFormat(16000, 16, 1);
_bufferedWaveProvider = new BufferedWaveProvider(waveFormat);
_waveOut = new WaveOutEvent();
_waveOut.Init(_bufferedWaveProvider);
}
private async void HandleMicrophoneData(object sender, WaveInEventArgs e)
{
if (_isListening)
{
await _client.SendAudioAsync(e.Buffer, "audio/pcm; rate=16000");
}
}
private void HandleAudioChunk(object sender, AudioChunkEventArgs e)
{
_bufferedWaveProvider.AddSamples(e.Buffer, 0, e.Buffer.Length);
if (_waveOut.PlaybackState != PlaybackState.Playing)
{
_waveOut.Play();
}
}
public void StartListening()
{
_isListening = true;
_waveIn.StartRecording();
}
public void StopListening()
{
_isListening = false;
_waveIn.StopRecording();
}
}
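To try the class from a console application, a minimal driver might look like this (the push-to-talk loop is just an example):
public static class Program
{
    public static async Task Main()
    {
        var assistant = new SimpleVoiceAssistant();
        await assistant.InitializeAsync();
        while (true)
        {
            Console.WriteLine("Press Enter to talk, or type 'q' and Enter to quit.");
            if (string.Equals(Console.ReadLine()?.Trim(), "q", StringComparison.OrdinalIgnoreCase))
                break;
            assistant.StartListening();
            Console.WriteLine("Listening... press Enter to stop.");
            Console.ReadLine();
            assistant.StopListening();
        }
    }
}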
Check out the complete WPF Sample with Source Code
Conclusion
The Google_GenerativeAI.Live package brings powerful real-time, multimodal capabilities to C# developers. By leveraging WebSockets and an event-driven architecture, it enables the creation of highly interactive applications that can seamlessly blend text and voice interactions with Gemini's advanced AI capabilities.
Whether you're building a voice assistant, an interactive chat application, or a multimodal customer service solution, this API provides the foundation you need to create responsive, engaging experiences.
Have you built something interesting with the Multimodal Live API? I'd love to hear about your experiences in the comments!