Learn how to build a fully functional, real-time AI voice agent you can talk to, using Google's Gemini API for conversational intelligence and VideoSDK for robust real-time audio infrastructure.
Ever wondered how you could talk to an AI, not by typing, but in a natural, real-time conversation? Imagine building a virtual doctor for initial consultations, an AI tutor that explains complex topics, or even a friendly companion to chat with.
Today, we're going to build just that. We'll create a fully functional, real-time AI voice agent that you can talk to directly in your browser. The agent will listen to you, understand what you're saying, and respond with a natural-sounding voice, all in real-time.
We will use the power of Google's Gemini Realtime API for lightning-fast conversational AI, and the robust infrastructure of VideoSDK to handle real-time audio streaming and session management. By the end of this tutorial, you'll have a working app with a React frontend and a Python backend that you can customize and expand upon.
Here's a quick peek at what we're building:
Prerequisites
Before we start, make sure you have the following ready:
- Node.js (v16+) and npm/yarn - for our React frontend.
- Python (3.8+) and pip - for our FastAPI backend.
- A VideoSDK Account - to get your Auth Token for session management.
- A Google Account - to get your free Gemini API key from AI Studio.
How to Get Your VideoSDK Auth Token
Your application needs an Auth Token to connect to VideoSDK.
- Sign Up: Go to videosdk.live and create a free account.
- Navigate to API Keys: Once you're on the dashboard, find "API Keys" in the left-hand menu.
- Generate a Token: You'll see your API Key and a "Generate Token" button. Click it to create a new, temporary token.
- Copy Your Token: Copy the generated token. This is the value you'll use for VIDEOSDK_TOKEN in your backend and VITE_VIDEOSDK_TOKEN in your frontend. For development, this token is fine, but for production apps, you should generate tokens securely from a server, as sketched below.
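Here's a minimal sketch of what server-side token generation could look like with the PyJWT library, assuming you've copied your API key and secret from the VideoSDK dashboard. The payload fields and permission names follow VideoSDK's token format; adjust the permissions and expiry to your needs.

# generate_token.py - optional sketch: mint a VideoSDK token on your own server with PyJWT
import datetime
import jwt  # pip install PyJWT

VIDEOSDK_API_KEY = "your_api_key_from_the_dashboard"
VIDEOSDK_SECRET = "your_secret_from_the_dashboard"

payload = {
    "apikey": VIDEOSDK_API_KEY,
    "permissions": ["allow_join"],  # "allow_mod" would grant moderator rights
    "version": 2,
    "exp": datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=2),
}

# HS256-signed JWT; hand this short-lived string to your frontend instead of a long-lived token
token = jwt.encode(payload, VIDEOSDK_SECRET, algorithm="HS256")
print(token)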
How to Get Your Google Gemini API Key
We will use Google AI Studio to get a free API key. This is the simplest way to start building with Gemini.
- Go to AI Studio: Open your browser and navigate to https://aistudio.google.com/.
- Sign In: Sign in with your Google account.
- Get API Key: Look for and click the Get API Key button, usually located in the top left corner.
- Create API Key: In the pop-up window, click Create API key in new project.
- Copy Your Key: Your new API key will be displayed. Copy it immediately and save it somewhere safe. This is the value you'll use for GOOGLE_API_KEY in your backend .env file. (A quick way to verify the key works is shown below.)
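Before wiring the key into the backend, you can optionally sanity-check it with a couple of lines of Python. This is just a sketch using the google-generativeai package, which is not one of this project's dependencies, so install it separately if you want to try it.

# check_gemini_key.py - optional: verify the key works before building anything
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="your_google_api_key_from_ai_studio")
model = genai.GenerativeModel("gemini-1.5-flash")
print(model.generate_content("Say hello in one short sentence.").text)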
Project Structure
We'll keep things simple with a monorepo structure.
/gemini-voice-agent
├── client/      # React Frontend
└── server.py    # Python FastAPI Backend
Backend Setup
Let's start by creating our Python server, which will manage the agent's connection to the meeting.
1. Create virtual environment & install dependencies
In your project root, set up a Python virtual environment.
# In the root /gemini-voice-agent directory
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install --upgrade pip
pip install fastapi uvicorn python-dotenv "videosdk-agents[google]"
The videosdk-agents[google] package conveniently bundles the core agent SDK with the necessary Google plugins.
2. Create a .env file in the project root
Create a file named .env in the root of your project and add your secret keys.
# .env
GOOGLE_API_KEY=your_google_api_key_from_ai_studio
VIDEOSDK_TOKEN=your_videosdk_auth_token_here
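If you want to confirm the file is being picked up, a tiny check with python-dotenv (already one of our dependencies) looks like this:

# check_env.py - optional: confirm both secrets load from .env
import os
from dotenv import load_dotenv

load_dotenv()
for key in ("GOOGLE_API_KEY", "VIDEOSDK_TOKEN"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")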
3. Create server.py
This file contains all our backend logic. It will expose two endpoints: one to make the agent join a meeting and one to make it leave.
# server.py
from fastapi.middleware.cors import CORSMiddleware
from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel
from videosdk.agents import Agent, AgentSession, RealTimePipeline
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
import os
import uvicorn
from dotenv import load_dotenv
from typing import Dict

# Load environment variables from .env file
load_dotenv()

# Configuration
PORT = int(os.getenv("PORT", 8000))

# Initialize FastAPI app
app = FastAPI()

# Add CORS middleware to allow cross-origin requests
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# In-memory storage for active agent sessions
active_sessions: Dict[str, AgentSession] = {}

# Define the Agent's behavior
class MyVoiceAgent(Agent):
    def __init__(self, system_prompt: str, personality: str):
        super().__init__(instructions=system_prompt)
        self.personality = personality

    async def on_enter(self) -> None:
        # A simple greeting when the agent joins
        await self.session.say("Hey, I'm Gemini. How can I help you today?")

    async def on_exit(self) -> None:
        # A simple goodbye when the agent leaves
        await self.session.say("It was nice talking to you. Goodbye!")

# Pydantic models for request body validation
class MeetingReqConfig(BaseModel):
    meeting_id: str
    token: str
    model: str
    voice: str
    personality: str
    temperature: float
    system_prompt: str
    topP: float
    topK: float

class LeaveAgentReqConfig(BaseModel):
    meeting_id: str

# This function contains the long-running agent logic
async def server_operations(req: MeetingReqConfig):
    meeting_id = req.meeting_id

    # Configure the Gemini Realtime model from Google
    model = GeminiRealtime(
        model=req.model,
        api_key=os.getenv("GOOGLE_API_KEY"),
        config=GeminiLiveConfig(
            voice=req.voice,
            response_modalities=["AUDIO"],  # We only want audio back
            temperature=req.temperature,
            top_p=req.topP,
            top_k=int(req.topK),
        ),
    )

    # Create a real-time pipeline to connect the model to the session
    pipeline = RealTimePipeline(model=model)

    # Create the agent session
    session = AgentSession(
        agent=MyVoiceAgent(req.system_prompt, req.personality),
        pipeline=pipeline,
        context={
            "meetingId": meeting_id,
            "name": "Gemini Agent",
            "videosdk_auth": req.token,
        },
    )

    # Store the session so we can manage it later
    active_sessions[meeting_id] = session

    try:
        # Start the session (this is a long-running, blocking call)
        await session.start()
    except Exception as ex:
        print(f"[{meeting_id}] [ERROR] in agent session: {ex}")
    finally:
        # Clean up the session from our active list once it ends
        if active_sessions.get(meeting_id) is session:
            active_sessions.pop(meeting_id, None)

# API endpoint to make the agent join a meeting
@app.post("/join-agent")
async def join_agent(req: MeetingReqConfig, bg_tasks: BackgroundTasks):
    # Run the long-running agent task in the background
    bg_tasks.add_task(server_operations, req)
    return {"message": f"Agent joining meeting {req.meeting_id}"}

# API endpoint to make the agent leave a meeting
@app.post("/leave-agent")
async def leave_agent(req: LeaveAgentReqConfig):
    session = active_sessions.pop(req.meeting_id, None)
    if session:
        # If a session is found, stop it gracefully
        await session.stop()
        return {"status": "removed"}
    return {"status": "not_found"}

if __name__ == "__main__":
    uvicorn.run("server:app", host="0.0.0.0", port=PORT, reload=True)
Breaking Down the Backend
- MyVoiceAgent: This class defines our agent's personality and behavior. The instructions parameter in super().__init__ is the system prompt that tells Gemini its role. on_enter and on_exit are lifecycle hooks for greetings and goodbyes.
- AgentSession: This is the core component from the VideoSDK Agent SDK. It manages the agent's connection to the VideoSDK meeting room, handling all the complex real-time communication protocols.
- GeminiRealtime: This plugin configures the connection to Google's Gemini API, including the model, voice, and response parameters.
- BackgroundTasks: The session.start() method is a blocking call that runs as long as the agent is in the meeting. We use FastAPI's BackgroundTasks to run it without freezing our API, allowing us to immediately return a response to the frontend.
- active_sessions: This dictionary is a simple way to keep track of running sessions. It allows our /leave-agent endpoint to find and gracefully shut down the correct agent. (A quick way to exercise both endpoints is sketched below.)
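If you want to exercise these endpoints before the frontend exists, here's a rough smoke test using the requests library. It assumes the server is running on localhost:8000, that MEETING_ID is a room you've already created via the VideoSDK API, and that the model and voice values are ones the Gemini Live API actually accepts; swap them for whatever your account supports.

# test_endpoints.py - rough smoke test for /join-agent and /leave-agent (assumes server on :8000)
import os
import time
import requests  # pip install requests
from dotenv import load_dotenv

load_dotenv()
BASE_URL = "http://localhost:8000"
MEETING_ID = "your-existing-videosdk-room-id"  # replace with a real room ID

join_payload = {
    "meeting_id": MEETING_ID,
    "token": os.getenv("VIDEOSDK_TOKEN"),
    "model": "gemini-2.0-flash-live-001",  # assumed Live-capable model name
    "voice": "Puck",                       # assumed supported Live voice
    "personality": "friendly",
    "temperature": 0.9,
    "system_prompt": "You are a helpful AI assistant. Keep responses concise.",
    "topP": 1,
    "topK": 40,
}

print(requests.post(f"{BASE_URL}/join-agent", json=join_payload).json())
time.sleep(30)  # join the room from a client and talk to the agent in the meantime
print(requests.post(f"{BASE_URL}/leave-agent", json={"meeting_id": MEETING_ID}).json())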
Frontend Setup
Now let's build the user interface where we can talk to our agent.
1. Create a new React + TypeScript project
Navigate to your project root and use Vite to scaffold a new app.
# From the root /gemini-voice-agent directory
npm create vite@latest client -- --template react-ts
cd client
npm install
2. Install dependencies
We need the VideoSDK React SDK for meeting controls, lucide-react for icons, and TailwindCSS for styling.
npm install @videosdk.live/react-sdk lucide-react tailwindcss postcss autoprefixer
npx tailwindcss init -p
3. Configure Tailwind CSS
Update tailwind.config.js to tell Tailwind which files to scan for classes.
// client/tailwind.config.js
/** @type {import('tailwindcss').Config} */
export default {
content: [
"./index.html",
"./src/**/*.{js,ts,jsx,tsx}",
],
theme: {
extend: {},
},
plugins: [],
}
Then, add the Tailwind directives to your main CSS file.
/* client/src/index.css */
@tailwind base;
@tailwind components;
@tailwind utilities;
4. Create a frontend .env file
In the client directory, create a .env file for your client-side environment variables.
# client/.env
VITE_VIDEOSDK_TOKEN=your_videosdk_auth_token_here
VITE_API_URL=http://localhost:8000
Note: Vite requires environment variables exposed to the browser to be prefixed with VITE_.
5. Build the React User Interface
Replace the contents of client/src/App.tsx with the following code. This component will handle creating a meeting, joining it, inviting the agent, and playing the agent's audio.
// client/src/App.tsx
import React, { useEffect, useMemo, useRef, useState } from "react";
import {
  MeetingProvider,
  useMeeting,
  useParticipant,
} from "@videosdk.live/react-sdk";
import { Mic, MicOff } from "lucide-react";

// Component to play the agent's audio
const AgentAudioPlayer = ({ participantId }: { participantId: string }) => {
  const { micStream, isMicOn } = useParticipant(participantId);
  const audioRef = useRef<HTMLAudioElement>(null);

  useEffect(() => {
    if (audioRef.current && micStream) {
      const mediaStream = new MediaStream();
      mediaStream.addTrack(micStream.track);
      audioRef.current.srcObject = mediaStream;
      audioRef.current.play().catch((err) => {
        console.error("Error playing audio:", err);
      });
    }
  }, [micStream]);

  return <>{isMicOn && <audio ref={audioRef} autoPlay style={{ display: "none" }} />}</>;
};

// Main meeting view component
const MeetingView = ({ meetingId, onLeave }: { meetingId: string, onLeave: () => void }) => {
  const { join, leave, toggleMic, localParticipant, participants } = useMeeting();

  const agentParticipant = useMemo(
    () => [...participants.values()].find(p => p.displayName === "Gemini Agent"),
    [participants]
  );

  useEffect(() => {
    // Call the backend to invite the agent when the user joins
    fetch(`${import.meta.env.VITE_API_URL}/join-agent`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        meeting_id: meetingId,
        token: import.meta.env.VITE_VIDEOSDK_TOKEN,
        // Agent configuration - you can customize this.
        // Use a Live-capable Gemini model and a supported Live voice here.
        model: "gemini-2.0-flash-live-001",
        voice: "Puck",
        personality: "friendly",
        temperature: 0.9,
        system_prompt: "You are a helpful AI assistant. Keep your responses concise and to the point.",
        topP: 1,
        topK: 40,
      }),
    });

    return () => {
      // Call the backend to make the agent leave when the component unmounts
      fetch(`${import.meta.env.VITE_API_URL}/leave-agent`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ meeting_id: meetingId }),
      });
    };
  }, [meetingId]);

  return (
    <div className="flex flex-col items-center justify-center h-full p-4">
      <h2 className="text-2xl font-bold mb-4">Meeting ID: {meetingId}</h2>
      <div className="flex items-center justify-center space-x-4">
        <div className="flex flex-col items-center">
          <div className="w-24 h-24 bg-gray-700 rounded-full flex items-center justify-center">
            <p className="text-3xl">You</p>
          </div>
          <p>{localParticipant.displayName}</p>
        </div>
        {agentParticipant ? (
          <div className="flex flex-col items-center">
            <div className="w-24 h-24 bg-blue-600 rounded-full flex items-center justify-center">
              <p className="text-3xl">AI</p>
            </div>
            <p>{agentParticipant.displayName}</p>
            <AgentAudioPlayer participantId={agentParticipant.id} />
          </div>
        ) : (
          <div className="flex flex-col items-center">
            <div className="w-24 h-24 bg-gray-700 rounded-full flex items-center justify-center animate-pulse">
              <p className="text-3xl">AI</p>
            </div>
            <p>Agent is joining...</p>
          </div>
        )}
      </div>
      <div className="mt-8 flex space-x-4">
        <button onClick={() => toggleMic()} className="p-3 bg-gray-600 rounded-full">
          {localParticipant.isMicOn ? <Mic /> : <MicOff />}
        </button>
        <button onClick={() => {
          leave();
          onLeave();
        }} className="px-4 py-2 bg-red-500 rounded-lg">Leave</button>
      </div>
    </div>
  );
};

// Main App component
function App() {
  const [meetingId, setMeetingId] = useState<string | null>(null);

  const createMeeting = async () => {
    try {
      const res = await fetch(`https://api.videosdk.live/v2/rooms`, {
        method: "POST",
        headers: {
          authorization: `${import.meta.env.VITE_VIDEOSDK_TOKEN}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({}),
      });
      const { roomId } = await res.json();
      setMeetingId(roomId);
    } catch (error) {
      console.error("Error creating meeting:", error);
      alert("Failed to create meeting.");
    }
  };

  const onMeetingLeave = () => {
    setMeetingId(null);
  };

  return meetingId ? (
    <MeetingProvider
      config={{
        meetingId,
        micEnabled: true,
        webcamEnabled: false,
        name: "User",
        token: import.meta.env.VITE_VIDEOSDK_TOKEN,
      }}
      joinWithoutUserInteraction
    >
      <MeetingView meetingId={meetingId} onLeave={onMeetingLeave} />
    </MeetingProvider>
  ) : (
    <div className="h-screen w-screen flex flex-col items-center justify-center bg-gray-800 text-white">
      <h1 className="text-4xl font-bold mb-8">Gemini AI Voice Agent</h1>
      <button onClick={createMeeting} className="px-6 py-3 bg-blue-600 rounded-lg text-xl">
        Start a Conversation
      </button>
    </div>
  );
}

export default App;
Breaking Down the Frontend
- MeetingProvider: This is the top-level wrapper from the VideoSDK React SDK. It provides the meeting context to all child components.
- useMeeting() hook: This powerful hook gives us access to all essential meeting functions like join, leave, toggleMic, and the list of participants.
- useParticipant() hook: This hook provides real-time information about a specific participant, including their micStream, which contains the raw audio data.
- AgentAudioPlayer component: This small component is crucial. It takes the agent's participantId, gets the micStream using the useParticipant hook, and pipes it into a standard HTML <audio> element to be played.
- API Calls: When the user joins the meeting, a useEffect hook fires a POST request to our /join-agent backend endpoint. The cleanup function of the useEffect fires when the user leaves, calling the /leave-agent endpoint to ensure the agent is removed from the call and server resources are freed.
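One more note on createMeeting: the frontend calls VideoSDK's REST API directly from the browser using the auth token. If you'd prefer to pre-create rooms on the server (for example, so the token never ships to the client), here's a sketch of the same POST /v2/rooms call in Python; how you hand the resulting roomId to the frontend is up to you.

# create_room.py - sketch: create a VideoSDK room server-side instead of in the browser
import os
import requests
from dotenv import load_dotenv

load_dotenv()

response = requests.post(
    "https://api.videosdk.live/v2/rooms",
    headers={
        "Authorization": os.getenv("VIDEOSDK_TOKEN"),  # same auth token the frontend uses
        "Content-Type": "application/json",
    },
    json={},
)
response.raise_for_status()
print(response.json()["roomId"])  # hand this roomId to the client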
Run the App
It's time to see our creation in action! You'll need two separate terminal windows.
1. Start the Backend Server
In your first terminal, at the project root:
# Make sure your virtual environment is active
source venv/bin/activate
# Start the server
uvicorn server:app --host 0.0.0.0 --port 8000 --reload
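Once it's running, you can quickly confirm the API is up. FastAPI serves interactive docs at /docs by default, so a small probe like this (or simply opening http://localhost:8000/docs in a browser) should report 200:

# ping_server.py - optional: confirm the backend is reachable
import requests

resp = requests.get("http://localhost:8000/docs")
print(resp.status_code)  # expect 200 if the server started correctly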
2. Start the Frontend App
In your second terminal, navigate to the client directory:
cd client
npm run dev
Now, open your browser and go to http://localhost:5173. Click "Start a Conversation," allow microphone permissions, and start talking to your very own AI agent!
Conclusion
Congratulations! You've successfully built a fully functional, real-time AI voice agent using Google Gemini and VideoSDK. You've learned how to:
- Set up a Python backend to manage an AI agent.
- Connect to Google's Gemini Realtime API for conversational AI.
- Use VideoSDK to handle real-time audio streaming and session management.
- Build a React frontend to interact with the agent in a browser.
This is just the beginning. You can now customize the agent's system prompt, personality, and even give it new tools and capabilities.
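For example, here's a hypothetical /join-agent request body that turns the agent into a patient math tutor. Nothing in the backend changes, only the prompt and sampling parameters; the model and voice names are assumed Gemini Live values, so swap them for ones your account supports.

# A hypothetical /join-agent request body for a tutor persona
tutor_payload = {
    "meeting_id": "your-room-id",
    "token": "your-videosdk-auth-token",
    "model": "gemini-2.0-flash-live-001",  # assumed Live-capable model
    "voice": "Kore",                       # assumed supported Live voice
    "personality": "patient tutor",
    "temperature": 0.4,  # lower temperature for more focused explanations
    "system_prompt": (
        "You are a patient math tutor. Explain concepts step by step, "
        "ask the student a question after each explanation, and keep every "
        "answer under three sentences."
    ),
    "topP": 0.9,
    "topK": 40,
}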
If you build something cool with this, we'd love to see it. Share it on X/Twitter and tag @video_sdk!
Happy hacking!