Learn how to build a fully functional, real-time AI voice agent you can talk to, using Google's Gemini API for conversational intelligence and VideoSDK for robust real-time audio infrastructure.
Ever wondered how you could talk to an AI, not by typing, but in a natural, real-time conversation? Imagine building a virtual doctor for initial consultations, an AI tutor that explains complex topics, or even a friendly companion to chat with.
Today, we're going to build just that. We'll create a fully functional, real-time AI voice agent that you can talk to directly in your browser. The agent will listen to you, understand what you're saying, and respond with a natural-sounding voice, all in real-time.
We will use the power of Google's Gemini Realtime API for lightning-fast conversational AI, and the robust infrastructure of VideoSDK to handle real-time audio streaming and session management. By the end of this tutorial, you'll have a working app with a React frontend and a Python backend that you can customize and expand upon.
Here's a quick peek at what we're building:
Prerequisites
Before we start, make sure you have the following ready:
- Node.js (v16+) and npm/yarn - for our React frontend.
- Python (3.8+) and pip - for our FastAPI backend.
- A VideoSDK Account - to get your Auth Token for session management.
- A Google Account - to get your free Gemini API key from AI Studio.
How to Get Your VideoSDK Auth Token
Your application needs an Auth Token to connect to VideoSDK.
- Sign Up: Go to videosdk.live and create a free account.
- Navigate to API Keys: Once you're on the dashboard, find "API Keys" in the left-hand menu.
- Generate a Token: You'll see your API Key and a "Generate Token" button. Click it to create a new, temporary token.
- Copy Your Token: Copy the generated token. This is the value you'll use for VIDEOSDK_TOKEN in your backend and VITE_VIDEOSDK_TOKEN in your frontend. For development, this token is fine, but for production apps, you should generate tokens securely from a server, as sketched below.
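Here's a minimal sketch of what server-side token generation could look like with the PyJWT library, assuming you've copied your API key and secret from the VideoSDK dashboard. The payload fields and permission names follow VideoSDK's token format; adjust the permissions and expiry to your needs.

# generate_token.py - optional sketch: mint a VideoSDK token on your own server with PyJWT
import datetime
import jwt  # pip install PyJWT

VIDEOSDK_API_KEY = "your_api_key_from_the_dashboard"
VIDEOSDK_SECRET = "your_secret_from_the_dashboard"

payload = {
    "apikey": VIDEOSDK_API_KEY,
    "permissions": ["allow_join"],  # "allow_mod" would grant moderator rights
    "version": 2,
    "exp": datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=2),
}

# HS256-signed JWT; hand this short-lived string to your frontend instead of a long-lived token
token = jwt.encode(payload, VIDEOSDK_SECRET, algorithm="HS256")
print(token)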
How to Get Your Google Gemini API Key
We will use Google AI Studio to get a free API key. This is the simplest way to start building with Gemini.
- Go to AI Studio: Open your browser and navigate to https://aistudio.google.com/.
- Sign In: Sign in with your Google account.
- Get API Key: Look for and click the Get API Key button, usually located in the top left corner.
- Create API Key: In the pop-up window, click Create API key in new project.
- Copy Your Key: Your new API key will be displayed. Copy it immediately and save it somewhere safe. This is the value you'll use for GOOGLE_API_KEY in your backend .env file. (A quick way to verify the key works is shown below.)
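Before wiring the key into the backend, you can optionally sanity-check it with a couple of lines of Python. This is just a sketch using the google-generativeai package, which is not one of this project's dependencies, so install it separately if you want to try it.

# check_gemini_key.py - optional: verify the key works before building anything
import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="your_google_api_key_from_ai_studio")
model = genai.GenerativeModel("gemini-1.5-flash")
print(model.generate_content("Say hello in one short sentence.").text)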
Project Structure
We'll keep things simple with a monorepo structure.
/gemini-voice-agent
├── client/      # React Frontend
└── server.py    # Python FastAPI Backend
Backend Setup
Let's start by creating our Python server, which will manage the agent's connection to the meeting.
1. Create virtual environment & install dependencies
In your project root, set up a Python virtual environment.
# In the root /gemini-voice-agent directory
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install --upgrade pip
pip install fastapi uvicorn python-dotenv "videosdk-agents[google]"
The videosdk-agents[google] package conveniently bundles the core agent SDK with the necessary Google plugins.
2. Create a .env file in the project root
Create a file named .env in the root of your project and add your secret keys.
# .env
GOOGLE_API_KEY=your_google_api_key_from_ai_studio
VIDEOSDK_TOKEN=your_videosdk_auth_token_here
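If you want to confirm the file is being picked up, a tiny check with python-dotenv (already one of our dependencies) looks like this:

# check_env.py - optional: confirm both secrets load from .env
import os
from dotenv import load_dotenv

load_dotenv()
for key in ("GOOGLE_API_KEY", "VIDEOSDK_TOKEN"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")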
3. Create server.py
This file contains all our backend logic. It will expose two endpoints: one to make the agent join a meeting and one to make it leave.
# server.py
from fastapi.middleware.cors import CORSMiddleware
from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel
from videosdk.agents import Agent, AgentSession, RealTimePipeline
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
import os
import uvicorn
from dotenv import load_dotenv
from typing import Dict

# Load environment variables from .env file
load_dotenv()

# Configuration
PORT = int(os.getenv("PORT", 8000))

# Initialize FastAPI app
app = FastAPI()

# Add CORS middleware to allow cross-origin requests
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# In-memory storage for active agent sessions
active_sessions: Dict[str, AgentSession] = {}

# Define the Agent's behavior
class MyVoiceAgent(Agent):
    def __init__(self, system_prompt: str, personality: str):
        super().__init__(instructions=system_prompt)
        self.personality = personality

    async def on_enter(self) -> None:
        # A simple greeting when the agent joins
        await self.session.say("Hey, I'm Gemini. How can I help you today?")

    async def on_exit(self) -> None:
        # A simple goodbye when the agent leaves
        await self.session.say("It was nice talking to you. Goodbye!")

# Pydantic models for request body validation
class MeetingReqConfig(BaseModel):
    meeting_id: str
    token: str
    model: str
    voice: str
    personality: str
    temperature: float
    system_prompt: str
    topP: float
    topK: float

class LeaveAgentReqConfig(BaseModel):
    meeting_id: str

# This function contains the long-running agent logic
async def server_operations(req: MeetingReqConfig):
    meeting_id = req.meeting_id

    # Configure the Gemini Realtime model from Google
    model = GeminiRealtime(
        model=req.model,
        api_key=os.getenv("GOOGLE_API_KEY"),
        config=GeminiLiveConfig(
            voice=req.voice,
            response_modalities=["AUDIO"],  # We only want audio back
            temperature=req.temperature,
            top_p=req.topP,
            top_k=int(req.topK),
        ),
    )

    # Create a real-time pipeline to connect the model to the session
    pipeline = RealTimePipeline(model=model)

    # Create the agent session
    session = AgentSession(
        agent=MyVoiceAgent(req.system_prompt, req.personality),
        pipeline=pipeline,
        context={
            "meetingId": meeting_id,
            "name": "Gemini Agent",
            "videosdk_auth": req.token,
        },
    )

    # Store the session so we can manage it later
    active_sessions[meeting_id] = session

    try:
        # Start the session (this is a long-running, blocking call)
        await session.start()
    except Exception as ex:
        print(f"[{meeting_id}] [ERROR] in agent session: {ex}")
    finally:
        # Clean up the session from our active list once it ends
        if active_sessions.get(meeting_id) is session:
            active_sessions.pop(meeting_id, None)

# API endpoint to make the agent join a meeting
@app.post("/join-agent")
async def join_agent(req: MeetingReqConfig, bg_tasks: BackgroundTasks):
    # Run the long-running agent task in the background
    bg_tasks.add_task(server_operations, req)
    return {"message": f"Agent joining meeting {req.meeting_id}"}

# API endpoint to make the agent leave a meeting
@app.post("/leave-agent")
async def leave_agent(req: LeaveAgentReqConfig):
    session = active_sessions.pop(req.meeting_id, None)
    if session:
        # If a session is found, stop it gracefully
        await session.stop()
        return {"status": "removed"}
    return {"status": "not_found"}

if __name__ == "__main__":
    uvicorn.run("server:app", host="0.0.0.0", port=PORT, reload=True)
Breaking Down the Backend
- MyVoiceAgent: This class defines our agent's personality and behavior. The instructions parameter in super().__init__ is the system prompt that tells Gemini its role. on_enter and on_exit are lifecycle hooks for greetings and goodbyes.
- AgentSession: This is the core component from the VideoSDK Agent SDK. It manages the agent's connection to the VideoSDK meeting room, handling all the complex real-time communication protocols.
- GeminiRealtime: This plugin configures the connection to Google's Gemini API, including the model, voice, and response parameters.
- BackgroundTasks: The session.start() method is a blocking call that runs as long as the agent is in the meeting. We use FastAPI's BackgroundTasks to run it without freezing our API, allowing us to immediately return a response to the frontend.
- active_sessions: This dictionary is a simple way to keep track of running sessions. It allows our /leave-agent endpoint to find and gracefully shut down the correct agent. (A quick way to exercise both endpoints is sketched below.)
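If you want to exercise these endpoints before the frontend exists, here's a rough smoke test using the requests library. It assumes the server is running on localhost:8000, that MEETING_ID is a room you've already created via the VideoSDK API, and that the model and voice values are ones the Gemini Live API actually accepts; swap them for whatever your account supports.

# test_endpoints.py - rough smoke test for /join-agent and /leave-agent (assumes server on :8000)
import os
import time
import requests  # pip install requests
from dotenv import load_dotenv

load_dotenv()
BASE_URL = "http://localhost:8000"
MEETING_ID = "your-existing-videosdk-room-id"  # replace with a real room ID

join_payload = {
    "meeting_id": MEETING_ID,
    "token": os.getenv("VIDEOSDK_TOKEN"),
    "model": "gemini-2.0-flash-live-001",  # assumed Live-capable model name
    "voice": "Puck",                       # assumed supported Live voice
    "personality": "friendly",
    "temperature": 0.9,
    "system_prompt": "You are a helpful AI assistant. Keep responses concise.",
    "topP": 1,
    "topK": 40,
}

print(requests.post(f"{BASE_URL}/join-agent", json=join_payload).json())
time.sleep(30)  # join the room from a client and talk to the agent in the meantime
print(requests.post(f"{BASE_URL}/leave-agent", json={"meeting_id": MEETING_ID}).json())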
Frontend Setup
Now let's build the user interface where we can talk to our agent.
1. Create a new React + TypeScript project
Navigate to your project root and use Vite to scaffold a new app.
# From the root /gemini-voice-agent directory
npm create vite@latest client -- --template react-ts
cd client
npm install
2. Install dependencies
We need the VideoSDK React SDK for meeting controls, lucide-react for icons, and TailwindCSS for styling.
npm install @videosdk.live/react-sdk lucide-react tailwindcss postcss autoprefixer
npx tailwindcss init -p
3. Configure Tailwind CSS
Update tailwind.config.js to tell Tailwind which files to scan for classes.
// client/tailwind.config.js
/** @type {import('tailwindcss').Config} */
export default {
content: [
"./index.html",
"./src/**/*.{js,ts,jsx,tsx}",
],
theme: {
extend: {},
},
plugins: [],
}
Then, add the Tailwind directives to your main CSS file.
/* client/src/index.css */
@tailwind base;
@tailwind components;
@tailwind utilities;
4. Create a frontend .env file
In the client directory, create a .env file for your client-side environment variables.
# client/.env
VITE_VIDEOSDK_TOKEN=your_videosdk_auth_token_here
VITE_API_URL=http://localhost:8000
Note: Vite requires environment variables exposed to the browser to be prefixed with VITE_.
5. Build the React User Interface
Replace the contents of client/src/App.tsx with the following code. This component will handle creating a meeting, joining it, inviting the agent, and playing the agent's audio.
// client/src/App.tsx
import React, { useEffect, useMemo, useRef, useState } from "react";
import {
  MeetingProvider,
  useMeeting,
  useParticipant,
} from "@videosdk.live/react-sdk";
import { Mic, MicOff } from "lucide-react";

// Component to play the agent's audio
const AgentAudioPlayer = ({ participantId }: { participantId: string }) => {
  const { micStream, isMicOn } = useParticipant(participantId);
  const audioRef = useRef<HTMLAudioElement>(null);

  useEffect(() => {
    if (audioRef.current && micStream) {
      const mediaStream = new MediaStream();
      mediaStream.addTrack(micStream.track);
      audioRef.current.srcObject = mediaStream;
      audioRef.current.play().catch((err) => {
        console.error("Error playing audio:", err);
      });
    }
  }, [micStream]);

  return <>{isMicOn && <audio ref={audioRef} autoPlay style={{ display: "none" }} />}</>;
};

// Main meeting view component
const MeetingView = ({ meetingId, onLeave }: { meetingId: string, onLeave: () => void }) => {
  const { join, leave, toggleMic, localParticipant, participants } = useMeeting();

  const agentParticipant = useMemo(
    () => [...participants.values()].find(p => p.displayName === "Gemini Agent"),
    [participants]
  );

  useEffect(() => {
    // Call the backend to invite the agent when the user joins
    fetch(`${import.meta.env.VITE_API_URL}/join-agent`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        meeting_id: meetingId,
        token: import.meta.env.VITE_VIDEOSDK_TOKEN,
        // Agent configuration - you can customize this.
        // Use a Live-capable Gemini model and a supported Live voice here.
        model: "gemini-2.0-flash-live-001",
        voice: "Puck",
        personality: "friendly",
        temperature: 0.9,
        system_prompt: "You are a helpful AI assistant. Keep your responses concise and to the point.",
        topP: 1,
        topK: 40,
      }),
    });

    return () => {
      // Call the backend to make the agent leave when the component unmounts
      fetch(`${import.meta.env.VITE_API_URL}/leave-agent`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ meeting_id: meetingId }),
      });
    };
  }, [meetingId]);

  return (
    <div className="flex flex-col items-center justify-center h-full p-4">
      <h2 className="text-2xl font-bold mb-4">Meeting ID: {meetingId}</h2>
      <div className="flex items-center justify-center space-x-4">
        <div className="flex flex-col items-center">
          <div className="w-24 h-24 bg-gray-700 rounded-full flex items-center justify-center">
            <p className="text-3xl">You</p>
          </div>
          <p>{localParticipant.displayName}</p>
        </div>
        {agentParticipant ? (
          <div className="flex flex-col items-center">
            <div className="w-24 h-24 bg-blue-600 rounded-full flex items-center justify-center">
              <p className="text-3xl">AI</p>
            </div>
            <p>{agentParticipant.displayName}</p>
            <AgentAudioPlayer participantId={agentParticipant.id} />
          </div>
        ) : (
          <div className="flex flex-col items-center">
            <div className="w-24 h-24 bg-gray-700 rounded-full flex items-center justify-center animate-pulse">
              <p className="text-3xl">AI</p>
            </div>
            <p>Agent is joining...</p>
          </div>
        )}
      </div>
      <div className="mt-8 flex space-x-4">
        <button onClick={() => toggleMic()} className="p-3 bg-gray-600 rounded-full">
          {localParticipant.isMicOn ? <Mic /> : <MicOff />}
        </button>
        <button onClick={() => {
          leave();
          onLeave();
        }} className="px-4 py-2 bg-red-500 rounded-lg">Leave</button>
      </div>
    </div>
  );
};

// Main App component
function App() {
  const [meetingId, setMeetingId] = useState<string | null>(null);

  const createMeeting = async () => {
    try {
      const res = await fetch(`https://api.videosdk.live/v2/rooms`, {
        method: "POST",
        headers: {
          authorization: `${import.meta.env.VITE_VIDEOSDK_TOKEN}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({}),
      });
      const { roomId } = await res.json();
      setMeetingId(roomId);
    } catch (error) {
      console.error("Error creating meeting:", error);
      alert("Failed to create meeting.");
    }
  };

  const onMeetingLeave = () => {
    setMeetingId(null);
  };

  return meetingId ? (
    <MeetingProvider
      config={{
        meetingId,
        micEnabled: true,
        webcamEnabled: false,
        name: "User",
        token: import.meta.env.VITE_VIDEOSDK_TOKEN,
      }}
      joinWithoutUserInteraction
    >
      <MeetingView meetingId={meetingId} onLeave={onMeetingLeave} />
    </MeetingProvider>
  ) : (
    <div className="h-screen w-screen flex flex-col items-center justify-center bg-gray-800 text-white">
      <h1 className="text-4xl font-bold mb-8">Gemini AI Voice Agent</h1>
      <button onClick={createMeeting} className="px-6 py-3 bg-blue-600 rounded-lg text-xl">
        Start a Conversation
      </button>
    </div>
  );
}

export default App;
Breaking Down the Frontend
- MeetingProvider: This is the top-level wrapper from the VideoSDK React SDK. It provides the meeting context to all child components.
- useMeeting() hook: This powerful hook gives us access to all essential meeting functions like join, leave, toggleMic, and the list of participants.
- useParticipant() hook: This hook provides real-time information about a specific participant, including their micStream, which contains the raw audio data.
- AgentAudioPlayer component: This small component is crucial. It takes the agent's participantId, gets the micStream using the useParticipant hook, and pipes it into a standard HTML <audio> element to be played.
- API Calls: When the user joins the meeting, a useEffect hook fires a POST request to our /join-agent backend endpoint. The cleanup function of the useEffect fires when the user leaves, calling the /leave-agent endpoint to ensure the agent is removed from the call and server resources are freed.
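One more note on createMeeting: the frontend calls VideoSDK's REST API directly from the browser using the auth token. If you'd prefer to pre-create rooms on the server (for example, so the token never ships to the client), here's a sketch of the same POST /v2/rooms call in Python; how you hand the resulting roomId to the frontend is up to you.

# create_room.py - sketch: create a VideoSDK room server-side instead of in the browser
import os
import requests
from dotenv import load_dotenv

load_dotenv()

response = requests.post(
    "https://api.videosdk.live/v2/rooms",
    headers={
        "Authorization": os.getenv("VIDEOSDK_TOKEN"),  # same auth token the frontend uses
        "Content-Type": "application/json",
    },
    json={},
)
response.raise_for_status()
print(response.json()["roomId"])  # hand this roomId to the client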
Run the App
It's time to see our creation in action! You'll need two separate terminal windows.
1. Start the Backend Server
In your first terminal, at the project root:
# Make sure your virtual environment is active
source venv/bin/activate
# Start the server
uvicorn server:app --host 0.0.0.0 --port 8000 --reload
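Once it's running, you can quickly confirm the API is up. FastAPI serves interactive docs at /docs by default, so a small probe like this (or simply opening http://localhost:8000/docs in a browser) should report 200:

# ping_server.py - optional: confirm the backend is reachable
import requests

resp = requests.get("http://localhost:8000/docs")
print(resp.status_code)  # expect 200 if the server started correctly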
2. Start the Frontend App
In your second terminal, navigate to the client directory:
cd client
npm run dev
Now, open your browser and go to http://localhost:5173. Click "Start a Conversation," allow microphone permissions, and start talking to your very own AI agent!
Conclusion
Congratulations! You've successfully built a fully functional, real-time AI voice agent using Google Gemini and VideoSDK. You've learned how to:
- Set up a Python backend to manage an AI agent.
- Connect to Google's Gemini Realtime API for conversational AI.
- Use VideoSDK to handle real-time audio streaming and session management.
- Build a React frontend to interact with the agent in a browser.
This is just the beginning. You can now customize the agent's system prompt, personality, and even give it new tools and capabilities.
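For example, here's a hypothetical /join-agent request body that turns the agent into a patient math tutor. Nothing in the backend changes, only the prompt and sampling parameters; the model and voice names are assumed Gemini Live values, so swap them for ones your account supports.

# A hypothetical /join-agent request body for a tutor persona
tutor_payload = {
    "meeting_id": "your-room-id",
    "token": "your-videosdk-auth-token",
    "model": "gemini-2.0-flash-live-001",  # assumed Live-capable model
    "voice": "Kore",                       # assumed supported Live voice
    "personality": "patient tutor",
    "temperature": 0.4,  # lower temperature for more focused explanations
    "system_prompt": (
        "You are a patient math tutor. Explain concepts step by step, "
        "ask the student a question after each explanation, and keep every "
        "answer under three sentences."
    ),
    "topP": 0.9,
    "topK": 40,
}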
If you build something cool with this, we'd love to see it. Share it on X/Twitter and tag @video_sdk!
Happy hacking!