Call it luck or skill, but this gave me the best results.
The secret? VideoSDK + Gemini Live is hands down the best combo for a real-time, talking AI that actually works. Forget clunky chatbots and laggy voice assistants; this setup lets your AI listen, understand, and respond instantly, just like a human.
In this post, we’ll show you, step by step, how to bring your AI to life, from setup to first conversation, so you can create your own smart, interactive agent in no time. By the end, you’ll see why this combo is a game-changer for anyone building real-time AI.
Step 1: Setting Up Your Agent Environment
Let's get started by setting up our development environment.
Prerequisites:
- A VideoSDK authentication token (generate one from app.videosdk.live; follow the authentication guide to generate a VideoSDK token)
- A VideoSDK meeting ID (generate one using the Create Room API, as in the sketch after this list, or through the VideoSDK dashboard)
- A Google Cloud project with the Gemini API enabled and an API key (refer to the Google Cloud documentation for setup instructions)
- Python 3.12 or higher
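As mentioned above, you can mint a meeting ID with the Create Room API. Here is a minimal sketch using only the Python standard library; the https://api.videosdk.live/v2/rooms endpoint and the roomId response field follow VideoSDK's Create Room API docs, and the script assumes your auth token is already exported as the VIDEOSDK_AUTH_TOKEN environment variable:
import json
import os
import urllib.request

# Assumes VIDEOSDK_AUTH_TOKEN is exported in your shell
token = os.environ["VIDEOSDK_AUTH_TOKEN"]

request = urllib.request.Request(
    "https://api.videosdk.live/v2/rooms",
    data=b"",  # empty POST body
    method="POST",
    headers={"Authorization": token},
)
with urllib.request.urlopen(request) as response:
    room = json.load(response)

print(room["roomId"])  # use this as your meeting ID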
Installation:
First, create a virtual environment and install the necessary packages:
python3 -m venv venv
source venv/bin/activate
pip install videosdk-agents videosdk-plugins-google python-dotenv
Configuration:
Create a .env file in your project root to store your API keys securely:
VIDEOSDK_AUTH_TOKEN="YOUR_VIDEOSDK_AUTH_TOKEN"
GOOGLE_API_KEY="YOUR_GEMINI_API_KEY"
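Before going further, it is worth confirming that both keys actually load via python-dotenv. A quick sanity check (the check_env.py filename is just a suggestion):
# check_env.py
import os
from dotenv import load_dotenv

# Read .env from the project root into the process environment
load_dotenv()

for key in ("VIDEOSDK_AUTH_TOKEN", "GOOGLE_API_KEY"):
    assert os.getenv(key), f"{key} is missing from .env"
print("Environment looks good.")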
Step 2: Defining the AI Pipeline with Gemini Live
The RealTimePipeline streams audio from the VideoSDK meeting to Gemini Live, which handles transcription, reasoning, and speech generation in a single model, and then streams the generated audio back into the meeting, all with minimal latency.
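In code, that wiring comes down to just two objects; the full script in Step 3 puts this in context:
from videosdk.agents import RealTimePipeline
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig

# Gemini Live handles speech in and speech out in one model
model = GeminiRealtime(
    model="gemini-2.0-flash-live-001",
    config=GeminiLiveConfig(voice="Leda", response_modalities=["AUDIO"]),
)
pipeline = RealTimePipeline(model=model)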
Step 3: Creating Your Conversational Agent
Create a main.py file with the following code:
import asyncio

from dotenv import load_dotenv
from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig

# Load VIDEOSDK_AUTH_TOKEN and GOOGLE_API_KEY from .env
load_dotenv()


class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions and help with tasks."
        )

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")


async def start_session(context: JobContext):
    # Initialize the model
    model = GeminiRealtime(
        model="gemini-2.0-flash-live-001",
        # When GOOGLE_API_KEY is set in .env, don't pass the api_key parameter
        config=GeminiLiveConfig(
            voice="Leda",  # Other options: Puck, Charon, Kore, Fenrir, Aoede, Orus, and Zephyr
            response_modalities=["AUDIO"]
        )
    )

    # Create the pipeline
    pipeline = RealTimePipeline(model=model)

    session = AgentSession(
        agent=MyVoiceAgent(),
        pipeline=pipeline
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()


def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Realtime Agent",
        playground=True
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
To Run Your Agent:
- Save the complete code as main.py.
- Run it from your terminal: python main.py
- The script will output a VideoSDK Playground URL. Open this URL in your browser.
- Join the meeting from your browser, and your Gemini Live-powered AI agent will introduce itself and be ready to converse in real time!
Step 4: Integrate into a Live Meeting
You can take your AI agent one step further by joining it to a live meeting: use the same meeting ID, and your agent can start interacting in real time alongside participants (see the sketch after the links below).
- Using JavaScript: link
- Using ReactJS: link
- Using React Native: link
- Using Android: link
- Using Flutter: link
- Using iOS: link
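For example, reusing make_context from Step 3, you would set room_id to the meeting your participants join; setting playground=False outside the playground is an assumption here, not something the docs above spell out:
def make_context() -> JobContext:
    room_options = RoomOptions(
        room_id="YOUR_MEETING_ID",  # the same meeting ID your participants use
        name="VideoSDK Realtime Agent",
        playground=False,  # assumption: disable the playground when joining a real meeting
    )
    return JobContext(room_options=room_options)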
Conclusion
Congrats! You’ve just built a real-time conversational AI agent using Google’s Gemini Live API and VideoSDK. This combo enables fast, natural, low-latency interactions, taking your project far beyond traditional chatbots.
Whether it’s a virtual assistant, an interactive tutor, or next-gen customer support, the possibilities are endless. The future of conversational AI is real-time, and now you have the tools to make it happen.
💡 We’d love to hear from you!
- Were you able to set up your first AI voice agent in Python?
- What challenges did you face while integrating the real-time pipeline?
- Are you more curious about cascading pipeline or real-time pipeline?
- How do you envision AI voice assistants transforming customer experiences in your business?
👉 Share your thoughts, hurdles, or success stories in the comments, or join our Discord community. We can’t wait to learn from your journey and help you build even smarter, AI-powered communication tools!