DEV Community

Chaitrali Kakde
Create your own Talking AI Avatar Agent 🫨

AI voice agents are great at talking but not at connecting. Voice alone can't fully express emotion, empathy, or trust. In this blog, we'll explore how to give your AI a face and personality using VideoSDK's Simli Avatar integration, making every interaction more lifelike and engaging.

Let’s answer what most builders are wondering:

  • How can I make my avatar talk like a human?
  • What are avatars, and how do they actually work with voice agents?
  • How do I integrate an avatar into my VideoSDK pipeline?
  • What are the best practices for creating realistic, reliable avatars?

What is an AI Avatar?

An AI avatar is a real-time visual representation of your voice-based AI agent, showing facial expressions and mimicking natural movement. Using tools like Simli, avatars render in real time, giving your AI a relatable, human-like presence.

How it Works

  • The user provides voice input, which enters a pipeline.
  • Depending on the pipeline type, it follows one of two paths:
    • Realtime Pipeline – uses a Speech-to-Speech model for instant responses.
    • Cascading Pipeline – processes through multiple stages: VAD (Voice Activity Detection) → STT (Speech-to-Text) → LLM (Language Model) → TTS (Text-to-Speech).
  • The output audio is sent to the avatar for lip-synced playback.
  • Through a VideoSDK Room, the user and avatar communicate over an audio/video stream, enabling interactive, real-time conversations.
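The cascading path above can be sketched as a chain of plain functions. This is a minimal sketch of the data flow only: the `vad`, `stt`, `llm`, and `tts` stubs below are hypothetical stand-ins for the real plugin calls, not actual models.

```python
# Minimal sketch of the cascading pipeline flow: each stage is a callable,
# and the output of one stage feeds the next. All stubs are placeholders.

def vad(audio: bytes) -> bool:
    """Voice Activity Detection: decide whether the chunk contains speech."""
    return len(audio) > 0

def stt(audio: bytes) -> str:
    """Speech-to-Text stub: a real pipeline would call an STT plugin here."""
    return "tell me a joke"

def llm(prompt: str) -> str:
    """LLM stub: a real pipeline would call Gemini or another model here."""
    return f"Responding to: {prompt}"

def tts(text: str) -> bytes:
    """Text-to-Speech stub: a real pipeline returns synthesized audio."""
    return text.encode("utf-8")

def cascading_pipeline(audio: bytes):
    # Skip silent chunks, then run the STT -> LLM -> TTS chain.
    if not vad(audio):
        return None
    return tts(llm(stt(audio)))
```

In the actual SDK these stages are wired together for you by the pipeline classes; the sketch just makes the data flow explicit.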


Let's create a talking AI Avatar

Project structure

├── main.py              # Main agent implementation
├── requirements.txt     # Python dependencies
├── mcp_joke.py          # Joke MCP server
├── .env.example         # Environment variables template
└── README.md            # Project documentation

Pre-requisites

Create a .env file

VIDEOSDK_AUTH_TOKEN=""
SIMLI_API_KEY=""
GOOGLE_API_KEY=""
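It helps to fail fast when a key is missing rather than hitting a confusing error deep in the agent. Here is a minimal stdlib-only sketch; the `require_env` helper is a hypothetical name, not part of any SDK:

```python
import os

def require_env(name: str) -> str:
    """Return an environment variable's value, raising a clear error if unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

Since python-dotenv's `load_dotenv()` populates `os.environ` from the `.env` file, this check works the same whether a variable comes from the file or from the shell.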

Create and Activate the Virtual Environment

python -m venv .venv
# On Windows
.venv\Scripts\activate
# On macOS/Linux
source .venv/bin/activate

Install all these dependencies

pip install videosdk-agents videosdk-plugins-google videosdk-plugins-simli python-dotenv fastmcp
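To match the project structure above, the same dependencies can be listed in `requirements.txt` (versions omitted here; pin them with `pip freeze` once everything works):

```
videosdk-agents
videosdk-plugins-google
videosdk-plugins-simli
python-dotenv
fastmcp
```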

Create a main.py file

import asyncio
import sys
from pathlib import Path
import requests
from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob, MCPServerStdio
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from videosdk.plugins.simli import SimliAvatar, SimliConfig
from dotenv import load_dotenv
import os

load_dotenv(override=True)

def get_room_id(auth_token: str) -> str:
    url = "https://api.videosdk.live/v2/rooms"
    headers = {
        "Authorization": auth_token
    }
    response = requests.post(url, headers=headers)
    response.raise_for_status()
    return response.json()["roomId"]

class MyVoiceAgent(Agent):
    def __init__(self):
        mcp_script_joke = Path(__file__).parent / "mcp_joke.py"
        super().__init__(
            instructions="You are VideoSDK's AI Avatar Voice Agent with real-time capabilities. You are a helpful virtual assistant with a visual avatar that can tell jokes and help with other tasks in real-time.",
            mcp_servers=[
                MCPServerStdio(
                    executable_path=sys.executable,
                    process_arguments=[str(mcp_script_joke)],
                    session_timeout=30
                )
            ]
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello! I'm your real-time AI avatar assistant. How can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye! It was great talking with you!")


async def start_session(context: JobContext):
    # Initialize the Gemini Realtime model.
    # The API key is read from GOOGLE_API_KEY in .env, so there is no need
    # to pass api_key explicitly.
    model = GeminiRealtime(
        model="gemini-2.0-flash-live-001",
        config=GeminiLiveConfig(
            voice="Leda",  # Other options: Puck, Charon, Kore, Fenrir, Aoede, Orus, Zephyr
            response_modalities=["AUDIO"]
        )
    )

    # Initialize the Simli Avatar using the API key from .env
    simli_config = SimliConfig(
        apiKey=os.getenv("SIMLI_API_KEY"),
        faceId="0c2b8b04-5274-41f1-a21c-d5c98322efa9"  # default face
    )
    simli_avatar = SimliAvatar(config=simli_config)

    # Create pipeline with avatar
    pipeline = RealTimePipeline(
        model=model,
        avatar=simli_avatar
    )

    session = AgentSession(
        agent=MyVoiceAgent(),
        pipeline=pipeline
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    auth_token = os.getenv("VIDEOSDK_AUTH_TOKEN")
    room_id = get_room_id(auth_token)
    room_options = RoomOptions(
        room_id=room_id,
        auth_token=auth_token,
        name="Simli Avatar Realtime Agent",
        playground=True 
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start() 
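The `get_room_id` helper assumes the Rooms API responds with JSON containing a `roomId` field. A small sketch of that parsing with a clearer error when the field is absent (the `extract_room_id` name is hypothetical, factored out for illustration):

```python
def extract_room_id(payload: dict) -> str:
    """Pull roomId out of a Rooms API response, failing loudly if absent."""
    room_id = payload.get("roomId")
    if not room_id:
        raise ValueError(f"Unexpected Rooms API response: {payload!r}")
    return room_id
```

Failing with the full payload in the message makes token or permission problems obvious at startup instead of surfacing as a vague `KeyError`.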

Create a mcp_joke.py file

from fastmcp import FastMCP
import httpx

mcp = FastMCP("JokeServer")

@mcp.tool()
async def get_random_joke() -> str:
    """
    Fetch a random joke from the Official Joke API and format it for voice response.
    """
    JOKE_API_URL = "https://official-joke-api.appspot.com/random_joke"

    async with httpx.AsyncClient() as client:
        try:
            response = await client.get(JOKE_API_URL, timeout=10)
            response.raise_for_status()
            joke_data = response.json()

            setup = joke_data.get("setup", "Hmm... I forgot the joke setup!")
            punchline = joke_data.get("punchline", "Oh wait, I forgot the punchline too!")

            # Voice-friendly response (add pauses for TTS)
            return f"Here's a joke for you! {setup} ... {punchline}"

        except httpx.RequestError as e:
            return f"Oops! I couldn’t fetch a joke right now. Network error: {e}"
        except Exception as e:
            return f"Something went wrong while getting a joke: {e}"

if __name__ == "__main__":
    mcp.run(transport="stdio")
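The voice-friendly formatting can be factored out and tested without touching the network. This is a minimal sketch mirroring the logic above; the `format_joke` helper is a hypothetical extraction, not part of the server:

```python
def format_joke(joke_data: dict) -> str:
    """Turn a Joke API payload into a single voice-friendly string."""
    setup = joke_data.get("setup", "Hmm... I forgot the joke setup!")
    punchline = joke_data.get("punchline", "Oh wait, I forgot the punchline too!")
    # The "..." gives TTS a natural pause before the punchline lands.
    return f"Here's a joke for you! {setup} ... {punchline}"
```

Keeping formatting separate from the HTTP call means the defaults for missing fields can be verified with plain dicts, no API required.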

Run your agent

python main.py

We’d love to hear from you!

  • Have you tried creating your own AI-powered Simli Avatar with VideoSDK?
  • What challenges did you face while integrating real-time voice and a visual avatar?
  • Are you more excited about building expressive AI companions or functional voice assistants?
  • How do you see voice-interactive avatars changing the way people connect, learn, and play?

👉 Share your thoughts, roadblocks, or success stories in the comments or join our Discord community ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!
