DEV Community

Chaitrali Kakde
Create your own Talking AI Avatar Agent 🫨

AI voice agents are great at talking but not at connecting. Voice alone can't fully express emotion, empathy, or trust. In this blog, we'll explore how to give your AI a face and personality using VideoSDK's Simli Avatar integration, making every interaction more lifelike and engaging.

Let’s answer what most builders are wondering:

  • How can I make my avatar talk like a human?
  • What are avatars, and how do they actually work with voice agents?
  • How do I integrate an avatar into my VideoSDK pipeline?
  • What are the best practices for creating realistic, reliable avatars?

What is an AI Avatar?

An AI avatar is a real-time visual representation of your voice-based AI agent, showing facial expressions and mimicking natural movement. Using tools like Simli, avatars render in real time, giving your AI a relatable, human-like presence.

How it Works

  • The user provides voice input, which enters a pipeline.
  • Depending on the pipeline type, it follows one of two paths:
    • Realtime Pipeline – uses a Speech-to-Speech model for instant responses.
    • Cascading Pipeline – processes through multiple stages: VAD (Voice Activity Detection) → STT (Speech-to-Text) → LLM (Language Model) → TTS (Text-to-Speech).
  • The output audio is sent to the avatar for lip-synced playback.
  • Through a VideoSDK Room, the user and avatar communicate over an audio/video stream, enabling interactive, real-time conversations.
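The cascading path above can be sketched as a chain of plain functions. This is a minimal sketch of the data flow only: the `vad`, `stt`, `llm`, and `tts` stubs below are hypothetical stand-ins for the real plugin calls, not actual models.

```python
# Minimal sketch of the cascading pipeline flow: each stage is a callable,
# and the output of one stage feeds the next. All stubs are placeholders.

def vad(audio: bytes) -> bool:
    """Voice Activity Detection: decide whether the chunk contains speech."""
    return len(audio) > 0

def stt(audio: bytes) -> str:
    """Speech-to-Text stub: a real pipeline would call an STT plugin here."""
    return "tell me a joke"

def llm(prompt: str) -> str:
    """LLM stub: a real pipeline would call Gemini or another model here."""
    return f"Responding to: {prompt}"

def tts(text: str) -> bytes:
    """Text-to-Speech stub: a real pipeline returns synthesized audio."""
    return text.encode("utf-8")

def cascading_pipeline(audio: bytes):
    # Skip silent chunks, then run the STT -> LLM -> TTS chain.
    if not vad(audio):
        return None
    return tts(llm(stt(audio)))
```

In the actual SDK these stages are wired together for you by the pipeline classes; the sketch just makes the data flow explicit.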


Let's create a talking AI Avatar

Project structure

├── main.py              # Main agent implementation
├── requirements.txt     # Python dependencies
├── mcp_joke.py          # Joke MCP server
├── .env.example         # Environment variables template
└── README.md            # Project documentation

Pre-requisites

Create a .env file

VIDEOSDK_AUTH_TOKEN=""
SIMLI_API_KEY=""
GOOGLE_API_KEY=""
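It helps to fail fast when a key is missing rather than hitting a confusing error deep in the agent. Here is a minimal stdlib-only sketch; the `require_env` helper is a hypothetical name, not part of any SDK:

```python
import os

def require_env(name: str) -> str:
    """Return an environment variable's value, raising a clear error if unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

Since python-dotenv's `load_dotenv()` populates `os.environ` from the `.env` file, this check works the same whether a variable comes from the file or from the shell.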

Create and Activate the Virtual Environment

python -m venv .venv
# On Windows
.venv\Scripts\activate
# On macOS/Linux
source .venv/bin/activate

Install all these dependencies

pip install videosdk-agents videosdk-plugins-google videosdk-plugins-simli python-dotenv fastmcp
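To match the project structure above, the same dependencies can be listed in `requirements.txt` (versions omitted here; pin them with `pip freeze` once everything works):

```
videosdk-agents
videosdk-plugins-google
videosdk-plugins-simli
python-dotenv
fastmcp
```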

Create a main.py file

import asyncio
import sys
from pathlib import Path
import requests
from videosdk.agents import Agent, AgentSession, RealTimePipeline, JobContext, RoomOptions, WorkerJob, MCPServerStdio
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
from videosdk.plugins.simli import SimliAvatar, SimliConfig
from dotenv import load_dotenv
import os

load_dotenv(override=True)

def get_room_id(auth_token: str) -> str:
    url = "https://api.videosdk.live/v2/rooms"
    headers = {
        "Authorization": auth_token
    }
    response = requests.post(url, headers=headers)
    response.raise_for_status()
    return response.json()["roomId"]

class MyVoiceAgent(Agent):
    def __init__(self):
        mcp_script_joke = Path(__file__).parent / "mcp_joke.py"
        super().__init__(
            instructions="You are VideoSDK's AI Avatar Voice Agent with real-time capabilities. You are a helpful virtual assistant with a visual avatar that can tell jokes and help with other tasks in real-time.",
            mcp_servers=[
                MCPServerStdio(
                    executable_path=sys.executable,
                    process_arguments=[str(mcp_script_joke)],
                    session_timeout=30
                )
            ]
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello! I'm your real-time AI avatar assistant. How can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye! It was great talking with you!")


async def start_session(context: JobContext):
    # Initialize the Gemini Realtime model.
    # The API key is read from GOOGLE_API_KEY in .env, so there is no need
    # to pass api_key explicitly.
    model = GeminiRealtime(
        model="gemini-2.0-flash-live-001",
        config=GeminiLiveConfig(
            voice="Leda",  # Other options: Puck, Charon, Kore, Fenrir, Aoede, Orus, Zephyr
            response_modalities=["AUDIO"]
        )
    )

    # Initialize the Simli Avatar using the API key from .env
    simli_config = SimliConfig(
        apiKey=os.getenv("SIMLI_API_KEY"),
        faceId="0c2b8b04-5274-41f1-a21c-d5c98322efa9"  # default face
    )
    simli_avatar = SimliAvatar(config=simli_config)

    # Create pipeline with avatar
    pipeline = RealTimePipeline(
        model=model,
        avatar=simli_avatar
    )

    session = AgentSession(
        agent=MyVoiceAgent(),
        pipeline=pipeline
    )

    try:
        await context.connect()
        await session.start()
        await asyncio.Event().wait()
    finally:
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    auth_token = os.getenv("VIDEOSDK_AUTH_TOKEN")
    room_id = get_room_id(auth_token)
    room_options = RoomOptions(
        room_id=room_id,
        auth_token=auth_token,
        name="Simli Avatar Realtime Agent",
        playground=True 
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start() 
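The `get_room_id` helper assumes the Rooms API responds with JSON containing a `roomId` field. A small sketch of that parsing with a clearer error when the field is absent (the `extract_room_id` name is hypothetical, factored out for illustration):

```python
def extract_room_id(payload: dict) -> str:
    """Pull roomId out of a Rooms API response, failing loudly if absent."""
    room_id = payload.get("roomId")
    if not room_id:
        raise ValueError(f"Unexpected Rooms API response: {payload!r}")
    return room_id
```

Failing with the full payload in the message makes token or permission problems obvious at startup instead of surfacing as a vague `KeyError`.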

Create a mcp_joke.py file

from fastmcp import FastMCP
import httpx

mcp = FastMCP("JokeServer")

@mcp.tool()
async def get_random_joke() -> str:
    """
    Fetch a random joke from the Official Joke API and format it for voice response.
    """
    JOKE_API_URL = "https://official-joke-api.appspot.com/random_joke"

    async with httpx.AsyncClient() as client:
        try:
            response = await client.get(JOKE_API_URL, timeout=10)
            response.raise_for_status()
            joke_data = response.json()

            setup = joke_data.get("setup", "Hmm... I forgot the joke setup!")
            punchline = joke_data.get("punchline", "Oh wait, I forgot the punchline too!")

            # Voice-friendly response (add pauses for TTS)
            return f"Here's a joke for you! {setup} ... {punchline}"

        except httpx.RequestError as e:
            return f"Oops! I couldn’t fetch a joke right now. Network error: {e}"
        except Exception as e:
            return f"Something went wrong while getting a joke: {e}"

if __name__ == "__main__":
    mcp.run(transport="stdio")
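The voice-friendly formatting can be factored out and tested without touching the network. This is a minimal sketch mirroring the logic above; the `format_joke` helper is a hypothetical extraction, not part of the server:

```python
def format_joke(joke_data: dict) -> str:
    """Turn a Joke API payload into a single voice-friendly string."""
    setup = joke_data.get("setup", "Hmm... I forgot the joke setup!")
    punchline = joke_data.get("punchline", "Oh wait, I forgot the punchline too!")
    # The "..." gives TTS a natural pause before the punchline lands.
    return f"Here's a joke for you! {setup} ... {punchline}"
```

Keeping formatting separate from the HTTP call means the defaults for missing fields can be verified with plain dicts, no API required.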

Run your agent

python main.py

We’d love to hear from you!

  • Have you tried creating your own AI-powered Simli Avatar with VideoSDK?
  • What challenges did you face while integrating real-time voice and a visual avatar?
  • Are you more excited about building expressive AI companions or functional voice assistants?
  • How do you see voice-interactive avatars changing the way people connect, learn, and play?

👉 Share your thoughts, roadblocks, or success stories in the comments or join our Discord community ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!
