DEV Community

Chaitrali Kakde


How to build an AI Voice Agent with the Papla AI Text-to-Speech Model

Text-to-Speech (TTS) plays a crucial role in modern AI agents, especially those operating in telephony, customer support, voice bots, and conversational interfaces. If your agent needs to “talk,” you need a reliable and natural-sounding TTS system.

In this guide, we’ll explore how to integrate Papla Media’s TTS engine into the VideoSDK Agent Framework to generate smooth, high-quality speech responses from your AI agent.

Whether you're building voice assistants, IVR flows, or real-time conversational bots, Papla TTS is a strong addition to your pipeline.

Why Papla Media TTS?

Papla Media provides fast, high-quality TTS with:

  • Natural, expressive voices
  • Quick response times, ideal for real-time interactions
  • Simple configuration & flexible model selection
  • Seamless integration with VideoSDK Agent Pipelines

This makes it great for telephony agents, WhatsApp voice interactions, and AI-driven customer workflows.

Getting Started

1) Pre-requisites

  1. Python < 3.13 (PaplaTTS does not support Python 3.13 yet).
  2. A VideoSDK account to get your VIDEOSDK_TOKEN.
  3. A VideoSDK meeting ID.
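Before creating the virtual environment, you can confirm that your interpreter satisfies the first prerequisite with a quick check (a minimal sketch, not part of the SDK):

```python
import sys

# PaplaTTS requires Python < 3.13, per the prerequisite above.
compatible = sys.version_info < (3, 13)
print(f"Python {sys.version_info.major}.{sys.version_info.minor} "
      f"{'is' if compatible else 'is NOT'} compatible with PaplaTTS")
```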

2) Project Setup

Create a project folder with the following structure:

├── main.py          # Core logic of your AI agent
├── requirements.txt # Python dependencies
└── .env             # Store your API keys
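The structure above includes a requirements.txt; a minimal version, mirroring the pip install command used later in this guide, could look like this:

```
videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector,papla]
```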

3) Create and activate your virtual environment

For macOS:

python3.12 -m venv venv
source venv/bin/activate

For Windows:

python -m venv venv
venv\Scripts\activate

4) Install all the dependencies

VideoSDK provides Papla support as a separate plugin:

pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector,papla]"

5) Import the PaplaTTS Module

The plugin exposes the PaplaTTS class, which we can attach to a CascadingPipeline.

from videosdk.plugins.papla import PaplaTTS
from videosdk.agents import CascadingPipeline

6) Set Up Authentication

Papla requires an API key. You can generate this from your Papla Media dashboard.

Add it to your .env:

PAPLA_API_KEY=your-papla-media-api-key
DEEPGRAM_API_KEY=your-deepgram-api-key
OPENAI_API_KEY=your-openai-api-key
VIDEOSDK_AUTH_TOKEN=your-videosdk-auth-token

API keys: you can generate them from the Papla ↗, OpenAI ↗, ElevenLabs ↗, and VideoSDK Dashboard ↗ pages. Follow the VideoSDK guide to generate your auth token.
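Since the pipeline fails at runtime if any key is missing, it can help to validate the environment up front. The helper below is illustrative (not part of the VideoSDK SDK) and assumes the four variable names from the .env file above:

```python
import os

# Environment variables this guide's agent expects.
REQUIRED_KEYS = [
    "PAPLA_API_KEY",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "VIDEOSDK_AUTH_TOKEN",
]

def missing_keys(env=None):
    """Return the required key names that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]

if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        raise SystemExit(f"Missing API keys: {', '.join(absent)}")
    print("All API keys configured.")
```

Run this once before starting the agent to catch configuration mistakes early.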

Integrating Papla TTS into Your Agent

The integration is extremely straightforward.

Initialize the Papla TTS Engine

tts = PaplaTTS(
    # If PAPLA_API_KEY is set in your .env, you can omit this parameter
    api_key="your-papla-media-api-key",
)

Add TTS to the Agent Pipeline

pipeline = CascadingPipeline(
    tts=tts
)

Your agent pipeline now supports Papla TTS.

Every text response generated by your LLM will be passed to Papla Media and converted into speech before being sent to the user.

Papla TTS Configuration Options

You can customize the TTS behavior using the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_id | str | "papla_p1" | Selects the TTS model/voice |
| api_key | str | (none) | Your Papla API key (optional if set in .env) |
| base_url | str | "https://api.papla.media/v1" | Override only when pointing to a custom API endpoint |
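Putting these options together, a fully configured instance might look like the sketch below (all values are illustrative; omit api_key when PAPLA_API_KEY is set in your environment):

```python
from videosdk.plugins.papla import PaplaTTS

tts = PaplaTTS(
    model_id="papla_p1",                    # default model/voice
    api_key="your-papla-media-api-key",     # optional if PAPLA_API_KEY is in .env
    base_url="https://api.papla.media/v1",  # override only for a custom endpoint
)
```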

How Papla Fits in the CascadingPipeline

VideoSDK’s CascadingPipeline processes every message/event in a structured flow:

import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from typing import AsyncIterator
from videosdk.plugins.papla import PaplaTTS

## Pre-downloading the Turn Detector model
pre_download_model()

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")
async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=PaplaTTS(),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
     #  room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Run your file:

python main.py

You can also run it in console mode:

python main.py console

Conclusion

Papla Media TTS is a powerful, fast, and easy-to-integrate solution for generating natural speech inside VideoSDK’s Agent Framework. With just a few lines of code, your agent can transform text responses into lifelike audio perfect for telephony and other voice-first use cases.

If you're building conversational AI agents, this integration is one of the simplest ways to add high-quality TTS into your workflow.

Resources and Next Steps

  1. Our open-source framework for building real-time multimodal conversational AI agents: https://github.com/videosdk-live/agents
  2. Build a telephony agent using VideoSDK: https://docs.videosdk.live/ai_agents/ai-phone-agent-quick-start
  3. Build a WhatsApp agent using Papla AI: https://docs.videosdk.live/ai_agents/whatsapp-voice-agent-quick-start

💡 We’d love to hear from you!

  • Did you manage to set up your first AI voice agent in Python?
  • What challenges did you face while integrating the cascading pipeline?
  • Are you more interested in a cascading pipeline or a realtime pipeline?
  • How do you see AI voice assistants transforming customer experience in your business?

👉 Share your thoughts, roadblocks, or success stories in the comments or join our Discord community ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!
