DEV Community

Chaitrali Kakde


How to build an AI Voice Agent with the Papla AI Text-to-Speech Model

Text-to-Speech (TTS) plays a crucial role in modern AI agents, especially those operating in telephony, customer support, voice bots, and conversational interfaces. If your agent needs to “talk,” you need a reliable and natural-sounding TTS system.

In this guide, we’ll explore how to integrate Papla Media’s TTS engine into the VideoSDK Agent Framework to generate smooth, high-quality speech responses from your AI agent.

Whether you're building voice assistants, IVR flows, or real-time conversational bots, Papla TTS is a strong addition to your pipeline.

Why Papla Media TTS?

Papla Media provides fast, high-quality TTS with:

  • Natural, expressive voices
  • Quick response times, ideal for real-time interactions
  • Simple configuration & flexible model selection
  • Seamless integration with VideoSDK Agent Pipelines

This makes it great for telephony agents, WhatsApp voice interactions, and AI-driven customer workflows.

Getting Started

1) Pre-requisites

  1. Python < 3.13 (PaplaTTS does not support Python 3.13 yet).
  2. A VideoSDK account to get your VIDEOSDK_TOKEN.
  3. A VideoSDK meeting ID.
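Before creating the virtual environment, you can confirm that your interpreter satisfies the first prerequisite with a quick check (a minimal sketch, not part of the SDK):

```python
import sys

# PaplaTTS requires Python < 3.13, per the prerequisite above.
compatible = sys.version_info < (3, 13)
print(f"Python {sys.version_info.major}.{sys.version_info.minor} "
      f"{'is' if compatible else 'is NOT'} compatible with PaplaTTS")
```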

2) Project Setup

Create a project folder with the following structure:

├── main.py          # Core logic of your AI agent
├── requirements.txt # Python dependencies
└── .env             # Store your API keys
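The structure above includes a requirements.txt; a minimal version, mirroring the pip install command used later in this guide, could look like this:

```
videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector,papla]
```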

3) Create and activate your virtual environment

For macOS:

python3.12 -m venv venv
source venv/bin/activate

For Windows:

python -m venv venv
venv\Scripts\activate

4) Install all the dependencies

VideoSDK provides Papla support as a separate plugin:

pip install "videosdk-agents[deepgram,openai,elevenlabs,silero,turn_detector,papla]"

5) Import the PaplaTTS Module

The plugin exposes the PaplaTTS class, which we can attach to a CascadingPipeline.

from videosdk.plugins.papla import PaplaTTS
from videosdk.agents import CascadingPipeline

6) Set Up Authentication

Papla requires an API key. You can generate this from your Papla Media dashboard.

Add it to your .env:

PAPLA_API_KEY=your-papla-media-api-key
DEEPGRAM_API_KEY=your-deepgram-api-key
OPENAI_API_KEY=your-openai-api-key
VIDEOSDK_AUTH_TOKEN=your-videosdk-auth-token

API keys: you can generate them from the Papla ↗, OpenAI ↗, ElevenLabs ↗, and VideoSDK Dashboard ↗ pages. Follow the VideoSDK guide to generate your auth token.
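Since the pipeline fails at runtime if any key is missing, it can help to validate the environment up front. The helper below is illustrative (not part of the VideoSDK SDK) and assumes the four variable names from the .env file above:

```python
import os

# Environment variables this guide's agent expects.
REQUIRED_KEYS = [
    "PAPLA_API_KEY",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "VIDEOSDK_AUTH_TOKEN",
]

def missing_keys(env=None):
    """Return the required key names that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]

if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        raise SystemExit(f"Missing API keys: {', '.join(absent)}")
    print("All API keys configured.")
```

Run this once before starting the agent to catch configuration mistakes early.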

Integrating Papla TTS into Your Agent

The integration is extremely straightforward.

Initialize the Papla TTS Engine

tts = PaplaTTS(
    # If PAPLA_API_KEY is set in your .env, you can omit this parameter
    api_key="your-papla-media-api-key",
)

Add TTS to the Agent Pipeline

pipeline = CascadingPipeline(
    tts=tts
)

Your agent pipeline now supports Papla TTS.

Every text response generated by your LLM will be passed to Papla Media and converted into speech before being sent to the user.

Papla TTS Configuration Options

You can customize the TTS behavior using the following parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_id | str | "papla_p1" | Selects the TTS model/voice |
| api_key | str | (none) | Your Papla API key (optional if set in .env) |
| base_url | str | "https://api.papla.media/v1" | Override only when pointing to a custom API endpoint |
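Putting these options together, a fully configured instance might look like the sketch below (all values are illustrative; omit api_key when PAPLA_API_KEY is set in your environment):

```python
from videosdk.plugins.papla import PaplaTTS

tts = PaplaTTS(
    model_id="papla_p1",                    # default model/voice
    api_key="your-papla-media-api-key",     # optional if PAPLA_API_KEY is in .env
    base_url="https://api.papla.media/v1",  # override only for a custom endpoint
)
```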

How Papla Fits in the CascadingPipeline

VideoSDK’s CascadingPipeline processes every message/event in a structured flow:

import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import TurnDetector, pre_download_model
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from typing import AsyncIterator
from videosdk.plugins.papla import PaplaTTS

## Pre-downloading the Turn Detector model
pre_download_model()

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")
async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=PaplaTTS(),
        vad=SileroVAD(threshold=0.35),
        turn_detector=TurnDetector(threshold=0.8)
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
     #  room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()

Run your file:

python main.py

You can also run it in console mode:

python main.py console

Conclusion

Papla Media TTS is a powerful, fast, and easy-to-integrate solution for generating natural speech inside VideoSDK’s Agent Framework. With just a few lines of code, your agent can transform text responses into lifelike audio perfect for telephony and other voice-first use cases.

If you're building conversational AI agents, this integration is one of the simplest ways to add high-quality TTS into your workflow.

Resources and Next Steps

  1. Our open-source framework for building real-time multimodal conversational AI agents: https://github.com/videosdk-live/agents
  2. Build a telephony agent using VideoSDK: https://docs.videosdk.live/ai_agents/ai-phone-agent-quick-start
  3. Build a WhatsApp agent using Papla AI: https://docs.videosdk.live/ai_agents/whatsapp-voice-agent-quick-start

💡 We’d love to hear from you!

  • Did you manage to set up your first AI voice agent in Python?
  • What challenges did you face while integrating the cascading pipeline?
  • Are you more interested in a cascading pipeline or a realtime pipeline?
  • How do you see AI voice assistants transforming customer experience in your business?

👉 Share your thoughts, roadblocks, or success stories in the comments or join our Discord community ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!
