Tired of voice assistants that take forever to reply, making the chat feel slow and robotic? The trick to building a fast, natural AI voice agent is simple: don't depend on a single provider for everything.
In this guide, I’ll walk you through how I built a voice agent using a cascading architecture, where each part of the STT → LLM → TTS pipeline works together to create smooth, human-like conversations.
Here’s what I used:
- Deepgram for Speech-to-Text (STT) - converting voice into text
- OpenAI for the LLM (Large Language Model) - understanding and generating replies
- ElevenLabs for Text-to-Speech (TTS) - turning responses back into realistic voice
Each component plays its part in the pipeline. When connected in a cascading flow, they create a fast, flexible, and natural-sounding voice assistant.
By the end of this post, you’ll understand how a cascading architecture voice agent works, what tools to use, and how to connect the STT, LLM, and TTS components into one smooth AI experience.
Overview: How the Cascading System Works
The output of one step immediately drops down (cascades) to the next. For the agent to feel natural, this entire process must happen almost instantly (ideally under 1.5 seconds).
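To make the 1.5-second target concrete, here's a rough, illustrative latency budget for a single turn. The numbers are assumptions for the sketch, not measurements of any specific provider:

```python
# Rough, illustrative latency budget for one conversational turn.
# Numbers are assumptions, not benchmarks of any particular provider.
budget_ms = {
    "stt_final_transcript": 300,  # streaming STT finalizes the user's turn
    "llm_first_token": 500,       # LLM time-to-first-token
    "tts_first_audio": 300,       # TTS time-to-first-audio-byte
    "network_overhead": 200,      # round trips between services
}

total = sum(budget_ms.values())
print(f"estimated turn latency: {total} ms")  # 1300 ms, under the 1.5 s target
```

The key insight: because every stage streams, these costs largely overlap in practice, so the perceived latency is usually dominated by time-to-first-token and time-to-first-audio rather than the full sum.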
1. Speech-to-Text (STT)
Converts speech to text as you speak. Use streaming STT with turn detection for speed and accuracy.
2. Large Language Model (LLM)
Generates responses in real-time, streaming words to TTS immediately.
3. Text-to-Speech (TTS)
Turns text into a human-like voice instantly, with fast playback and natural tone.
4. Turn Detection
For turn detection, we'll use VideoSDK's specialized Namo Turn Detector model. It determines the precise moment a user has finished speaking, so the agent doesn't interrupt or pause unnecessarily.
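Conceptually, a turn detector emits a probability that the user's turn is over, and a threshold decides when the agent may respond. Here's a hypothetical sketch of that decision logic, not the actual NamoTurnDetectorV1 API:

```python
# Hypothetical illustration of threshold-based turn detection;
# the real NamoTurnDetectorV1 interface may differ.
def is_turn_complete(end_of_turn_prob: float, threshold: float = 0.7) -> bool:
    # Respond only when the model is confident the speaker has finished
    return end_of_turn_prob >= threshold

print(is_turn_complete(0.85))  # True: confident end of turn, agent may reply
print(is_turn_complete(0.40))  # False: likely a mid-sentence pause, keep listening
```

A higher threshold makes the agent more patient (fewer interruptions, slightly slower replies); a lower one makes it snappier but more likely to cut the user off.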
Read more about Namo turn detection.
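Before wiring up real providers, the cascading idea itself can be sketched with plain async generators. The stand-in functions below only simulate streaming output; in the real pipeline, Deepgram, OpenAI, and ElevenLabs streams take their place:

```python
import asyncio
from typing import AsyncIterator

# Illustrative sketch only: stand-in coroutines simulate the STT -> LLM -> TTS
# cascade. Real providers stream these chunks over websockets.

async def stt_stream() -> AsyncIterator[str]:
    # Pretend transcript fragments arriving as the user speaks
    for chunk in ["what is", "the weather", "today?"]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield chunk

async def llm_stream(prompt: str) -> AsyncIterator[str]:
    # Pretend LLM tokens streamed as they are generated
    # (a real LLM would condition on `prompt`)
    for token in ["It", " looks", " sunny", "."]:
        await asyncio.sleep(0)
        yield token

async def tts_play(text_chunks: AsyncIterator[str]) -> str:
    # Pretend TTS: consume text as it arrives instead of
    # waiting for the full reply before speaking
    spoken = []
    async for chunk in text_chunks:
        spoken.append(chunk)
    return "".join(spoken)

async def main() -> str:
    transcript = " ".join([c async for c in stt_stream()])
    return await tts_play(llm_stream(transcript))

print(asyncio.run(main()))
```

Each stage hands its output downstream as soon as it's available, which is exactly what makes the cascade feel instant compared to a request/response design.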
How to Build the Cascading Pipeline
Installation Prerequisites
Before you begin, ensure you have:
- A VideoSDK authentication token (generated from app.videosdk.live); follow the guide to generate a VideoSDK token
- A VideoSDK meeting ID (you can generate one using the Create Room API or through the VideoSDK dashboard)
- Python 3.12 or higher
💥 Create a virtual environment

For Windows:

```shell
python -m venv venv
venv\Scripts\activate
```

For macOS:

```shell
python3.12 -m venv venv
source venv/bin/activate
```
Install all dependencies:

```shell
pip install "videosdk-agents[deepgram,openai,elevenlabs,silero]"
```
Want to use a different provider? Check out our plugins for STT, LLM, and TTS.
Plugin Installation
Install additional plugins as needed:
```shell
# Install specific provider plugins
pip install videosdk-plugins-openai
pip install videosdk-plugins-elevenlabs
pip install videosdk-plugins-deepgram

# Install the Namo turn detection model
pip install "videosdk-plugins-turn-detector"
```
Environment setup

Create a `.env` file in your project root (no spaces around `=`):

```shell
DEEPGRAM_API_KEY="Your Deepgram API Key"
OPENAI_API_KEY="Your OpenAI API Key"
ELEVENLABS_API_KEY="Your ElevenLabs API Key"
VIDEOSDK_AUTH_TOKEN="Your VideoSDK Auth Token"
```
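It's worth failing fast if a key is missing before the agent starts. A minimal sketch, assuming the variables above are exported to the process environment (for example with python-dotenv's `load_dotenv()`):

```python
import os

# Check that all required keys are present before starting the agent.
REQUIRED = [
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "ELEVENLABS_API_KEY",
    "VIDEOSDK_AUTH_TOKEN",
]

missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    print(f"missing environment variables: {missing}")
else:
    print("all keys present")
```

This catches configuration mistakes at startup instead of as cryptic provider errors mid-call.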
API keys: get them from Deepgram ↗, OpenAI ↗, ElevenLabs ↗, and the VideoSDK Dashboard ↗. Follow the guide to generate a VideoSDK token.
Creating our AI Voice Agent
Create a `main.py` file:
```python
import asyncio
import os

from videosdk.agents import (
    Agent,
    AgentSession,
    CascadingPipeline,
    ConversationFlow,
    JobContext,
    RoomOptions,
    WorkerJob,
)
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Pre-download the English model to avoid delays on first run
pre_download_namo_turn_v1_model(language="en")

# Initialize the turn detector for English
turn_detector = NamoTurnDetectorV1(
    language="en",
    threshold=0.7,
)


class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant that can answer questions and help with tasks."
        )

    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")


async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create the cascading pipeline: STT -> LLM -> TTS, with VAD and turn detection
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=turn_detector,  # Add the turn detector to the cascading pipeline
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow,
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()


def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```
Get started quickly with the Quick Start Example for the VideoSDK AI Agent SDK: everything you need to build your first AI agent, fast.
You've now got the blueprint for building a voice agent that doesn't just talk, but responds instantly. By demanding streaming from your STT, LLM, and TTS providers and carefully managing the flow with the Turn Detection logic, you bypass the common lag issues that plague most voice assistants. This best-of-breed, cascading approach puts you in control, allowing you to future-proof your agent by swapping out a component (like upgrading your LLM) without rebuilding the entire system.
💡 We’d love to hear from you!
- Did you manage to set up your first AI voice agent in Python?
- What challenges did you face while integrating cascading pipeline?
- Are you more interested in cascading pipeline or realtime pipeline?
- How do you see AI voice assistants transforming customer experience in your business?
👉 Share your thoughts, roadblocks, or success stories in the comments or join our Discord community ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!