
Chaitrali Kakde


How I Built an AI Voice Agent by Combining OpenAI, Deepgram, and ElevenLabs for free

Tired of voice assistants that take forever to reply, making the chat feel slow and robotic? The trick to building a fast, natural AI voice agent is simple: don’t depend on one provider for everything.

In this guide, I’ll walk you through how I built a voice agent using VideoSDK, Deepgram, ElevenLabs, and OpenAI, where each part of the STT → LLM → TTS pipeline works together in a cascading pipeline to create smooth, human-like conversations.

Here’s what I used:

  • Deepgram for Speech-to-Text (STT) - converting voice into text
  • OpenAI for the LLM (Large Language Model) - understanding and generating replies
  • ElevenLabs for Text-to-Speech (TTS) - turning responses back into realistic voice

Each component plays its part in the pipeline. When connected in a cascading flow, they create a fast, flexible, and natural-sounding voice assistant.

By the end of this post, you’ll understand how a cascading voice agent architecture works, what tools to use, and how to connect the STT, LLM, and TTS components into one smooth AI experience.

Overview: How the Cascading System Works

In a cascading pipeline, the output of one step immediately drops down (cascades) to the next. For the agent to feel natural, the entire round trip must happen almost instantly (ideally under 1.5 seconds). The four stages below make up that cascade, and a small conceptual sketch follows them.

Cascading pipeline with VideoSDK

1. Speech-to-Text (STT)

Converts speech to text as you speak. Use streaming STT with turn detection for speed and accuracy.

2. Large Language Model (LLM)

Generates responses in real-time, streaming words to TTS immediately.

3. Text-to-Speech (TTS)

Turns text into a human-like voice instantly, with fast playback and natural tone.

4. Turn Detection

We use VideoSDK’s specialized Namo Turn Detector model. This component determines the precise moment a user has finished speaking, ensuring the agent doesn't interrupt or pause unnecessarily.

Read more about Namo turn detection
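
To make the cascade concrete, here is a minimal conceptual sketch of how streaming stages hand off to each other. This is not VideoSDK's internal implementation; fake_stt, fake_llm, and fake_tts are toy stand-ins that only simulate the timing, but they show why each stage starting before the previous one finishes is what keeps latency low.

import asyncio
from typing import AsyncIterator

# Toy cascade: each stage is an async generator that starts emitting
# before the previous stage has fully finished.

async def fake_stt() -> AsyncIterator[str]:
    # Partial transcripts arrive while the user is still talking.
    for partial in ["what is", "what is the weather", "what is the weather today"]:
        await asyncio.sleep(0.1)
        yield partial

async def fake_llm(transcript: str) -> AsyncIterator[str]:
    # Stream the reply token by token instead of waiting for the full answer.
    for token in ["It", " looks", " sunny", " today", "."]:
        await asyncio.sleep(0.05)
        yield token

async def fake_tts(text_chunks: AsyncIterator[str]) -> None:
    # Start "speaking" as soon as the first chunk of text arrives.
    async for chunk in text_chunks:
        print(f"speaking: {chunk!r}")

async def main() -> None:
    transcript = ""
    async for partial in fake_stt():
        transcript = partial  # in a real agent, turn detection decides when this is final
    await fake_tts(fake_llm(transcript))  # LLM tokens cascade straight into TTS

asyncio.run(main())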

How to Build the Cascading Pipeline

Installation Prerequisites

Before you begin, ensure you have:

  • A recent Python installation (the macOS example below uses Python 3.12)
  • API keys for Deepgram, OpenAI, ElevenLabs, and VideoSDK (see Environment setup below)

💥 Create a virtual environment

For Windows:

python -m venv venv
venv\Scripts\activate

For macOS:

python3.12 -m venv venv
source venv/bin/activate

Install all dependencies

pip install "videosdk-agents[deepgram,openai,elevenlabs,silero]"

Want to use a different provider? Check out our plugins for STT, LLM, and TTS.

Plugin Installation

Install additional plugins as needed:

# Install specific provider plugins
pip install videosdk-plugins-openai
pip install videosdk-plugins-elevenlabs
pip install videosdk-plugins-deepgram

# Install namo turn detection model
pip install "videosdk-plugins-turn-detector"
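
If you want to confirm everything installed correctly before writing any agent code, a quick import check is enough. The module paths below are the same ones used later in main.py:

# Quick sanity check that the provider plugins are importable.
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.turn_detector import NamoTurnDetectorV1

print("All plugins imported successfully.")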

Environment setup

Create a .env file in your project root with your API keys:

DEEPGRAM_API_KEY="Your Deepgram API Key"
OPENAI_API_KEY="Your OpenAI API Key"
ELEVENLABS_API_KEY="Your ElevenLabs API Key"
VIDEOSDK_AUTH_TOKEN="Your VideoSDK Auth Token"

API Keys - Get your API keys from Deepgram ↗, OpenAI ↗, ElevenLabs ↗, and the VideoSDK Dashboard ↗. Follow the guide to generate a VideoSDK auth token.
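
The keys above are read as environment variables. If you keep them in a .env file (as shown above), one option - my own convention, not something the SDK requires - is to load and validate them at the top of main.py with python-dotenv:

# Optional: load the .env file so the keys become environment variables.
# Requires: pip install python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

# Fail fast if a key is missing instead of erroring mid-call.
for key in ("DEEPGRAM_API_KEY", "OPENAI_API_KEY",
            "ELEVENLABS_API_KEY", "VIDEOSDK_AUTH_TOKEN"):
    if not os.getenv(key):
        raise RuntimeError(f"Missing environment variable: {key}")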

Creating our AI Voice Agent

Create a main.py file:

import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Pre-download the English model to avoid delays
pre_download_namo_turn_v1_model(language="en")

# Initialize the Turn Detector for English
turn_detector = NamoTurnDetectorV1(
  language="en",
  threshold=0.7
)

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")
    async def on_enter(self):
        await self.session.say("Hello! How can I help?")

    async def on_exit(self):
        await self.session.say("Goodbye!")

async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=turn_detector  # Add the Turn Detector to a cascading pipeline
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
     #  room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
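
With main.py saved and your keys in place, start the agent from the activated virtual environment. Because playground=True is set in RoomOptions, the SDK should give you a browser playground to talk to the agent (check your terminal output for the exact link):

python main.py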

Get started quickly with the Quick Start Example for the VideoSDK AI Agent SDK; it gives you everything you need to build your first AI agent fast.

You've now got the blueprint for building a voice agent that doesn't just talk, but responds instantly. By demanding streaming from your STT, LLM, and TTS providers and carefully managing the flow with the Turn Detection logic, you bypass the common lag issues that plague most voice assistants. This best-of-breed, cascading approach puts you in control, allowing you to future-proof your agent by swapping out a component (like upgrading your LLM) without rebuilding the entire system.
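
To make the swap-a-component point concrete: in this architecture only the CascadingPipeline constructor changes, and the rest of main.py stays the same. For example, assuming the OpenAILLM plugin accepts any supported OpenAI model name, switching models is a one-line edit:

# Same pipeline, different LLM - nothing else in main.py changes.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o-mini"),  # swapped model
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=turn_detector
)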

Our open-source framework for building real-time multimodal conversational AI agents: https://github.com/videosdk-live/agents

💡 We’d love to hear from you!

  • Did you manage to set up your first AI voice agent in Python?
  • What challenges did you face while integrating the cascading pipeline?
  • Are you more interested in a cascading pipeline or a realtime pipeline?
  • How do you see AI voice assistants transforming customer experience in your business?

👉 Share your thoughts, roadblocks, or success stories in the comments or join our Discord community ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!

Top comments (5)

Fluents

Nice breakdown. The cascading approach plus explicit turn detection is exactly what makes these agents feel snappy, and calling out the 1.5s target is spot on. Namo as a separate turn detector is a smart add because you can tune it independently of the STT endpointing.

A couple things that helped us reduce perceived latency in similar builds: 1) stream partials from the LLM but gate TTS on lightweight boundaries like a period or ~12–20 chars to avoid the choppy first syllable problem, 2) pre-warm and reuse WebSocket connections for Deepgram and ElevenLabs so you are not paying handshake costs on reconnects, and 3) align VAD and turn-detector thresholds so you do not ping-pong when users backchannel. With Deepgram, interim results plus endpointing tuning and punctuation on usually gives the TTS enough structure early. With ElevenLabs flash, chunk audio into ~200–300 ms pieces to keep the jitter buffer small while still sounding natural.

For barge-in, we’ve had good results with half-duplex plus soft ducking: if VAD crosses threshold for >150–200 ms, pause TTS immediately and flush any buffered text. That, plus a jitter buffer around 50–80 ms, keeps overlaps tolerable. Also worth adding a short-initial-response bias in the LLM prompt so the first turn lands quickly, then let the agent expand if the user keeps engaging.

At Fluents we run a similar cascading stack with BYOK across providers, and observability was the real unlock: per-stage timing (mic-to-STT, STT-to-first-token, first-token-to-first-audio) and 95th percentile tracking made regressions obvious. Curious what first-audio times you’re seeing in practice and how reliably you can cancel and restart TTS on interrupt within the VideoSDK pipeline. Have you experimented with dynamic Namo thresholds for noisy environments?

Chaitrali Kakde

Whether it’s a virtual assistant, an interactive tutor, or next-gen customer support, the possibilities are endless.

Krishna Champaneria

Great blog.
Helped a lot.
