Chaitrali Kakde
How I Built a Custom AI Voice Agent by Combining Deepgram, OpenAI, and ElevenLabs

Tired of voice assistants that take forever to reply, making the chat feel slow and robotic? The trick to building a fast, natural AI voice agent is simple: don’t depend on one provider for everything.

In this guide, I’ll walk you through how I built a voice agent using a cascading architecture, where the STT → LLM → TTS stages work together to create smooth, human-like conversations.

Here’s what I used:

  • Deepgram for Speech-to-Text (STT) - converting voice into text
  • OpenAI for the LLM (Large Language Model) - understanding and generating replies
  • ElevenLabs for Text-to-Speech (TTS) - turning responses back into realistic voice

Each component plays its part in the pipeline. When connected in a cascading flow, they create a fast, flexible, and natural-sounding voice assistant.

By the end of this post, you’ll understand how a cascading architecture voice agent works, what tools to use, and how to connect the STT, LLM, and TTS components into one smooth AI experience.

Overview: How the Cascading System Works

The output of one step immediately drops down (cascades) to the next. For the agent to feel natural, this entire process must happen almost instantly (ideally under 1.5 seconds).

Cascading pipeline with VideoSDK (diagram)

1. Speech-to-Text (STT)

Converts speech to text as you speak. Use streaming STT with turn detection for speed and accuracy.

2. Large Language Model (LLM)

Generates responses in real-time, streaming words to TTS immediately.

3. Text-to-Speech (TTS)

Turns text into a human-like voice instantly, with fast playback and natural tone.

4. Turn Detection

For turn detection we use VideoSDK’s specialized Namo Turn Detector model. This component is essential for determining the precise moment a user has finished speaking, ensuring the agent doesn’t interrupt or pause unnecessarily.

Read more about Namo turn detection.
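To make the cascade concrete, here is a minimal, illustrative sketch in plain Python. The stage functions are hypothetical stand-ins (not the VideoSDK API, which does the real wiring later in this post); the point is how streaming output from one stage feeds straight into the next.

import asyncio
from typing import AsyncIterator

# Hypothetical stage functions, for illustration only -- the real pipeline
# is wired up by VideoSDK's CascadingPipeline later in this post.

async def stt_stream() -> AsyncIterator[str]:
    # Pretend streaming STT: yields transcript fragments as the user speaks.
    for fragment in ["what's", "the weather", "like today?"]:
        await asyncio.sleep(0.1)  # stand-in for audio arriving over time
        yield fragment

async def llm_stream(prompt: str) -> AsyncIterator[str]:
    # Pretend streaming LLM: yields tokens as soon as they are generated.
    for token in ["It ", "looks ", "sunny ", "today."]:
        await asyncio.sleep(0.05)
        yield token

async def tts_speak(chunk: str) -> None:
    # Pretend TTS: synthesizes and plays each chunk as it arrives.
    print(f"[TTS playing] {chunk}")

async def cascade() -> None:
    # 1) STT: collect the user's turn (turn detection decides when it ends).
    transcript = " ".join([frag async for frag in stt_stream()])
    # 2) LLM -> 3) TTS: stream tokens straight into speech instead of waiting
    # for the full reply -- this overlap is what keeps perceived latency low.
    async for token in llm_stream(transcript):
        await tts_speak(token)

asyncio.run(cascade())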

How to Build the Cascading Pipeline

Installation Prerequisites

Before you begin, set up a virtual environment and install the dependencies below.

💥 Create a virtual environment

For Windows:

python -m venv venv
venv\Scripts\activate

For macOS / Linux:

python3.12 -m venv venv
source venv/bin/activate

Install all dependencies

pip install "videosdk-agents[deepgram,openai,elevenlabs,silero]"

Want to use a different provider? Check out our plugins for STT, LLM, and TTS.

Plugin Installation

Install additional plugins as needed:

# Install specific provider plugins
pip install videosdk-plugins-openai
pip install videosdk-plugins-elevenlabs
pip install videosdk-plugins-deepgram

# Install namo turn detection model
pip install "videosdk-plugins-turn-detector"
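Optionally, verify the installation with a quick import check. These are the same imports main.py will use below, so if this small script runs without errors, the plugins are installed correctly.

# check_install.py -- quick sanity check that all plugins import cleanly
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.turn_detector import NamoTurnDetectorV1

print("All plugins imported successfully.")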

Environment Setup

Create a .env file (or export these variables in your shell) with your API keys:

DEEPGRAM_API_KEY="Your Deepgram API Key"
OPENAI_API_KEY="Your OpenAI API Key"
ELEVENLABS_API_KEY="Your ElevenLabs API Key"
VIDEOSDK_AUTH_TOKEN="Your VideoSDK Auth Token"

API Keys: Get your API keys from Deepgram ↗, OpenAI ↗, ElevenLabs ↗, and the VideoSDK Dashboard ↗. Follow the guide to generate a VideoSDK auth token.
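If you keep these keys in a .env file, one common approach is to load them at startup with python-dotenv (an extra dependency, not part of the SDK install above); the plugins then pick the keys up from the environment, as the environment setup here assumes.

# load_keys.py -- load API keys from a .env file into the environment
# (assumes: pip install python-dotenv)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

REQUIRED = ("DEEPGRAM_API_KEY", "OPENAI_API_KEY", "ELEVENLABS_API_KEY", "VIDEOSDK_AUTH_TOKEN")
missing = [key for key in REQUIRED if not os.getenv(key)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")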

Creating our AI Voice Agent

Create a main.py file:

import asyncio, os
from videosdk.agents import Agent, AgentSession, CascadingPipeline, JobContext, RoomOptions, WorkerJob, ConversationFlow
from videosdk.plugins.silero import SileroVAD
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.turn_detector import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Pre-download the English model to avoid delays
pre_download_namo_turn_v1_model(language="en")

# Initialize the Turn Detector for English
turn_detector = NamoTurnDetectorV1(
  language="en",
  threshold=0.7
)

class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful voice assistant that can answer questions and help with tasks.")
    async def on_enter(self): await self.session.say("Hello! How can I help?")
    async def on_exit(self): await self.session.say("Goodbye!")
async def start_session(context: JobContext):
    # Create agent and conversation flow
    agent = MyVoiceAgent()
    conversation_flow = ConversationFlow(agent)

    # Create pipeline
    pipeline = CascadingPipeline(
        stt=DeepgramSTT(model="nova-2", language="en"),
        llm=OpenAILLM(model="gpt-4o"),
        tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
        vad=SileroVAD(threshold=0.35),
        turn_detector=turn_detector  # Add the Turn Detector to a cascading pipeline
    )

    session = AgentSession(
        agent=agent,
        pipeline=pipeline,
        conversation_flow=conversation_flow
    )

    try:
        await context.connect()
        await session.start()
        # Keep the session running until manually terminated
        await asyncio.Event().wait()
    finally:
        # Clean up resources when done
        await session.close()
        await context.shutdown()

def make_context() -> JobContext:
    room_options = RoomOptions(
     #  room_id="YOUR_MEETING_ID",  # Set to join a pre-created room; omit to auto-create
        name="VideoSDK Cascaded Agent",
        playground=True
    )

    return JobContext(room_options=room_options)

if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
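To run the agent, save the file and start it with python main.py. Since playground=True is set in RoomOptions, the SDK's playground mode should print a link in your terminal that you can open in the browser to start talking to the agent (behavior can vary by SDK version, so check the VideoSDK docs if you don't see it).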

Get started quickly with the Quick Start Example for the VideoSDK AI Agent SDK; it has everything you need to build your first AI agent fast.

You've now got the blueprint for building a voice agent that doesn't just talk, but responds instantly. By demanding streaming from your STT, LLM, and TTS providers and carefully managing the flow with the Turn Detection logic, you bypass the common lag issues that plague most voice assistants. This best-of-breed, cascading approach puts you in control, allowing you to future-proof your agent by swapping out a component (like upgrading your LLM) without rebuilding the entire system.
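For example, swapping the LLM is a one-line change to the pipeline from main.py above; the model name here is purely illustrative, so use whichever your provider offers.

# Same pipeline as before -- only the LLM line changes; STT, TTS, VAD and
# turn detection keep working untouched.
pipeline = CascadingPipeline(
    stt=DeepgramSTT(model="nova-2", language="en"),
    llm=OpenAILLM(model="gpt-4o-mini"),  # swapped-in model, illustrative only
    tts=ElevenLabsTTS(model="eleven_flash_v2_5"),
    vad=SileroVAD(threshold=0.35),
    turn_detector=turn_detector,
)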

💡 We’d love to hear from you!

  • Did you manage to set up your first AI voice agent in Python?
  • What challenges did you face while integrating the cascading pipeline?
  • Are you more interested in cascading pipelines or realtime pipelines?
  • How do you see AI voice assistants transforming customer experience in your business?

👉 Share your thoughts, roadblocks, or success stories in the comments or join our Discord community ↗. We’re excited to learn from your journey and help you build even better AI-powered communication tools!
