
Mart Schweiger

Posted on • Originally published at assemblyai.com

Build a Voice Agent with LiveKit

What is LiveKit?

LiveKit is an open-source real-time communication platform built on WebRTC infrastructure. It handles signaling, media routing, and scaling challenges so developers can focus on application logic rather than media plumbing.

LiveKit Agents is the framework layer specifically designed for AI-powered voice and video agents. It manages orchestration between speech-to-text, language model, and text-to-speech services.

The Voice Agent Pipeline

Voice agents follow a three-step cascade:

  1. Speech-to-text (STT): user speech is transcribed in real time
  2. LLM: the transcript is fed to a language model, which generates a response
  3. Text-to-speech (TTS): the response is converted back to audio

Flow: WebRTC → LiveKit Cloud → AssemblyAI Universal-3 Pro Streaming → OpenAI GPT-4o → Cartesia Sonic → Back to LiveKit room
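The cascade above can be sketched as three composed functions. The stubs below are placeholders (the real services stream audio and tokens asynchronously), but they make the data flow concrete:

```python
def speech_to_text(audio: bytes) -> str:
    # Placeholder: a real STT service streams partial transcripts.
    return audio.decode("utf-8")

def llm_reply(transcript: str) -> str:
    # Placeholder: a real LLM generates a response token by token.
    return f"You said: {transcript}"

def text_to_speech(text: str) -> bytes:
    # Placeholder: a real TTS engine returns synthesized audio frames.
    return text.encode("utf-8")

def voice_agent_turn(audio_in: bytes) -> bytes:
    transcript = speech_to_text(audio_in)  # step 1: STT
    reply = llm_reply(transcript)          # step 2: LLM
    return text_to_speech(reply)           # step 3: TTS
```

In a real deployment each stage streams into the next rather than waiting for complete input, which is exactly the orchestration LiveKit Agents handles for you.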

Why Universal-3 Pro Streaming for STT?

Accuracy

Benchmarks from Hamming.ai across more than 4M production calls:

| Metric | Universal-3 Pro Streaming | Deepgram Nova-3 |
| --- | --- | --- |
| P50 latency | 307 ms | 516 ms |
| P99 latency | 1,012 ms | 1,907 ms |
| Word error rate | 8.14% | 9.87% |
| Alphanumeric accuracy | 21% fewer errors | baseline |

Neural Turn Detection

Most models use voice activity detection (VAD) — silence-based turn endings that trigger on mid-sentence pauses. Universal-3 Pro Streaming uses neural turn detection, combining acoustic and linguistic signals to distinguish mid-sentence breathing from actual end-of-turn moments.

Result: Faster response times, fewer false triggers, more natural conversations. Costs $0.45/hour and supports six languages.

Prerequisites

  • Python 3.11+
  • Microphone and speakers (for local testing)
  • API keys for: AssemblyAI, LiveKit Cloud, OpenAI, Cartesia

Step 1: Installation

python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate

pip install "livekit-agents[assemblyai,silero,codecs]~=1.0" python-dotenv
pip install "livekit-agents[openai,cartesia]~=1.0"

Note: Universal-3 Pro Streaming support requires livekit-agents@1.4.4 or newer.

Step 2: Configure API Keys

LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret
ASSEMBLYAI_API_KEY=your_assemblyai_key
OPENAI_API_KEY=your_openai_key
CARTESIA_API_KEY=your_cartesia_key

Step 3: Build the Agent

Create agent.py:

from dotenv import load_dotenv
from livekit import agents
from livekit.agents import AgentSession, Agent
from livekit.plugins import (
    assemblyai,
    cartesia,
    openai,
    silero,
)

load_dotenv()


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice AI assistant.")


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        stt=assemblyai.STT(
            model="u3-rt-pro",
            min_turn_silence=100,
            max_turn_silence=1000,
            vad_threshold=0.3,
        ),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(
            activation_threshold=0.3,
        ),
        turn_detection="stt",
        min_endpointing_delay=0,
    )

    await session.start(
        room=ctx.room,
        agent=Assistant(),
    )

    await session.generate_reply(
        instructions="Greet the user and offer your assistance."
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))

Component Breakdown

turn_detection="stt" — Uses Universal-3 Pro Streaming's neural turn detection instead of LiveKit's default detector. Set min_endpointing_delay=0 so LiveKit's own endpointing delay doesn't stack on top of the model's.
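To see why the zeroed delay matters, consider a rough latency budget. In a cascaded stack, delays add up before the user hears any audio. The numbers below are illustrative (only the 307 ms P50 figure comes from the benchmark table above):

```python
def perceived_turn_latency(stt_endpoint_ms: int, endpointing_delay_ms: int,
                           llm_ttft_ms: int, tts_ttfb_ms: int) -> int:
    # Each stage in the cascade must finish (or start streaming) before the
    # next begins, so these delays are additive.
    return stt_endpoint_ms + endpointing_delay_ms + llm_ttft_ms + tts_ttfb_ms

# With an extra endpointing delay stacked on top of neural turn detection:
stacked = perceived_turn_latency(307, 500, 350, 150)   # 1307 ms
# With min_endpointing_delay=0, turn detection alone decides the endpoint:
lean = perceived_turn_latency(307, 0, 350, 150)        # 807 ms
```

Half a second shaved off every turn is the difference between a conversation that feels live and one that feels laggy.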

Step 4: Run It

Console Mode (No LiveKit connection)

python agent.py console

Dev Mode (LiveKit Cloud connection)

python agent.py dev

Open agents-playground.livekit.io, enter your LiveKit credentials, and converse through the browser.

Tuning Turn Detection

stt=assemblyai.STT(
    model="u3-rt-pro",
    end_of_turn_confidence_threshold=0.4,  # 0.0-1.0
    min_turn_silence=300,                  # milliseconds
    max_turn_silence=1200,                 # milliseconds
)
  • end_of_turn_confidence_threshold: Lower = faster response; higher = fewer false triggers. Use 0.6 for noisy call centers.
  • min_turn_silence: Lower to 200 for rapid back-and-forth; keep at 300 for general conversation.
  • max_turn_silence: Raise to 2000 for healthcare or deliberate speakers.
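One way to picture how these three knobs interact is a toy decision rule. This is an illustration of the tradeoffs, not AssemblyAI's actual algorithm:

```python
def end_of_turn(confidence: float, silence_ms: int,
                threshold: float = 0.4,
                min_silence: int = 300,
                max_silence: int = 1200) -> bool:
    # Illustrative only: never end a turn before min_silence, always end
    # after max_silence, and in between defer to the model's
    # end-of-turn confidence.
    if silence_ms < min_silence:
        return False
    if silence_ms >= max_silence:
        return True
    return confidence >= threshold

assert end_of_turn(0.9, 100) is False   # too little silence, even if confident
assert end_of_turn(0.1, 1500) is True   # hard cap reached, hand over the turn
assert end_of_turn(0.5, 600) is True    # confident enough inside the window
assert end_of_turn(0.2, 600) is False   # sounds like hesitation, keep listening
```

Raising the threshold or the silence bounds trades responsiveness for fewer interruptions, which is why noisy call centers and deliberate speakers get different recommended values.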

Mid-session updates work without reconnecting:

await session.stt.update_options(max_turn_silence=3000)

Enabling Keyterm Prompting

Boost recognition accuracy for domain-specific vocabulary:

await session.stt.update_options(
    keyterms_prompt=["AssemblyAI", "Universal-3 Pro", "LiveKit"]
)

Supports up to 1,000 terms, each up to 50 characters. Takes effect immediately without restart.
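If you build keyterm lists dynamically (say, from a CRM or product catalog), it may be worth enforcing those limits client-side before calling update_options. A small hypothetical helper:

```python
def validate_keyterms(keyterms: list[str],
                      max_terms: int = 1000,
                      max_len: int = 50) -> list[str]:
    # Enforce the documented limits (1,000 terms, 50 chars each)
    # before sending the list to the STT session.
    if len(keyterms) > max_terms:
        raise ValueError(f"too many keyterms: {len(keyterms)} > {max_terms}")
    for term in keyterms:
        if len(term) > max_len:
            raise ValueError(f"keyterm over {max_len} chars: {term!r}")
    return keyterms
```

Failing fast in your own code gives clearer errors than waiting for the API to reject an oversized list mid-session.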

Swapping Components

Swap LLM to Claude:

from livekit.plugins import anthropic
session = AgentSession(
    llm=anthropic.LLM(model="claude-sonnet-4-5"),
    ...
)

Swap TTS to ElevenLabs:

from livekit.plugins import elevenlabs
tts=elevenlabs.TTS(voice_id="your-voice-id")

Next Steps

  • Deploy to LiveKit Cloud: Run python agent.py start for persistent worker deployment
  • Add tool calling: Enable function calling through the LLM layer for lookups and actions
  • Enable speaker diarization: Real-time speaker identification for multi-party conversations
  • Add telephony: SIP support via Telnyx or Twilio integration
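As a starting point for tool calling, livekit-agents lets you expose agent methods as LLM tools with a decorator. A minimal sketch — the weather lookup is a stand-in, and you should check the livekit-agents docs for the exact decorator signature in your installed version:

```python
from livekit.agents import Agent, function_tool


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice AI assistant.")

    @function_tool()
    async def lookup_weather(self, location: str) -> str:
        """Look up the current weather for a location."""
        # Stand-in for a real API call; the LLM decides when to invoke this
        # based on the docstring and the conversation.
        return f"It is sunny in {location}."
```

The docstring doubles as the tool description the LLM sees, so write it the way you would a prompt.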

FAQs

What is LiveKit Agents?
LiveKit Agents is a framework for building AI-powered voice and video agents. It orchestrates between STT, LLM, and TTS components, handling real-time audio routing and turn flow while you define agent logic.

Universal-3 Pro Streaming vs. Universal-Streaming?
Universal-3 Pro Streaming (u3-rt-pro) targets production voice agent workflows with neural turn detection and superior accuracy on structured entities. Universal-Streaming is faster and cheaper ($0.15/hr) but English-only and VAD-based.

Best STT API for LiveKit voice agents?
AssemblyAI's Universal-3 Pro Streaming. Native one-line integration, 307ms P50 latency, neural turn detection, and 8.14% word error rate across production benchmarks.

How does neural turn detection work?
Combines acoustic signals (voice energy, pitch, cadence) and linguistic signals (sentence completion, punctuation patterns) to determine turn end — not just silence-based detection.

Pricing?
AssemblyAI Universal-3 Pro Streaming: $0.45/hour. Budget separately for LLM (OpenAI) and TTS (Cartesia) providers.
