
Mart Schweiger

Posted on • Originally published at assemblyai.com

Build a Voice Agent with LiveKit

What is LiveKit?

LiveKit is an open-source real-time communication platform built on WebRTC infrastructure. It handles signaling, media routing, and scaling challenges so developers can focus on application logic rather than media plumbing.

LiveKit Agents is the framework layer specifically designed for AI-powered voice and video agents. It manages orchestration between speech-to-text, language model, and text-to-speech services.

The Voice Agent Pipeline

Voice agents follow a three-step cascade:

  1. Speech-to-text (STT): user speech is transcribed in real time
  2. LLM: the transcript is fed to a language model, which generates a response
  3. Text-to-speech (TTS): the response is converted back to audio

Flow: WebRTC → LiveKit Cloud → AssemblyAI Universal-3 Pro Streaming → OpenAI GPT-4o → Cartesia Sonic → Back to LiveKit room
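The cascade above can be sketched as three composed functions. The stubs below are placeholders (the real services stream audio and tokens asynchronously), but they make the data flow concrete:

```python
def speech_to_text(audio: bytes) -> str:
    # Placeholder: a real STT service streams partial transcripts.
    return audio.decode("utf-8")

def llm_reply(transcript: str) -> str:
    # Placeholder: a real LLM generates a response token by token.
    return f"You said: {transcript}"

def text_to_speech(text: str) -> bytes:
    # Placeholder: a real TTS engine returns synthesized audio frames.
    return text.encode("utf-8")

def voice_agent_turn(audio_in: bytes) -> bytes:
    transcript = speech_to_text(audio_in)  # step 1: STT
    reply = llm_reply(transcript)          # step 2: LLM
    return text_to_speech(reply)           # step 3: TTS
```

In a real deployment each stage streams into the next rather than waiting for complete input, which is exactly the orchestration LiveKit Agents handles for you.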

Why Universal-3 Pro Streaming for STT?

Accuracy

Benchmarks from Hamming.ai across more than 4M production calls:

| Metric | Universal-3 Pro Streaming | Deepgram Nova-3 |
| --- | --- | --- |
| P50 latency | 307 ms | 516 ms |
| P99 latency | 1,012 ms | 1,907 ms |
| Word error rate | 8.14% | 9.87% |
| Alphanumeric accuracy | 21% fewer errors | baseline |

Neural Turn Detection

Most models use voice activity detection (VAD) — silence-based turn endings that trigger on mid-sentence pauses. Universal-3 Pro Streaming uses neural turn detection, combining acoustic and linguistic signals to distinguish mid-sentence breathing from actual end-of-turn moments.

Result: Faster response times, fewer false triggers, more natural conversations. Costs $0.45/hour and supports six languages.

Prerequisites

  • Python 3.11+
  • Microphone and speakers (for local testing)
  • API keys for: AssemblyAI, LiveKit Cloud, OpenAI, Cartesia

Step 1: Installation

python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate

pip install "livekit-agents[assemblyai,silero,codecs]~=1.0" python-dotenv
pip install "livekit-agents[openai,cartesia]~=1.0"

Note: Universal-3 Pro Streaming support requires livekit-agents@1.4.4 or newer.

Step 2: Configure API Keys

LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your_livekit_api_key
LIVEKIT_API_SECRET=your_livekit_api_secret
ASSEMBLYAI_API_KEY=your_assemblyai_key
OPENAI_API_KEY=your_openai_key
CARTESIA_API_KEY=your_cartesia_key

Step 3: Build the Agent

Create agent.py:

from dotenv import load_dotenv
from livekit import agents
from livekit.agents import AgentSession, Agent
from livekit.plugins import (
    assemblyai,
    cartesia,
    openai,
    silero,
)

load_dotenv()


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice AI assistant.")


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        stt=assemblyai.STT(
            model="u3-rt-pro",
            min_turn_silence=100,
            max_turn_silence=1000,
            vad_threshold=0.3,
        ),
        llm=openai.LLM(model="gpt-4o"),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(
            activation_threshold=0.3,
        ),
        turn_detection="stt",
        min_endpointing_delay=0,
    )

    await session.start(
        room=ctx.room,
        agent=Assistant(),
    )

    await session.generate_reply(
        instructions="Greet the user and offer your assistance."
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))

Component Breakdown

turn_detection="stt" — Uses Universal-3 Pro Streaming's neural turn detection instead of LiveKit's default detector. Set min_endpointing_delay=0 so LiveKit's own endpointing delay doesn't stack on top of the model's.
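To see why the zeroed delay matters, consider a rough latency budget. In a cascaded stack, delays add up before the user hears any audio. The numbers below are illustrative (only the 307 ms P50 figure comes from the benchmark table above):

```python
def perceived_turn_latency(stt_endpoint_ms: int, endpointing_delay_ms: int,
                           llm_ttft_ms: int, tts_ttfb_ms: int) -> int:
    # Each stage in the cascade must finish (or start streaming) before the
    # next begins, so these delays are additive.
    return stt_endpoint_ms + endpointing_delay_ms + llm_ttft_ms + tts_ttfb_ms

# With an extra endpointing delay stacked on top of neural turn detection:
stacked = perceived_turn_latency(307, 500, 350, 150)   # 1307 ms
# With min_endpointing_delay=0, turn detection alone decides the endpoint:
lean = perceived_turn_latency(307, 0, 350, 150)        # 807 ms
```

Half a second shaved off every turn is the difference between a conversation that feels live and one that feels laggy.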

Step 4: Run It

Console Mode (No LiveKit connection)

python agent.py console

Dev Mode (LiveKit Cloud connection)

python agent.py dev

Open agents-playground.livekit.io, enter your LiveKit credentials, and converse through the browser.

Tuning Turn Detection

stt=assemblyai.STT(
    model="u3-rt-pro",
    end_of_turn_confidence_threshold=0.4,  # 0.0-1.0
    min_turn_silence=300,                  # milliseconds
    max_turn_silence=1200,                 # milliseconds
)
  • end_of_turn_confidence_threshold: Lower = faster response; higher = fewer false triggers. Use 0.6 for noisy call centers.
  • min_turn_silence: Lower to 200 for rapid back-and-forth; keep at 300 for general conversation.
  • max_turn_silence: Raise to 2000 for healthcare or deliberate speakers.
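One way to picture how these three knobs interact is a toy decision rule. This is an illustration of the tradeoffs, not AssemblyAI's actual algorithm:

```python
def end_of_turn(confidence: float, silence_ms: int,
                threshold: float = 0.4,
                min_silence: int = 300,
                max_silence: int = 1200) -> bool:
    # Illustrative only: never end a turn before min_silence, always end
    # after max_silence, and in between defer to the model's
    # end-of-turn confidence.
    if silence_ms < min_silence:
        return False
    if silence_ms >= max_silence:
        return True
    return confidence >= threshold

assert end_of_turn(0.9, 100) is False   # too little silence, even if confident
assert end_of_turn(0.1, 1500) is True   # hard cap reached, hand over the turn
assert end_of_turn(0.5, 600) is True    # confident enough inside the window
assert end_of_turn(0.2, 600) is False   # sounds like hesitation, keep listening
```

Raising the threshold or the silence bounds trades responsiveness for fewer interruptions, which is why noisy call centers and deliberate speakers get different recommended values.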

Mid-session updates work without reconnecting:

await session.stt.update_options(max_turn_silence=3000)

Enabling Keyterm Prompting

Boost recognition accuracy for domain-specific vocabulary:

await session.stt.update_options(
    keyterms_prompt=["AssemblyAI", "Universal-3 Pro", "LiveKit"]
)

Supports up to 1,000 terms, each up to 50 characters. Takes effect immediately without restart.
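If you build keyterm lists dynamically (say, from a CRM or product catalog), it may be worth enforcing those limits client-side before calling update_options. A small hypothetical helper:

```python
def validate_keyterms(keyterms: list[str],
                      max_terms: int = 1000,
                      max_len: int = 50) -> list[str]:
    # Enforce the documented limits (1,000 terms, 50 chars each)
    # before sending the list to the STT session.
    if len(keyterms) > max_terms:
        raise ValueError(f"too many keyterms: {len(keyterms)} > {max_terms}")
    for term in keyterms:
        if len(term) > max_len:
            raise ValueError(f"keyterm over {max_len} chars: {term!r}")
    return keyterms
```

Failing fast in your own code gives clearer errors than waiting for the API to reject an oversized list mid-session.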

Swapping Components

Swap LLM to Claude:

from livekit.plugins import anthropic
session = AgentSession(
    llm=anthropic.LLM(model="claude-sonnet-4-5"),
    ...
)

Swap TTS to ElevenLabs:

from livekit.plugins import elevenlabs
tts=elevenlabs.TTS(voice_id="your-voice-id")

Next Steps

  • Deploy to LiveKit Cloud: Run python agent.py start for persistent worker deployment
  • Add tool calling: Enable function calling through the LLM layer for lookups and actions
  • Enable speaker diarization: Real-time speaker identification for multi-party conversations
  • Add telephony: SIP support via Telnyx or Twilio integration
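As a starting point for tool calling, livekit-agents lets you expose agent methods as LLM tools with a decorator. A minimal sketch — the weather lookup is a stand-in, and you should check the livekit-agents docs for the exact decorator signature in your installed version:

```python
from livekit.agents import Agent, function_tool


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice AI assistant.")

    @function_tool()
    async def lookup_weather(self, location: str) -> str:
        """Look up the current weather for a location."""
        # Stand-in for a real API call; the LLM decides when to invoke this
        # based on the docstring and the conversation.
        return f"It is sunny in {location}."
```

The docstring doubles as the tool description the LLM sees, so write it the way you would a prompt.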

FAQs

What is LiveKit Agents?
LiveKit Agents is a framework for building AI-powered voice and video agents. It orchestrates between STT, LLM, and TTS components, handling real-time audio routing and turn flow while you define agent logic.

Universal-3 Pro Streaming vs. Universal-Streaming?
Universal-3 Pro Streaming (u3-rt-pro) targets production voice agent workflows with neural turn detection and superior accuracy on structured entities. Universal-Streaming is faster and cheaper ($0.15/hr) but English-only and VAD-based.

Best STT API for LiveKit voice agents?
AssemblyAI's Universal-3 Pro Streaming. Native one-line integration, 307ms P50 latency, neural turn detection, and 8.14% word error rate across production benchmarks.

How does neural turn detection work?
Combines acoustic signals (voice energy, pitch, cadence) and linguistic signals (sentence completion, punctuation patterns) to determine turn end — not just silence-based detection.

Pricing?
AssemblyAI Universal-3 Pro Streaming: $0.45/hour. Budget separately for LLM (OpenAI) and TTS (Cartesia) providers.
