<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Roman Piacquadio</title>
    <description>The latest articles on DEV Community by Roman Piacquadio (@roman_piacquadio).</description>
    <link>https://dev.to/roman_piacquadio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2981858%2F0d97658c-b5e9-4145-941f-81e97b82fa1e.jpg</url>
      <title>DEV Community: Roman Piacquadio</title>
      <link>https://dev.to/roman_piacquadio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/roman_piacquadio"/>
    <language>en</language>
    <item>
      <title>How Much Does It Really Cost to Run a Voice-AI Agent at Scale?</title>
      <dc:creator>Roman Piacquadio</dc:creator>
      <pubDate>Tue, 20 May 2025 17:00:13 +0000</pubDate>
      <link>https://dev.to/cloudx/how-much-does-it-really-cost-to-run-a-voice-ai-agent-at-scale-8en</link>
      <guid>https://dev.to/cloudx/how-much-does-it-really-cost-to-run-a-voice-ai-agent-at-scale-8en</guid>
      <description>&lt;h2&gt;
  
  
  1) Why Voice Automation Is Worth Investigating (Even If You’re Not Replacing Humans)
&lt;/h2&gt;

&lt;p&gt;Voice automation has made significant progress in recent years, powered by improvements in transcription, real-time audio routing, and large language models. What was once a clunky, robotic experience is now capable of holding fluent, natural-sounding conversations with real people—across a variety of use cases.&lt;/p&gt;

&lt;p&gt;This doesn’t mean we’re replacing human agents. Quite the opposite: automation lets us offload the &lt;em&gt;repetitive, high-frequency, low-complexity tasks&lt;/em&gt; that tend to burn out human teams, and instead reserve human attention for edge cases, escalations, and creative problem solving.&lt;/p&gt;

&lt;p&gt;Whether you're handling inbound customer service, qualifying outbound leads, or proactively following up on time-sensitive actions, a well-orchestrated voice-AI pipeline can free up valuable resources—&lt;em&gt;if the economics make sense&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That last part is key. Many developers assume that using AI to automate voice interactions is inherently cost-effective. But is it really cheaper than staffing a team? How much do you actually pay per call once you add up every component: speech recognition, synthesis, telephony, model inference, and orchestration?&lt;/p&gt;

&lt;p&gt;This article takes a grounded approach to that question. We'll break down a full-stack voice-AI system—from SIP trunk to final response—and price each piece out based on real-world usage. To make it concrete, we'll walk through a common use case: outbound phone calls of around 3 minutes in duration. But the same framework applies to inbound routing, reminders, surveys, or any other automated voice interaction.&lt;/p&gt;

&lt;p&gt;Let’s dig into how it all connects—and what it really costs to make it work at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  2) System Architecture: How a Voice-AI Pipeline Works End-to-End
&lt;/h2&gt;

&lt;p&gt;Before jumping into costs, it’s important to understand how the system is architected. At a high level, a voice-based AI interaction consists of several components working together in real time to process speech, generate responses, and keep the conversation flowing naturally.&lt;/p&gt;

&lt;p&gt;Here’s a simplified view of the architecture:&lt;/p&gt;

&lt;p&gt;Caller (PSTN or SIP)&lt;br&gt;
⬇️&lt;br&gt;
SIP Trunk (Twilio or Telnyx)&lt;br&gt;
⬇️&lt;br&gt;
LiveKit (Voice orchestration &amp;amp; media routing)&lt;br&gt;
⬇️&lt;br&gt;
Speech-to-Text (Deepgram)&lt;br&gt;
⬇️&lt;br&gt;
Language Model (OpenAI GPT-4.1 mini)&lt;br&gt;
⬇️&lt;br&gt;
Text-to-Speech (ElevenLabs or Cartesia)&lt;br&gt;
⬇️&lt;br&gt;
LiveKit (returns audio back)&lt;br&gt;
⬇️&lt;br&gt;
Caller&lt;/p&gt;

&lt;p&gt;Each component has a distinct role:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SIP Trunk (Twilio or Telnyx):&lt;/strong&gt; Bridges public phone lines to our backend system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiveKit:&lt;/strong&gt; Manages the real-time audio streams and orchestrates audio I/O between services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deepgram:&lt;/strong&gt; Transcribes what the user says into text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4.1 mini:&lt;/strong&gt; Interprets the message and generates a textual response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text-to-Speech (TTS):&lt;/strong&gt; Converts the AI response into natural-sounding speech.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiveKit (again):&lt;/strong&gt; Sends the generated audio back to the caller, closing the loop.&lt;/li&gt;
&lt;/ul&gt;
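
&lt;p&gt;The loop these components form can be sketched in a few lines of Python. The function names below are &lt;strong&gt;illustrative stubs only&lt;/strong&gt; (the real providers expose streaming SDKs with different interfaces), but the data flow they model is the STT → LLM → TTS cycle described above.&lt;/p&gt;

```python
# Sketch of one conversational turn: STT, then LLM, then TTS.
# All three helpers are hypothetical stand-ins, NOT real SDK calls.

def transcribe(audio_chunk: bytes) -> str:
    """Stand-in for Deepgram streaming speech-to-text."""
    return "I'd like to confirm my appointment."

def generate_reply(system_prompt: str, history: list, user_text: str) -> str:
    """Stand-in for a GPT-4.1 mini chat completion."""
    return "You're confirmed for tomorrow at 10 AM."

def synthesize(text: str) -> bytes:
    """Stand-in for ElevenLabs / Cartesia text-to-speech."""
    return text.encode("utf-8")

def handle_turn(audio_chunk: bytes, history: list, system_prompt: str) -> bytes:
    """One turn of the pipeline: transcribe, respond, speak."""
    user_text = transcribe(audio_chunk)
    reply = generate_reply(system_prompt, history, user_text)
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": reply})
    return synthesize(reply)

history = []
audio_out = handle_turn(b"\x00\x01", history, "You are a scheduling assistant.")
```

&lt;p&gt;In production each stage runs as a stream rather than a blocking call, which is precisely what LiveKit orchestrates.&lt;/p&gt;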

&lt;p&gt;If you're interested in building this kind of setup yourself, I’ve written a few posts on setting up each component individually (from SIP trunking to LiveKit integration). You can check those out here: &lt;a href="https://dev.to/roman_piacquadio/series/31126"&gt;https://dev.to/roman_piacquadio/series/31126&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the next section, we’ll define the assumptions behind our usage model so we can start assigning real-world costs to each of these layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  3) Baseline Assumptions for Cost Estimation
&lt;/h2&gt;

&lt;p&gt;To make the cost analysis meaningful, we need a realistic usage scenario. We’ll base our calculations on the following assumptions, which simulate a small team operating at steady scale:&lt;/p&gt;

&lt;h3&gt;
  
  
  📞 Call Volume
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each automated call lasts &lt;strong&gt;3 minutes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The AI system handles &lt;strong&gt;100 calls per day&lt;/strong&gt;, simulating the workload of one full-time agent.&lt;/li&gt;
&lt;li&gt;We simulate &lt;strong&gt;10 AI agents&lt;/strong&gt;, running &lt;strong&gt;22 business days per month&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;That results in &lt;strong&gt;22,000 calls per month&lt;/strong&gt;, totaling &lt;strong&gt;66,000 minutes of audio&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
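
&lt;p&gt;Since every cost in the sections that follow is derived from these two numbers, it’s worth writing the multiplication down once:&lt;/p&gt;

```python
# Usage model: 10 simulated agents, each handling 100 three-minute
# calls per day, over 22 business days per month.
CALL_MINUTES = 3
CALLS_PER_AGENT_PER_DAY = 100
AGENTS = 10
BUSINESS_DAYS = 22

calls_per_month = CALLS_PER_AGENT_PER_DAY * AGENTS * BUSINESS_DAYS  # 22,000
audio_minutes = calls_per_month * CALL_MINUTES                      # 66,000
```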

&lt;h3&gt;
  
  
  🗣️ Talk Time Distribution
&lt;/h3&gt;

&lt;p&gt;In a natural two-way conversation, only one party speaks at a time. For simplicity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;AI speaks for ~50%&lt;/strong&gt; of the time → &lt;strong&gt;33,000 minutes of TTS&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;caller (human)&lt;/strong&gt; speaks for the other 50% → &lt;strong&gt;33,000 minutes of STT&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This split is a reasonable average for structured interactions such as confirmations, reminders, qualification flows, or outbound follow-ups.&lt;/p&gt;

&lt;h3&gt;
  
  
  🤖 Model and Service Selection
&lt;/h3&gt;

&lt;p&gt;We picked tools that balance &lt;strong&gt;quality&lt;/strong&gt;, &lt;strong&gt;latency&lt;/strong&gt;, and &lt;strong&gt;cost-effectiveness&lt;/strong&gt; for production-level use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LiveKit&lt;/strong&gt;: to orchestrate real-time audio and SIP integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deepgram Nova-2 (Enterprise)&lt;/strong&gt;: for fast, accurate streaming transcription with low per-minute cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4.1 mini&lt;/strong&gt;: a lightweight, affordable OpenAI model that still delivers strong reasoning and fluency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text-to-Speech&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ElevenLabs (Business plan)&lt;/strong&gt;: premium voices with emotional range and expressiveness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cartesia (Scale plan)&lt;/strong&gt;: lower-cost alternative for simpler use cases.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;SIP Trunking&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Twilio&lt;/strong&gt;: simple and widely used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telnyx&lt;/strong&gt;: cost-competitive with flexible routing.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This setup lets us explore the full range of pricing scenarios—from economy stacks to premium voice experiences—while keeping the core system consistent.&lt;/p&gt;

&lt;h2&gt;
  
  
  4) LiveKit — Orchestrating the Audio Layer
&lt;/h2&gt;

&lt;p&gt;At the center of our voice pipeline is &lt;strong&gt;LiveKit&lt;/strong&gt;, which handles real-time audio routing between the SIP trunk, transcription, TTS, and back to the caller. It’s the glue that makes low-latency, bidirectional communication possible.&lt;/p&gt;

&lt;p&gt;LiveKit offers a generous &lt;strong&gt;Scale Plan&lt;/strong&gt; designed for production workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base cost:&lt;/strong&gt; $500/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Includes:&lt;/strong&gt; 45,000 minutes of usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overage rate:&lt;/strong&gt; $0.003 per additional minute&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔢 Usage Breakdown
&lt;/h3&gt;

&lt;p&gt;With 22,000 calls per month at 3 minutes each, we consume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;66,000 total minutes&lt;/strong&gt; of LiveKit usage (audio flowing in and out).&lt;/li&gt;
&lt;li&gt;We exceed the plan’s included minutes by &lt;strong&gt;21,000 minutes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Overage cost: 21,000 × $0.003 = &lt;strong&gt;$63&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  💰 Total Monthly Cost for LiveKit
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Base plan: $500&lt;/li&gt;
&lt;li&gt;Overage: $63&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: $563/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🧮 Cost per Unit
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$563 ÷ 22,000 = &lt;strong&gt;$0.0256&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per hour&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20 calls/hour → 20 × $0.0256 = &lt;strong&gt;$0.512&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiveKit’s pricing scales linearly with usage and is well-suited for handling concurrent calls without needing to manage media servers manually.&lt;/p&gt;
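
&lt;p&gt;The figures above follow directly from the plan terms and can be reproduced with a short calculation:&lt;/p&gt;

```python
# LiveKit Scale plan: $500/month including 45,000 minutes,
# then $0.003 per additional minute.
BASE_COST = 500.00
INCLUDED_MINUTES = 45_000
OVERAGE_RATE = 0.003

total_minutes = 66_000
calls = 22_000

overage = max(0, total_minutes - INCLUDED_MINUTES) * OVERAGE_RATE  # $63
monthly = BASE_COST + overage                                      # $563
per_call = monthly / calls                                         # ~$0.0256
per_hour = per_call * 20                                           # ~$0.512
```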

&lt;blockquote&gt;
&lt;p&gt;Pricing based on public rates as of &lt;strong&gt;May 2025&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  5) Speech-to-Text — Transcribing the Human Side with Deepgram
&lt;/h2&gt;

&lt;p&gt;To understand what the caller says, we need fast and accurate transcription. For this, we use &lt;strong&gt;Deepgram&lt;/strong&gt;, a popular speech-to-text (STT) provider known for real-time streaming, multilingual support, and competitive enterprise pricing.&lt;/p&gt;

&lt;p&gt;We selected the &lt;strong&gt;Nova-2 model (Enterprise tier)&lt;/strong&gt; for its balance of speed, accuracy, and affordability.&lt;/p&gt;

&lt;h3&gt;
  
  
  🎧 Why Streaming Matters
&lt;/h3&gt;

&lt;p&gt;In a voice AI pipeline, latency is critical. Transcription must happen as the user speaks—not after they’ve finished—so the AI can respond naturally in near real-time.&lt;/p&gt;

&lt;p&gt;Streaming STT allows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower response times (smoother dialogue)&lt;/li&gt;
&lt;li&gt;Efficient token handling for the language model&lt;/li&gt;
&lt;li&gt;Better support for interruptions and barge-in behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deepgram’s Nova-2 model supports all of this out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  💰 Enterprise Pricing (Nova-2)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rate:&lt;/strong&gt; $0.0047 per minute (Enterprise tier, streaming)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Applicable usage:&lt;/strong&gt; Only transcribing the &lt;strong&gt;human side&lt;/strong&gt; of the call (50% of total time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monthly usage:&lt;/strong&gt; 22,000 calls × 1.5 min (caller talk time) = &lt;strong&gt;33,000 minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🧮 Cost Breakdown
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.5 min × $0.0047 = &lt;strong&gt;$0.00705&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per hour&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20 calls/hour → 20 × $0.00705 = &lt;strong&gt;$0.141&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total per month (10 agents)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;33,000 min × $0.0047 = &lt;strong&gt;$155.10&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This STT layer is one of the more affordable components of the pipeline, thanks to Deepgram’s usage-based pricing and efficient streaming infrastructure.&lt;/p&gt;
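
&lt;p&gt;Because only the caller’s half of each call is transcribed, the STT bill scales with caller talk time rather than total call time:&lt;/p&gt;

```python
# Deepgram Nova-2, Enterprise streaming rate.
RATE_PER_MIN = 0.0047
CALLER_MINUTES_PER_CALL = 1.5    # caller speaks ~50% of a 3-minute call
CALLS_PER_MONTH = 22_000

per_call = CALLER_MINUTES_PER_CALL * RATE_PER_MIN   # ~$0.00705
per_hour = per_call * 20                            # ~$0.141
monthly = CALLS_PER_MONTH * per_call                # ~$155.10
```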

&lt;blockquote&gt;
&lt;p&gt;Pricing based on public Enterprise rates as of &lt;strong&gt;May 2025&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  6) Language Model — Token Costs with GPT-4.1 mini
&lt;/h2&gt;

&lt;p&gt;The core of any conversational AI system is the language model that generates responses. In our setup, we use &lt;strong&gt;OpenAI’s GPT-4.1 mini&lt;/strong&gt;, which offers a great trade-off between intelligence, latency, and price.&lt;/p&gt;

&lt;p&gt;Unlike flat-rate pricing, token billing in GPT models varies depending on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input tokens&lt;/strong&gt; (the prompt + conversation history)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output tokens&lt;/strong&gt; (the generated reply)&lt;/li&gt;
&lt;li&gt;Whether input tokens are &lt;strong&gt;cached&lt;/strong&gt; (like a static system prompt) or &lt;strong&gt;non-cached&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s break that down.&lt;/p&gt;




&lt;h3&gt;
  
  
  📥 Input Tokens
&lt;/h3&gt;

&lt;p&gt;Each user message builds on previous context. So as the conversation progresses, the number of input tokens increases with every new request.&lt;/p&gt;

&lt;p&gt;For a 3-minute call with &lt;strong&gt;6 back-and-forth exchanges&lt;/strong&gt;, we estimate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompt:&lt;/strong&gt; ~2,000 tokens (sent with every request, but cacheable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation context:&lt;/strong&gt; grows with each turn (~975 non-cached tokens in total across the call)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total input tokens per call:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cached input:&lt;/strong&gt; 6 × 2,000 = 12,000 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-cached input:&lt;/strong&gt; ~975 tokens&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h3&gt;
  
  
  📤 Output Tokens
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The model responds 6 times (one per turn), with ~35 tokens per reply&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total output tokens per call:&lt;/strong&gt; ~210 tokens&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  💰 GPT-4.1 mini Pricing (May 2025)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Token Type&lt;/th&gt;
&lt;th&gt;Rate per million&lt;/th&gt;
&lt;th&gt;Per-token cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cached input&lt;/td&gt;
&lt;td&gt;$0.10 / 1M&lt;/td&gt;
&lt;td&gt;$0.00000010&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regular input&lt;/td&gt;
&lt;td&gt;$0.40 / 1M&lt;/td&gt;
&lt;td&gt;$0.00000040&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;$1.60 / 1M&lt;/td&gt;
&lt;td&gt;$0.00000160&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  🧮 Cost Breakdown (Per Call)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Token Type&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cached input&lt;/td&gt;
&lt;td&gt;12,000&lt;/td&gt;
&lt;td&gt;$0.10 / 1M&lt;/td&gt;
&lt;td&gt;$0.00120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-cached input&lt;/td&gt;
&lt;td&gt;975&lt;/td&gt;
&lt;td&gt;$0.40 / 1M&lt;/td&gt;
&lt;td&gt;$0.00039&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;210&lt;/td&gt;
&lt;td&gt;$1.60 / 1M&lt;/td&gt;
&lt;td&gt;$0.000336&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.001926&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  📊 Aggregate Costs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.001926&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per hour&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20 × $0.001926 = &lt;strong&gt;$0.0385&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total per month (10 agents)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22,000 × $0.001926 = &lt;strong&gt;$42.37&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
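
&lt;p&gt;The per-call and monthly figures above can be reproduced from the token counts and rates:&lt;/p&gt;

```python
# GPT-4.1 mini token model for one 3-minute, 6-turn call.
TURNS = 6
CACHED_INPUT = TURNS * 2_000   # ~2,000-token system prompt, resent each turn
REGULAR_INPUT = 975            # conversation context, total across the call
OUTPUT = TURNS * 35            # ~35 tokens per reply

RATE_CACHED = 0.10 / 1_000_000
RATE_INPUT = 0.40 / 1_000_000
RATE_OUTPUT = 1.60 / 1_000_000

per_call = (CACHED_INPUT * RATE_CACHED
            + REGULAR_INPUT * RATE_INPUT
            + OUTPUT * RATE_OUTPUT)      # ~$0.001926
monthly = per_call * 22_000              # ~$42.37
```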




&lt;p&gt;GPT-4.1 mini’s pricing structure rewards careful prompt engineering and context management. While the per-call cost is low, the input token growth curve makes it important to minimize unnecessary repetition in multi-turn conversations (e.g. trimming old exchanges or summarizing history).&lt;/p&gt;
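
&lt;p&gt;One common way to flatten that input-token growth curve is to cap the history sent with each request. This is a minimal sketch; a production system would count tokens with a tokenizer and summarize dropped turns rather than simply discard them:&lt;/p&gt;

```python
# Keep the system prompt plus only the most recent exchanges,
# bounding input tokens per request in long conversations.

def trim_history(messages, max_turns=3):
    """Keep the system message and the last `max_turns` user/assistant pairs."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * max_turns:]

msgs = [{"role": "system", "content": "You are a scheduling assistant."}]
for i in range(6):
    msgs.append({"role": "user", "content": f"user turn {i}"})
    msgs.append({"role": "assistant", "content": f"reply {i}"})

trimmed = trim_history(msgs, max_turns=3)
```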

&lt;blockquote&gt;
&lt;p&gt;Pricing based on OpenAI’s GPT-4.1 mini public rates as of &lt;strong&gt;May 2025&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  7) Text-to-Speech — Choosing Between ElevenLabs and Cartesia
&lt;/h2&gt;

&lt;p&gt;For the AI to respond naturally, we need to convert the model’s text output into high-quality speech. In our analysis, we compared two leading &lt;strong&gt;Text-to-Speech (TTS)&lt;/strong&gt; providers: &lt;strong&gt;ElevenLabs&lt;/strong&gt; and &lt;strong&gt;Cartesia&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Both platforms deliver excellent results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🗣️ &lt;strong&gt;Multilingual support&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🎭 &lt;strong&gt;Voice cloning capabilities&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Optimized for low-latency streaming&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key differences lie in pricing and voice variety.&lt;/p&gt;

&lt;h3&gt;
  
  
  🧬 ElevenLabs (Business Plan)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extensive voice library&lt;/strong&gt;, including highly expressive and emotional voices.&lt;/li&gt;
&lt;li&gt;Well-suited for customer-facing conversations where tone and nuance matter.&lt;/li&gt;
&lt;li&gt;Plan includes &lt;strong&gt;22,000 minutes&lt;/strong&gt; for &lt;strong&gt;$1,320/month&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overage minutes&lt;/strong&gt; cost $0.06/min.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We need &lt;strong&gt;33,000 minutes&lt;/strong&gt; per month (AI speaks ~50% of every call):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Base: $1,320
&lt;/li&gt;
&lt;li&gt;Overage: 11,000 × $0.06 = $660
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: $1,980/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🧪 Cartesia (Scale Plan)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Smaller voice library, but &lt;strong&gt;still multilingual and highly intelligible&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;More cost-effective for less expressive use cases.&lt;/li&gt;
&lt;li&gt;Estimated cost: &lt;strong&gt;$0.0299/min&lt;/strong&gt; under the Scale plan.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Monthly cost for 33,000 minutes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;33,000 × $0.0299 ≈ &lt;strong&gt;$986.70/month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🧮 Cost Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Per Call&lt;/th&gt;
&lt;th&gt;Per Hour&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElevenLabs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.06/min effective (blended)&lt;/td&gt;
&lt;td&gt;$1,980&lt;/td&gt;
&lt;td&gt;$0.09&lt;/td&gt;
&lt;td&gt;$1.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cartesia&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.0299/min&lt;/td&gt;
&lt;td&gt;$986.70&lt;/td&gt;
&lt;td&gt;$0.0449&lt;/td&gt;
&lt;td&gt;$0.897&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
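
&lt;p&gt;Both monthly figures follow from the AI’s 33,000 minutes of speech per month:&lt;/p&gt;

```python
TTS_MINUTES = 33_000
CALLS = 22_000

# ElevenLabs Business: $1,320/month including 22,000 min, then $0.06/min.
elevenlabs = 1_320 + max(0, TTS_MINUTES - 22_000) * 0.06   # $1,980
# Cartesia Scale: estimated flat rate of $0.0299/min.
cartesia = TTS_MINUTES * 0.0299                            # ~$986.70

per_call_elevenlabs = elevenlabs / CALLS                   # $0.09
per_call_cartesia = cartesia / CALLS                       # ~$0.0449
```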




&lt;h3&gt;
  
  
  🎯 Which One Should You Use?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Choose &lt;strong&gt;ElevenLabs&lt;/strong&gt; if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want &lt;strong&gt;high voice fidelity&lt;/strong&gt;, emotional range, or public-facing use.&lt;/li&gt;
&lt;li&gt;You care about building brand consistency with &lt;strong&gt;custom voices&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Choose &lt;strong&gt;Cartesia&lt;/strong&gt; if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re optimizing for &lt;strong&gt;cost and speed&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Expressiveness is less critical (e.g. follow-up reminders, routing flows).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Both providers are strong technically, with &lt;strong&gt;low-latency streaming&lt;/strong&gt;, voice cloning, and multilingual support out of the box.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Pricing based on public rates as of &lt;strong&gt;May 2025&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  8) SIP Trunking — Connecting to the Phone Network (Twilio vs Telnyx)
&lt;/h2&gt;

&lt;p&gt;To make and receive real phone calls, we need to connect our voice-AI system to the &lt;strong&gt;PSTN (Public Switched Telephone Network)&lt;/strong&gt;. This is where &lt;strong&gt;SIP trunking&lt;/strong&gt; comes in. It acts as the bridge between the internet and traditional phone numbers.&lt;/p&gt;

&lt;p&gt;In our setup, we evaluated two leading providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Twilio&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Telnyx&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both integrate seamlessly with &lt;strong&gt;LiveKit&lt;/strong&gt;, enabling bi-directional SIP call routing with support for outbound and inbound audio streams.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔁 Understanding the Billing: Origination vs Termination
&lt;/h3&gt;

&lt;p&gt;SIP trunking costs are typically split into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Termination&lt;/strong&gt; — outbound calls (your AI calls a user)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Origination&lt;/strong&gt; — inbound calls (users call your AI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phone number rental&lt;/strong&gt; — flat monthly rate per number&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this analysis, we assume &lt;strong&gt;outbound calling to U.S. local numbers&lt;/strong&gt; (the AI initiates the conversation). To keep the estimate conservative, the comparison below sums the termination and origination rates into a single blended per-minute cost.&lt;/p&gt;




&lt;h3&gt;
  
  
  💰 Cost Comparison: Twilio vs Telnyx
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Twilio&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Telnyx&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Termination (outbound)&lt;/td&gt;
&lt;td&gt;$0.0011/min&lt;/td&gt;
&lt;td&gt;$0.0050/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Origination (inbound)&lt;/td&gt;
&lt;td&gt;$0.0034/min&lt;/td&gt;
&lt;td&gt;$0.0035/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total per minute&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0045/min&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0085/min&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost per 3-min call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.0135&lt;/td&gt;
&lt;td&gt;$0.0255&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost per hour (20 calls)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.27&lt;/td&gt;
&lt;td&gt;$0.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monthly cost (22,000 calls)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$297.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$561.00&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
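
&lt;p&gt;The per-minute totals in the table combine the termination and origination legs; the downstream figures are straight multiplication:&lt;/p&gt;

```python
# Per-minute SIP rates (termination + origination), May 2025 public pricing.
RATES = {
    "Twilio": 0.0011 + 0.0034,   # $0.0045/min blended
    "Telnyx": 0.0050 + 0.0035,   # $0.0085/min blended
}

CALL_MINUTES = 3
CALLS_PER_MONTH = 22_000

per_call = {p: r * CALL_MINUTES for p, r in RATES.items()}
monthly = {p: c * CALLS_PER_MONTH for p, c in per_call.items()}
```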

&lt;blockquote&gt;
&lt;p&gt;Note: Phone number rental (e.g. $1.15/month for a local number) is a small fixed cost and not included here, since it’s negligible at volume.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  📌 Summary
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Twilio&lt;/strong&gt; is more cost-effective at lower scale, with highly transparent pricing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telnyx&lt;/strong&gt; offers flexibility, more control over routing, and competitive rates at higher volumes, especially for international calls.&lt;/li&gt;
&lt;li&gt;Both are easy to integrate with &lt;strong&gt;LiveKit SIP features&lt;/strong&gt;, making them suitable choices depending on your cost or feature preferences.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Pricing based on public SIP trunking rates as of &lt;strong&gt;May 2025&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  9) Putting It All Together — Full Stack Cost Comparison
&lt;/h2&gt;

&lt;p&gt;Now that we’ve broken down each component, let’s summarize the total cost of running a fully orchestrated voice AI system. We'll compare two realistic deployment stacks:&lt;/p&gt;

&lt;h3&gt;
  
  
  🟢 Economy Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TTS:&lt;/strong&gt; Cartesia (Scale plan)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SIP Trunking:&lt;/strong&gt; Twilio&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;STT:&lt;/strong&gt; Deepgram (Nova-2 Enterprise)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM:&lt;/strong&gt; GPT-4.1 mini&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio Orchestration:&lt;/strong&gt; LiveKit (Scale plan)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔵 Premium Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TTS:&lt;/strong&gt; ElevenLabs (Business plan + overage)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SIP Trunking:&lt;/strong&gt; Telnyx&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;STT:&lt;/strong&gt; Deepgram (Nova-2 Enterprise)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM:&lt;/strong&gt; GPT-4.1 mini&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio Orchestration:&lt;/strong&gt; LiveKit (Scale plan)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  💵 Cost Comparison Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Economy Stack&lt;/th&gt;
&lt;th&gt;Premium Stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LiveKit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$563.00&lt;/td&gt;
&lt;td&gt;$563.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STT (Deepgram)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$155.10&lt;/td&gt;
&lt;td&gt;$155.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM (GPT-4.1 mini)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$42.37&lt;/td&gt;
&lt;td&gt;$42.37&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TTS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$986.70 (Cartesia)&lt;/td&gt;
&lt;td&gt;$1,980.00 (11Labs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SIP Trunking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$297.00 (Twilio)&lt;/td&gt;
&lt;td&gt;$561.00 (Telnyx)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TOTAL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$2,044.17&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$3,301.47&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
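
&lt;p&gt;Summing the component costs confirms the totals and the headline savings figure:&lt;/p&gt;

```python
CALLS = 22_000
shared = 563.00 + 155.10 + 42.37      # LiveKit + STT + LLM (same in both stacks)

economy = shared + 986.70 + 297.00    # + Cartesia + Twilio   = $2,044.17
premium = shared + 1_980.00 + 561.00  # + ElevenLabs + Telnyx = $3,301.47

per_call_economy = economy / CALLS                  # ~$0.0929
per_call_premium = premium / CALLS                  # ~$0.1500
savings = 1 - per_call_economy / per_call_premium   # ~38%
```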




&lt;h3&gt;
  
  
  🧮 Unit Economics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Economy Stack&lt;/th&gt;
&lt;th&gt;Premium Stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.0929&lt;/td&gt;
&lt;td&gt;$0.1500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per hour (20 calls)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1.86&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  🏆 Which Stack Wins?
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Economy Stack&lt;/strong&gt; clearly offers &lt;strong&gt;substantial savings&lt;/strong&gt;, making it a great choice for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-volume, low-complexity workflows&lt;/li&gt;
&lt;li&gt;Prototypes or early-stage deployments&lt;/li&gt;
&lt;li&gt;Use cases where expressive TTS is not critical&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Meanwhile, the &lt;strong&gt;Premium Stack&lt;/strong&gt; is ideal when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caller experience and vocal quality are top priorities&lt;/li&gt;
&lt;li&gt;You need branded voices or enhanced emotional range&lt;/li&gt;
&lt;li&gt;You're targeting sensitive, trust-critical interactions (e.g., healthcare, finance)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both stacks are production-ready, but the &lt;strong&gt;Economy Stack costs ~38% less per call&lt;/strong&gt;, making it the winner in terms of operational efficiency.&lt;/p&gt;




&lt;h3&gt;
  
  
  📊 Visual Overview - Cost Comparison Bar Chart (Monthly Total and Per Call)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Total Monthly Cost (USD)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Cost (USD)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Economy Stack&lt;/td&gt;
&lt;td&gt;$2,044&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Premium Stack&lt;/td&gt;
&lt;td&gt;$3,301&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Cost Per Call (USD)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Cost Per Call&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Economy Stack&lt;/td&gt;
&lt;td&gt;$0.093&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Premium Stack&lt;/td&gt;
&lt;td&gt;$0.150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: All prices reflect public rates as of &lt;strong&gt;May 2025&lt;/strong&gt;. Our usage of every component exceeds its highest publicly listed pricing tier, so &lt;strong&gt;enterprise-level negotiation is likely to yield 30–50% discounts&lt;/strong&gt; at this scale.&lt;br&gt;
With those discounts, the Economy Stack could drop below &lt;strong&gt;$1,500/month&lt;/strong&gt;, and the Premium Stack below &lt;strong&gt;$2,300/month&lt;/strong&gt;, making large-scale deployment increasingly feasible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  10) Negotiating Beyond Public Pricing Tiers
&lt;/h2&gt;

&lt;p&gt;At the scale we’re modeling—&lt;strong&gt;22,000 calls per month&lt;/strong&gt;, totaling &lt;strong&gt;66,000 minutes of voice&lt;/strong&gt;, &lt;strong&gt;33,000 minutes of TTS&lt;/strong&gt;, and &lt;strong&gt;33,000 minutes of transcription&lt;/strong&gt;—&lt;strong&gt;every major component of the stack exceeds the highest publicly available pricing tier&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LiveKit&lt;/strong&gt; (Scale plan: 45,000 min included → we use 66,000)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deepgram&lt;/strong&gt; (Enterprise pricing already applies)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ElevenLabs&lt;/strong&gt; (Business plan includes 22,000 min → we use 33,000)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cartesia&lt;/strong&gt; (Scale plan rates exceeded)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Twilio / Telnyx&lt;/strong&gt; (volume usage beyond typical pay-as-you-go)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI GPT-4.1 mini&lt;/strong&gt; (high token volume, consistent monthly usage)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🧾 Why Enterprise Negotiation Matters
&lt;/h3&gt;

&lt;p&gt;When your usage becomes predictable and high-volume, vendors are often open to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Committed-use discounts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Volume-based pricing tiers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bundled service contracts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom SLAs and support&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Discounts in the &lt;strong&gt;30%–50% range&lt;/strong&gt; are realistic, especially when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You negotiate multi-month or annual commitments&lt;/li&gt;
&lt;li&gt;You consolidate services under a single provider&lt;/li&gt;
&lt;li&gt;You become a reference customer or provide product feedback&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  💸 Recalculated Costs with ~40% Discount
&lt;/h3&gt;

&lt;p&gt;Applying a &lt;strong&gt;conservative 40% discount&lt;/strong&gt; across the stack:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack Type&lt;/th&gt;
&lt;th&gt;Full Price (Monthly)&lt;/th&gt;
&lt;th&gt;After Discount (–40%)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Economy Stack&lt;/td&gt;
&lt;td&gt;$2,044.17&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,226.50&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Premium Stack&lt;/td&gt;
&lt;td&gt;$3,301.47&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,980.88&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These adjusted prices bring the &lt;strong&gt;cost per call&lt;/strong&gt; down to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Economy Stack:&lt;/strong&gt; ~$0.056&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premium Stack:&lt;/strong&gt; ~$0.090&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And &lt;strong&gt;cost per hour&lt;/strong&gt; down to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Economy Stack:&lt;/strong&gt; ~$1.12&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premium Stack:&lt;/strong&gt; ~$1.80&lt;/li&gt;
&lt;/ul&gt;
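
&lt;p&gt;For anyone adapting these assumptions, the per-call and per-hour figures follow directly from the volumes modeled above (22,000 calls and 66,000 voice minutes per month). A quick sketch to reproduce them with your own negotiated total:&lt;/p&gt;

```python
# Unit-cost helper for the article's volume assumptions:
# 22,000 calls and 66,000 voice minutes (1,100 hours) per month.
CALLS_PER_MONTH = 22_000
VOICE_MINUTES_PER_MONTH = 66_000

def unit_costs(monthly_total_usd: float) -> tuple[float, float]:
    """Return (cost per call, cost per hour) for a monthly stack total."""
    per_call = monthly_total_usd / CALLS_PER_MONTH
    per_hour = monthly_total_usd / (VOICE_MINUTES_PER_MONTH / 60)
    return per_call, per_hour

# Discounted Economy Stack ($1,226.50): ~$0.056/call, ~$1.12/hour
# Discounted Premium Stack ($1,980.88): ~$0.090/call, ~$1.80/hour
```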




&lt;h3&gt;
  
  
  ✅ Final Takeaway
&lt;/h3&gt;

&lt;p&gt;If you’re planning to scale voice-AI automation beyond a few thousand calls per month, don’t rely solely on self-serve pricing pages. Reach out to each vendor’s enterprise sales team—you may unlock significant savings that make production-scale deployment much more cost-effective than it initially appears.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All cost assumptions based on publicly available pricing as of &lt;strong&gt;May 2025&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  11) Operational Tips &amp;amp; Optimizations
&lt;/h2&gt;

&lt;p&gt;Once your voice-AI system is up and running, there are several strategies you can apply to reduce costs, improve performance, and make the whole experience smoother—without sacrificing quality.&lt;/p&gt;

&lt;p&gt;Here are some of the most effective optimizations:&lt;/p&gt;




&lt;h3&gt;
  
  
  🧠 1. Trim the Token Window
&lt;/h3&gt;

&lt;p&gt;Language model input costs scale with conversation history. Instead of sending the full transcript on every turn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summarize earlier turns&lt;/strong&gt; into compact memory chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove low-value exchanges&lt;/strong&gt; like “OK,” “Sure,” or greetings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use windowing strategies&lt;/strong&gt; (e.g., keep the last 3–4 turns only).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helps reduce input token usage, especially in longer conversations.&lt;/p&gt;
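
&lt;p&gt;A minimal sketch of the windowing idea, assuming the common chat-message list format (the turn cap and filler list are illustrative values, not tuned ones):&lt;/p&gt;

```python
# Keep the system prompt, drop filler turns, and cap the remaining
# history at the last few turns before each LLM request.
FILLERS = {"ok", "okay", "sure", "thanks", "hello", "hi"}

def trim_window(messages: list[dict], max_turns: int = 6) -> list[dict]:
    """Return the system prompt plus the last `max_turns` non-filler turns."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [
        m for m in messages
        if m["role"] != "system"
        and m["content"].strip().lower().rstrip(".!?") not in FILLERS
    ]
    return system + dialogue[-max_turns:]
```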




&lt;h3&gt;
  
  
  🔇 2. Silence Trimming &amp;amp; Voice Activity Detection (VAD)
&lt;/h3&gt;

&lt;p&gt;Avoid processing and transcribing empty audio:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;Voice Activity Detection&lt;/strong&gt; to skip long silences or background noise.&lt;/li&gt;
&lt;li&gt;Trim pauses before sending audio to STT or TTS services.&lt;/li&gt;
&lt;li&gt;Detect &lt;strong&gt;barge-ins&lt;/strong&gt; (caller interrupts the bot) to pause TTS playback early.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduces billed minutes on both STT and TTS sides.&lt;/p&gt;
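
&lt;p&gt;Production systems would use a trained detector (WebRTC VAD, Silero, or the provider's built-in endpointing), but a simple energy gate shows the shape of the idea:&lt;/p&gt;

```python
import math
import struct

def is_speech(frame: bytes, threshold_rms: float = 500.0) -> bool:
    """Energy gate over a frame of 16-bit mono PCM (illustrative only)."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    rms = math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))
    return rms >= threshold_rms

def trim_silence(frames: list[bytes], threshold_rms: float = 500.0) -> list[bytes]:
    """Drop leading/trailing silent frames before they become billed minutes."""
    voiced = [is_speech(f, threshold_rms) for f in frames]
    if not any(voiced):
        return []
    start = voiced.index(True)
    end = len(voiced) - voiced[::-1].index(True)
    return frames[start:end]
```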




&lt;h3&gt;
  
  
  🧾 3. Cache the System Prompt
&lt;/h3&gt;

&lt;p&gt;OpenAI allows &lt;strong&gt;cached input tokens&lt;/strong&gt; (like a static system prompt) at a much lower rate. Make sure you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep your &lt;strong&gt;system prompt constant across requests&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Use API options that take advantage of caching when possible.&lt;/li&gt;
&lt;li&gt;Avoid resending unchanged instructions as raw text.&lt;/li&gt;
&lt;/ul&gt;
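
&lt;p&gt;In practice this mostly means ordering your prompt so the static part is a stable prefix—OpenAI's automatic caching matches on identical prompt prefixes. A hedged sketch (the message layout and prompt text are assumptions, not an official recipe):&lt;/p&gt;

```python
# Keep the static system prompt as an unchanging prefix so prefix-based
# prompt caching can apply; put volatile details (time, caller data)
# at the end, never inside the static block.
STATIC_SYSTEM_PROMPT = (
    "You are a polite phone assistant for Acme Dental. "  # hypothetical prompt
    "Confirm appointments, answer hours, and escalate anything else."
)

def build_messages(history: list[dict], caller_context: str) -> list[dict]:
    """Static prefix first, dynamic context last."""
    return (
        [{"role": "system", "content": STATIC_SYSTEM_PROMPT}]
        + list(history)
        + ([{"role": "system", "content": f"Caller context: {caller_context}"}]
           if caller_context else [])
    )
```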




&lt;h3&gt;
  
  
  💬 4. Pre-generate Common Replies
&lt;/h3&gt;

&lt;p&gt;For deterministic workflows (like confirming an appointment or collecting a yes/no), you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;pre-written text responses&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Skip the language model entirely for predictable branches&lt;/li&gt;
&lt;li&gt;Cut latency and token cost to zero on those turns&lt;/li&gt;
&lt;/ul&gt;
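
&lt;p&gt;One way to wire this up is a canned-response table checked before the model is ever called (the phrasing and the fallback hook here are hypothetical):&lt;/p&gt;

```python
# Canned replies for deterministic branches of the call flow.
CANNED = {
    "yes": "Great, you're confirmed. Anything else I can help with?",
    "no": "No problem, I've cancelled that. Anything else?",
}

def reply(user_text: str, llm_fallback) -> str:
    """Answer from the table when possible; otherwise defer to the LLM."""
    key = user_text.strip().lower().rstrip(".!?")
    if key in CANNED:
        return CANNED[key]          # zero tokens, near-zero latency
    return llm_fallback(user_text)  # unpredictable turns still hit the LLM
```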




&lt;h3&gt;
  
  
  📉 5. Committed-Use Agreements
&lt;/h3&gt;

&lt;p&gt;Once your usage stabilizes, talk to each vendor about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Volume discounts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Annual billing options&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom usage tiers&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vendors are often willing to negotiate lower prices when you commit to consistent usage or bundle multiple services.&lt;/p&gt;




&lt;h3&gt;
  
  
  🛠️ Bonus: Monitor &amp;amp; Adapt in Real Time
&lt;/h3&gt;

&lt;p&gt;Use analytics and observability tools (like SIP Insights, LiveKit metrics, or transcription confidence scores) to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spot anomalies (long silences, error spikes, dropped calls)&lt;/li&gt;
&lt;li&gt;Optimize system behavior dynamically&lt;/li&gt;
&lt;li&gt;Choose which interactions need human handoff&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;By applying even a few of these strategies, you can significantly reduce operational costs, improve response times, and deliver a more professional and polished AI voice experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  12) Conclusion — When the Numbers Make Sense, and When the Voice Matters
&lt;/h2&gt;

&lt;p&gt;Automating voice workflows isn’t about replacing people—it's about taking the repetitive, high-frequency interactions off their plates so they can focus on more meaningful work. With the right architecture and cost controls in place, voice-AI agents can handle thousands of predictable conversations efficiently and affordably.&lt;/p&gt;

&lt;h3&gt;
  
  
  📊 The Break-Even Point
&lt;/h3&gt;

&lt;p&gt;At roughly &lt;strong&gt;$0.056–$0.09 per call&lt;/strong&gt; (with enterprise pricing), you can simulate the monthly output of &lt;strong&gt;10 full-time agents&lt;/strong&gt; for &lt;strong&gt;$1,200–$2,000/month&lt;/strong&gt;. Depending on your geography, staffing model, and call volume, that’s often below the cost of a single human operator.&lt;/p&gt;

&lt;p&gt;This makes voice automation compelling for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lead qualification&lt;/li&gt;
&lt;li&gt;Appointment reminders&lt;/li&gt;
&lt;li&gt;Customer surveys&lt;/li&gt;
&lt;li&gt;Payment follow-ups&lt;/li&gt;
&lt;li&gt;Routine inbound routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Especially when those interactions follow predictable patterns or scripted flows.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔬 Where to Experiment Next
&lt;/h3&gt;

&lt;p&gt;If you're considering deploying your own voice AI assistant, the next logical steps might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Testing real customer calls with different TTS providers&lt;/li&gt;
&lt;li&gt;Measuring drop-off rates and call completion times&lt;/li&gt;
&lt;li&gt;A/B testing voice styles or model temperatures&lt;/li&gt;
&lt;li&gt;Monitoring cost per resolved interaction over time&lt;/li&gt;
&lt;li&gt;Integrating fallback routes for complex queries (human transfer, async follow-up)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Voice automation is no longer experimental—it's becoming operational. With the right balance of cost, quality, and control, you can build something that not only saves time but feels genuinely helpful to the people on the other end of the line.&lt;/p&gt;

&lt;h2&gt;
  
  
  13) Resources &amp;amp; Links
&lt;/h2&gt;

&lt;p&gt;Here’s a list of all the official pricing and documentation pages for the tools and platforms referenced throughout this article. You can refer to these for the latest rates, usage limits, and API capabilities:&lt;/p&gt;

&lt;h3&gt;
  
  
  🔷 LiveKit
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [LiveKit Pricing](https://livekit.io/pricing)
- [LiveKit Docs](https://docs.livekit.io/home/)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  🔷 Deepgram (Speech-to-Text)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [Deepgram Pricing](https://deepgram.com/pricing)  
- [Deepgram API Docs](https://developers.deepgram.com/home/introduction)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  🔷 OpenAI (GPT-4.1 mini)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [OpenAI Pricing](https://openai.com/api/pricing/)  
- [OpenAI API Docs](https://platform.openai.com/docs/overview)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  🔷 ElevenLabs (Text-to-Speech)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [ElevenLabs Pricing](https://elevenlabs.io/pricing/api)  
- [ElevenLabs Docs](https://elevenlabs.io/docs/overview)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  🔷 Cartesia
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [Cartesia Pricing](https://cartesia.ai/pricing)  
- [Cartesia API Docs](https://docs.cartesia.ai/2024-11-13/get-started/overview)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  🔷 Twilio (SIP Trunking)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [Twilio SIP Pricing](https://www.twilio.com/en-us/sip-trunking/pricing/us)  
- [Twilio Docs](https://www.twilio.com/docs/sip-trunking)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  🔷 Telnyx (SIP Trunking)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [Telnyx SIP Pricing](https://telnyx.com/pricing/elastic-sip)  
- [Telnyx Docs](https://developers.telnyx.com/)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>ai</category>
      <category>voice</category>
      <category>cost</category>
      <category>analysis</category>
    </item>
    <item>
      <title>Cracking the &lt; 1-second Voice Loop: What We Learned After 30+ Stack Benchmarks</title>
      <dc:creator>Roman Piacquadio</dc:creator>
      <pubDate>Mon, 19 May 2025 15:09:52 +0000</pubDate>
      <link>https://dev.to/cloudx/cracking-the-1-second-voice-loop-what-we-learned-after-30-stack-benchmarks-427</link>
      <guid>https://dev.to/cloudx/cracking-the-1-second-voice-loop-what-we-learned-after-30-stack-benchmarks-427</guid>
      <description>&lt;h2&gt;
  
  
  Introduction — Why We’re Racing for &lt;em&gt;Sub-Second&lt;/em&gt; Voice Loops
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;October 2024&lt;/strong&gt; OpenAI unveiled its &lt;strong&gt;Realtime API&lt;/strong&gt;, the first end-to-end &lt;strong&gt;multimodal model&lt;/strong&gt; able to convert speech → text → reasoning → speech fast enough to feel &lt;em&gt;human&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
That launch set the &lt;strong&gt;hype machine&lt;/strong&gt; spinning: “Why bother wiring three engines together when a single neural giant can do voice-to-voice in one shot?”&lt;/p&gt;

&lt;p&gt;Reality check:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pain Point&lt;/th&gt;
&lt;th&gt;Real-time Voice API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~&lt;strong&gt;$20/hour&lt;/strong&gt; of two-way conversation — rough for contact-center scale.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Voices&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Locked to a handful of OpenAI-curated timbres; no custom cloning or branded voices.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Swapability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You wait for &lt;em&gt;their&lt;/em&gt; next model drop — can’t plug in a brand-new STT or TTS that shipped yesterday.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Meanwhile, the open-source and vendor ecosystem didn’t sit still. By mid-2025 we could stitch together &lt;strong&gt;Deepgram STT + GPT-4 Nano/Mini + Cartesia Sonic (or ElevenLabs)&lt;/strong&gt; and hit &lt;em&gt;similar&lt;/em&gt; latency &lt;strong&gt;for a fraction of the cost&lt;/strong&gt; — while choosing any voice we like.&lt;/p&gt;

&lt;p&gt;The trick is to keep every stage &lt;strong&gt;modular&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-Text (STT)&lt;/strong&gt; — use whatever recognizer is fastest or cheapest today.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large Language Model (LLM)&lt;/strong&gt; — swap Mini ↔ Nano ↔ Flash checkpoints as they evolve.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text-to-Speech (TTS)&lt;/strong&gt; — pick the voice library that matches your brand.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Enter &lt;strong&gt;&lt;a href="https://livekit.io" rel="noopener noreferrer"&gt;LiveKit&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The glue that lets us shuffle those &lt;strong&gt;building blocks&lt;/strong&gt; in real time is &lt;strong&gt;LiveKit&lt;/strong&gt; — a WebRTC orchestration layer with an SDK that can fan-out telephone legs, browser streams, and AI workers on the same SFU.&lt;/p&gt;

&lt;p&gt;New STT, LLM, or TTS drops on a Friday?&lt;br&gt;&lt;br&gt;
We just &lt;strong&gt;swap the block&lt;/strong&gt;, &lt;strong&gt;restart the worker&lt;/strong&gt;, and it's live by lunch.&lt;/p&gt;

&lt;p&gt;No retraining. No monolithic rebuilds. Just composable parts evolving at their own pace.&lt;/p&gt;
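
&lt;p&gt;Stripped of vendor SDKs, the modularity argument is just three interchangeable callables wired in series (these component stubs are placeholders, not real LiveKit classes):&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoicePipeline:
    """Three interchangeable stages run strictly in series (toy illustration)."""
    stt: Callable[[bytes], str]   # audio in -> transcript
    llm: Callable[[str], str]     # transcript -> reply text
    tts: Callable[[str], bytes]   # reply text -> audio out

    def run_turn(self, audio_in: bytes) -> bytes:
        # Swapping a vendor is just rebinding one of the three fields.
        return self.tts(self.llm(self.stt(audio_in)))
```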




&lt;h2&gt;
  
  
  What “Latency” Really Means (and Why It Hurts)
&lt;/h2&gt;

&lt;p&gt;Human turn-taking is &lt;em&gt;fast&lt;/em&gt;. Large-scale multilingual studies show that the &lt;strong&gt;median inter-turn gap is ≈ 200 ms&lt;/strong&gt;, but the range spans from as low as &lt;strong&gt;7 ms&lt;/strong&gt; (in Japanese) to over &lt;strong&gt;440 ms&lt;/strong&gt; (in Danish), depending on the language, sentence structure, and context of the exchange &lt;a href="https://arxiv.org/pdf/2404.16053v1" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
A replication focused on English measured an average gap of &lt;strong&gt;236 ms ± 520 ms SD&lt;/strong&gt;, confirming that even within a single language, there’s wide variance depending on interaction type and formality.&lt;/p&gt;

&lt;p&gt;When the silence between turns stretches, our perception shifts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;One-way gap&lt;/th&gt;
&lt;th&gt;How it feels&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; ≈ 400 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Still “natural”, but you notice a beat.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;gt; ≈ 400 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ITU-T G.114 flags this as &lt;em&gt;unacceptable&lt;/em&gt; for conversational quality.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;gt; ≈ 600–700 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Most people label the call “robotic” or “satellite-delayed”.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These reference points form the benchmark we’re chasing:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;get the bot’s first syllable inside the ~400 ms comfort zone&lt;/strong&gt;—or, at the very least, close enough that the pause doesn’t break the conversational rhythm.&lt;/p&gt;




&lt;h2&gt;
  
  
  Anatomy of a Voice Pipeline
&lt;/h2&gt;

&lt;p&gt;A real-time loop has &lt;strong&gt;three streaming stages&lt;/strong&gt; that run strictly in series:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Latency metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STT – Speech-to-Text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Turns audio frames into text tokens.&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Final transcript time&lt;/em&gt; (but with proper streaming this is ≈ 0 ms relative to the next stage).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM – Large Language Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Crafts the reply.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TTFT (Time to First Token):&lt;/strong&gt; delay between sending the prompt and receiving the &lt;em&gt;first&lt;/em&gt; generated token.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TTS – Text-to-Speech&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Voices the reply.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TTFB (Time to First Byte):&lt;/strong&gt; delay between sending the text and receiving the first playable PCM chunk.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key observation:&lt;/strong&gt; in every stack we measured, &lt;strong&gt;LLM TTFT + TTS TTFB account for 90 %+ of total loop time&lt;/strong&gt;; with streaming recognizers, STT is effectively negligible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All three stages run in streaming inference — we start passing tokens or audio frames downstream the moment we see them.&lt;/p&gt;
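
&lt;p&gt;With streaming STT contributing roughly nothing, the first audible byte is approximately the sum of the two remaining metrics—handy for budgeting a candidate stack before benchmarking it:&lt;/p&gt;

```python
# First-byte budget: with streaming STT (≈ 0 ms relative to the LLM),
# the caller hears the first syllable after about TTFT + TTFB.
def first_byte_latency(llm_ttft_s: float, tts_ttfb_s: float,
                       stt_s: float = 0.0) -> float:
    """Approximate one-way loop latency in seconds."""
    return stt_s + llm_ttft_s + tts_ttfb_s

# e.g. a 0.30 s TTFT model with a 0.43 s TTFB voice ≈ 0.73 s to first byte
```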




&lt;h2&gt;
  
  
  The Latency / Quality / Cost Triangle
&lt;/h2&gt;

&lt;p&gt;Push one corner, the others move:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lower latency ⇢&lt;/strong&gt; smaller / quantized models, “good-enough” neural voices.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher quality ⇢&lt;/strong&gt; bigger LLMs, premium TTS; usually slower.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower cost ⇢&lt;/strong&gt; open-source or micro-models; may ding both speed &lt;em&gt;and&lt;/em&gt; fidelity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our job is to find the &lt;em&gt;quickest&lt;/em&gt; loop that still sounds customer-ready and doesn’t torch the budget.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6m969l4eot7v7un154x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6m969l4eot7v7un154x.png" alt="Latency / Quality / Cost Triangle" width="307" height="307"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How We Benchmarked
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Same system prompt in &lt;strong&gt;English&lt;/strong&gt; &lt;em&gt;and&lt;/em&gt; &lt;strong&gt;Spanish&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Dozens of &lt;strong&gt;STT + LLM + TTS combinations&lt;/strong&gt; (cloud &amp;amp; OSS); the table below shows the top performers.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiveKit&lt;/strong&gt; measured STT duration, TTFT, TTFB on every turn.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A few things we learned fast
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;LLMs &amp;amp; TTS slow down outside English.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A long system prompt only punishes the &lt;strong&gt;first&lt;/strong&gt; turn (~ +300 ms); later turns ride the KV-cache.
&lt;/li&gt;
&lt;li&gt;The newest “nano” LLMs plus an ultra-fast TTS can get that &lt;strong&gt;first syllable under 800 ms&lt;/strong&gt;, scraping the human comfort ceiling.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;STT&lt;/th&gt;
&lt;th&gt;LLM (version)&lt;/th&gt;
&lt;th&gt;TTS&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;TTFT (1st / next)&lt;/th&gt;
&lt;th&gt;TTS TTFB&lt;/th&gt;
&lt;th&gt;First Byte Latency*&lt;/th&gt;
&lt;th&gt;Tokens/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Whisper-1 (no stream)&lt;/td&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;0.34 / 0.34 s&lt;/td&gt;
&lt;td&gt;0.42–0.47 s&lt;/td&gt;
&lt;td&gt;3.1–3.9 s&lt;/td&gt;
&lt;td&gt;19–48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;0.31–1.63 / 0.31–0.45 s&lt;/td&gt;
&lt;td&gt;0.35–0.46 s&lt;/td&gt;
&lt;td&gt;0.7–2.1 s&lt;/td&gt;
&lt;td&gt;9–23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;0.31–0.44 / 0.31–0.40 s&lt;/td&gt;
&lt;td&gt;0.40–0.59 s&lt;/td&gt;
&lt;td&gt;0.71–1.03 s&lt;/td&gt;
&lt;td&gt;13–67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;ES&lt;/td&gt;
&lt;td&gt;0.77–1.33 / 0.75–0.95 s&lt;/td&gt;
&lt;td&gt;0.56–0.69 s&lt;/td&gt;
&lt;td&gt;1.33–2.02 s&lt;/td&gt;
&lt;td&gt;29–38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;Gemini 1.5 Flash&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;0.45–0.76 / 0.35–0.55 s&lt;/td&gt;
&lt;td&gt;0.45–0.70 s&lt;/td&gt;
&lt;td&gt;1.2–1.5 s&lt;/td&gt;
&lt;td&gt;40–85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;Gemini 1.5 Flash&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;ES&lt;/td&gt;
&lt;td&gt;1.30–2.37 / 1.10–1.40 s&lt;/td&gt;
&lt;td&gt;0.46–0.69 s&lt;/td&gt;
&lt;td&gt;1.8–3.0 s&lt;/td&gt;
&lt;td&gt;25–58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-2&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;1.22–1.41 / 0.42–0.90 s&lt;/td&gt;
&lt;td&gt;0.43–0.45 s&lt;/td&gt;
&lt;td&gt;1.65–1.86 s&lt;/td&gt;
&lt;td&gt;23–46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-2&lt;/td&gt;
&lt;td&gt;ES&lt;/td&gt;
&lt;td&gt;0.74–1.38 / 0.70–0.90 s&lt;/td&gt;
&lt;td&gt;0.48–0.52 s&lt;/td&gt;
&lt;td&gt;1.22–1.90 s&lt;/td&gt;
&lt;td&gt;22–42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-Turbo&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;1.15–1.24 / 0.44–0.65 s&lt;/td&gt;
&lt;td&gt;0.38–0.41 s&lt;/td&gt;
&lt;td&gt;1.53–1.65 s&lt;/td&gt;
&lt;td&gt;17–45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-Turbo&lt;/td&gt;
&lt;td&gt;ES&lt;/td&gt;
&lt;td&gt;0.75–1.11 / 0.30–0.40 s&lt;/td&gt;
&lt;td&gt;0.43–0.46 s&lt;/td&gt;
&lt;td&gt;1.18–1.57 s&lt;/td&gt;
&lt;td&gt;31–51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;Gemini 1.5 Flash&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-Turbo&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;1.19–1.27 / 1.19–1.27 s&lt;/td&gt;
&lt;td&gt;0.40–0.43 s&lt;/td&gt;
&lt;td&gt;1.59–1.70 s&lt;/td&gt;
&lt;td&gt;12–44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;Gemini 1.5 Flash&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-Turbo&lt;/td&gt;
&lt;td&gt;ES&lt;/td&gt;
&lt;td&gt;1.28–1.39 / 1.00–1.10 s&lt;/td&gt;
&lt;td&gt;0.42–0.44 s&lt;/td&gt;
&lt;td&gt;1.70–1.83 s&lt;/td&gt;
&lt;td&gt;40–56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-nano&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-Turbo&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;0.90–0.97 / 0.30–0.40 s&lt;/td&gt;
&lt;td&gt;0.42–0.52 s&lt;/td&gt;
&lt;td&gt;0.73–1.45 s&lt;/td&gt;
&lt;td&gt;40–105&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-nano&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-Turbo&lt;/td&gt;
&lt;td&gt;ES&lt;/td&gt;
&lt;td&gt;1.00–1.07 / 0.26–0.40 s&lt;/td&gt;
&lt;td&gt;0.43–0.50 s&lt;/td&gt;
&lt;td&gt;0.75–1.53 s&lt;/td&gt;
&lt;td&gt;70–116&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What the Numbers Tell Us
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. First-Turn Overhead Is Real
&lt;/h3&gt;

&lt;p&gt;Every stack shows a &lt;strong&gt;heavier first turn&lt;/strong&gt; because the LLM must ingest the entire system prompt before it can cache the KV-state.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Example:&lt;/em&gt; in the &lt;strong&gt;GPT-4 Mini + Sonic-2 (EN)&lt;/strong&gt; stack the first TTFT clocks at &lt;strong&gt;≈ 1.22 s&lt;/strong&gt;, but subsequent turns fall to &lt;strong&gt;≈ 0.42–0.90 s&lt;/strong&gt;. The “prompt tax” is ~300–800 ms, and it vanishes after turn 2 because the model re-uses its internal cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. We’re Getting Closer to Human Latency — But Not Quite There Yet
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human comfort band:&lt;/strong&gt; ~0.1–0.4 s one-way; anything above &lt;strong&gt;0.6–0.7 s&lt;/strong&gt; starts to feel "robotic."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best first syllable today:&lt;/strong&gt; &lt;strong&gt;0.73 s&lt;/strong&gt; (GPT-4 Nano + Sonic-Turbo, EN) and &lt;strong&gt;0.75 s&lt;/strong&gt; (same stack, ES).
&lt;em&gt;That’s still above the ≈ 400 ms ceiling ITU-T G.114 sets for conversational quality—and several times a natural inter-turn gap—but the distance is closing fast.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Second turn latency:&lt;/strong&gt; Since TTFT drops to &lt;strong&gt;0.26–0.40 s&lt;/strong&gt; and TTFB remains around &lt;strong&gt;0.43 s&lt;/strong&gt;, many loops land &lt;strong&gt;just under 0.7–0.8 s&lt;/strong&gt;—close enough that most users don’t perceive a delay.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. English Still Wins the Speed Race
&lt;/h3&gt;

&lt;p&gt;Across the board, Spanish incurs an extra &lt;strong&gt;+300–500 ms&lt;/strong&gt; in TTFT, and often a few additional milliseconds in TTFB.&lt;br&gt;&lt;br&gt;
This isn't surprising: most language models are trained on English-dominant datasets, and their tokenizers are optimized for English morphology. That means fewer tokens per word, higher-confidence predictions, and faster decoding paths.  &lt;/p&gt;

&lt;p&gt;In contrast, other languages often lead to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More tokens per sentence (due to suboptimal tokenization),&lt;/li&gt;
&lt;li&gt;Less frequent vocabulary (slower logits resolution),&lt;/li&gt;
&lt;li&gt;Slightly longer prompts (higher input load),&lt;/li&gt;
&lt;li&gt;And more uncertainty during generation (costlier decoding).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model providers are still actively optimizing multilingual performance—but for now, English remains the latency benchmark.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. STT Is a Non-Issue (When Streamed)
&lt;/h3&gt;

&lt;p&gt;Deepgram’s streaming mode continually emits tokens, so by the time the user finishes speaking the transcript is already done. &lt;strong&gt;&amp;lt; 5 ms&lt;/strong&gt; in our logs—effectively zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  Which Stack for Whom?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;Stack to Watch&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ultra-low latency (&amp;lt; 0.8 s first byte)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;GPT-4 Nano + Cartesia Sonic-Turbo&lt;/strong&gt; (rows 13–14)&lt;/td&gt;
&lt;td&gt;Fastest TTFT (&amp;lt; 1 s first turn, &amp;lt; 0.40 s thereafter). Great for IVRs, live game NPCs, or any app where “snappiness” beats eloquence. Expect slightly terser, less nuanced language.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Balanced latency &amp;amp; quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;GPT-4 Mini + Cartesia Sonic-2 / Sonic-Turbo&lt;/strong&gt; (rows 7–10)&lt;/td&gt;
&lt;td&gt;Adds ~150-250 ms but yields noticeably richer wording and better reasoning. Sweet spot for customer support or sales calls where tone matters.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language coverage beyond English&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Mini or Nano stacks + Sonic-Turbo (ES)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Spanish numbers are catching up; Sonic voices remain natural and the Nano drop still delivers TTFT &amp;lt; 1.1 s.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Premium voice fidelity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;ElevenLabs + Mini stacks&lt;/strong&gt; (rows 1–4)&lt;/td&gt;
&lt;td&gt;Neural voices lead the market in prosody; latency penalty is ~0.05–0.1 s vs. Sonic-Turbo—fine for podcasts, high-touch brand experiences.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(Quality judgments are subjective; we used blind A/B tests on 30 clips per stack.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions &amp;amp; Near-Term Outlook
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Composable beats monolithic—today.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Because STT, LLM, and TTS evolve on different cadences, a modular pipeline lets you upgrade components the moment something faster drops—unlike monolithic models, where you must wait for the next provider release.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sub-second voice loops are already viable&lt;/strong&gt; for English and edging in for Spanish. With smarter caching, phoneme-level streaming, and incremental TTS we expect &lt;strong&gt;&amp;lt; 500 ms&lt;/strong&gt; within a year.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model shrinkage will continue.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
“Nano” and “flash” checkpoints show that aggressive distillation + quantization can keep quality “good enough” while halving latency every generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Edge deployment is accelerating.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Thanks to aggressive quantization (8-bit and even 4-bit), large language and speech models are now deployable on local hardware—consumer GPUs, mobile NPUs, and even embedded systems. This allows parts of the voice loop to run &lt;strong&gt;on-device&lt;/strong&gt;, cutting out network delays and shaving &lt;strong&gt;50–150 ms&lt;/strong&gt; off total latency.&lt;br&gt;&lt;br&gt;
&lt;a href="https://substack.com/home/post/p-160808933?source=queue&amp;amp;utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Source: “AI Voice Inference at the Edge is Finally Here,” &lt;em&gt;VoiceTech Insights, 2025&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Joint LLM-TTS training is emerging.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A new generation of end-to-end speech models is beginning to bypass traditional TTS stages entirely. These models, like &lt;strong&gt;VITA-Audio&lt;/strong&gt; (2025), predict multiple audio tokens in a single step, generating speech directly from text while drastically reducing inference time. Once stable in streaming mode, these architectures could cut TTS latency to &lt;strong&gt;mere milliseconds&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/abs/2505.03739" rel="noopener noreferrer"&gt;Source: “VITA-Audio: Parallel Token-to-Audio Generation with Context-Aware Semantic Guidance,” &lt;em&gt;arXiv, May 2025&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
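&lt;p&gt;To make the quantization point concrete, here is a toy 8-bit affine quantizer in plain Python. It illustrates only the storage math (no production toolchain works this simply): each float32 weight collapses to one byte, a 4x size cut, at the cost of a rounding error bounded by the quantization step:&lt;/p&gt;

```python
def quantize_int8(weights):
    """Map floats onto the integer range 0..255 with a scale and offset.

    Toy affine quantization: stores each weight in 1 byte instead of the
    4 bytes of float32, at the cost of a bounded rounding error.
    """
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / 255 or 1.0  # guard against all-equal weights
    q = [round((w - w_min) / scale) for w in weights]
    restored = [v * scale + w_min for v in q]
    return q, scale, restored

weights = [-0.42, 0.07, 0.31, -0.18, 0.25]
q, scale, restored = quantize_int8(weights)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

&lt;p&gt;Real 4-bit schemes push the same trade further, which is part of why smaller checkpoints decode faster on-device: less memory traffic per token.&lt;/p&gt;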

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; We’re only a few iteration cycles away from voice agents that &lt;em&gt;consistently&lt;/em&gt; reply in the same temporal rhythm as humans. If you build with LiveKit-style modular pipelines today, you can ride that curve with an overnight adjustment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stay tuned—the sub-half-second voice loop is closer than most teams think.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>livekit</category>
      <category>voice</category>
    </item>
    <item>
      <title>Building Voice AI Agents with the OpenAI Agents SDK</title>
      <dc:creator>Roman Piacquadio</dc:creator>
      <pubDate>Mon, 05 May 2025 12:49:15 +0000</pubDate>
      <link>https://dev.to/cloudx/building-voice-ai-agents-with-the-openai-agents-sdk-2aog</link>
      <guid>https://dev.to/cloudx/building-voice-ai-agents-with-the-openai-agents-sdk-2aog</guid>
      <description>&lt;h2&gt;
  
  
  Beyond Single Turns: OpenAI Enters the Voice Agent Arena
&lt;/h2&gt;

&lt;p&gt;In our previous post, &lt;a href="https://dev.to/cloudx/building-multi-agent-conversations-with-webrtc-livekit-48f1"&gt;Building Multi-Agent Conversations with WebRTC &amp;amp; LiveKit&lt;/a&gt;, we explored how to create complex, multi-stage voice interactions using the real-time power of WebRTC and the orchestration capabilities of the LiveKit Agents framework. We saw how crucial low latency and effective state management are for natural conversations, especially when handing off between different agent roles.&lt;/p&gt;

&lt;p&gt;Recently, OpenAI has significantly enhanced its offerings for building agentic systems, including dedicated tools and SDKs for creating voice agents. While the core concept of chaining Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS) remains, OpenAI now provides more integrated primitives and an SDK designed to simplify this process, particularly within their ecosystem.&lt;/p&gt;

&lt;p&gt;This article dives into building voice agents using the OpenAI Agents SDK. We'll examine its architecture, walk through a Python example, and critically compare this approach with the LiveKit method discussed previously, highlighting the strengths, weaknesses, and ideal use cases for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI's Vision for Agents: Primitives and Orchestration
&lt;/h2&gt;

&lt;p&gt;OpenAI positions its platform as a set of composable primitives for building agents, covering domains like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Models:&lt;/strong&gt; Core intelligence (GPT-4o, the latest GPT-4.1 and GPT-4.1-mini, etc.) capable of reasoning and handling multimodality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; Interfaces to the outside world, including developer-defined function calling, built-in web search, file search, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge &amp;amp; Memory:&lt;/strong&gt; Using Vector Stores and Embeddings for context and persistence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio &amp;amp; Speech:&lt;/strong&gt; Primitives for understanding and generating voice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails:&lt;/strong&gt; Moderation and instruction hierarchy for safety and control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration:&lt;/strong&gt; The Agents SDK, Tracing, Evaluations, and Fine-tuning to manage the agent lifecycle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;strong&gt;Voice Agents&lt;/strong&gt;, OpenAI presents two main architectural paths:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-Speech (Multimodal - Realtime API):&lt;/strong&gt; Uses models like &lt;code&gt;gpt-4o-realtime-preview&lt;/code&gt; that process audio input directly and generate audio output, aiming for the lowest latency and understanding vocal nuances. This uses a specific Realtime API separate from the main Chat Completions API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chained (Agents SDK + Voice):&lt;/strong&gt; The more traditional STT → LLM → TTS flow, but orchestrated using the &lt;code&gt;openai-agents&lt;/code&gt; SDK with its &lt;code&gt;[voice]&lt;/code&gt; extension. This provides more transparency (text transcripts at each stage) and control, making it easier to integrate into existing text-based agent workflows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;This post will focus on the Chained architecture using the OpenAI Agents SDK, as it aligns more closely with common agent development patterns and provides a clearer comparison point to the plugin-based approach of LiveKit.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The OpenAI Agents SDK: Simplifying Agent Logic
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;openai-agents&lt;/code&gt; Python SDK aims to provide a lightweight way to build agents with a few core concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; An LLM equipped with instructions, tools, and potentially knowledge about when to hand off tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handoffs:&lt;/strong&gt; A mechanism allowing one agent to delegate tasks to another, more specialized agent. Agents are configured with a list of potential agents they can hand off to.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools (&lt;code&gt;@function_tool&lt;/code&gt;):&lt;/strong&gt; Decorator to easily expose Python functions to the agent, similar to standard OpenAI function calling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails:&lt;/strong&gt; Functions to validate inputs or outputs and enforce constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runner:&lt;/strong&gt; Executes the agent logic, handling the loop of calling the LLM, executing tools, and managing handoffs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VoicePipeline (with [voice] extra):&lt;/strong&gt; Wraps an agent workflow (like one using Runner) to handle the STT and TTS parts of a voice interaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The philosophy is "Python-first," relying on Python's built-in features for orchestration rather than introducing many complex abstractions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture with OpenAI Agents SDK (Chained Voice)
&lt;/h2&gt;

&lt;p&gt;When using the &lt;code&gt;VoicePipeline&lt;/code&gt; from the SDK, the typical flow for a voice turn looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audio Input:&lt;/strong&gt; Raw audio data (e.g., from a microphone) is captured.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VoicePipeline (STT):&lt;/strong&gt; The pipeline receives audio chunks. It uses an OpenAI STT model (like &lt;code&gt;gpt-4o-transcribe&lt;/code&gt; via the API) to transcribe the user's speech into text once speech ends (or via push-to-talk).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Workflow Execution (&lt;code&gt;MyWorkflow.run&lt;/code&gt; in the example):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The transcribed text is passed to your defined workflow (e.g., a class inheriting from &lt;code&gt;VoiceWorkflowBase&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Inside the workflow, the &lt;code&gt;Runner&lt;/code&gt; is invoked with the current &lt;code&gt;Agent&lt;/code&gt;, conversation history, and the new user text.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Agent&lt;/code&gt; (LLM) decides whether to respond directly, call a &lt;code&gt;Tool&lt;/code&gt; (function), or &lt;code&gt;Handoff&lt;/code&gt; to another agent based on its instructions and the user input.&lt;/li&gt;
&lt;li&gt;If a tool is called, the &lt;code&gt;Runner&lt;/code&gt; executes the Python function and sends the result back to the LLM.&lt;/li&gt;
&lt;li&gt;If a handoff occurs, the &lt;code&gt;Runner&lt;/code&gt; switches context to the new agent.&lt;/li&gt;
&lt;li&gt;The LLM generates the text response.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VoicePipeline (TTS):&lt;/strong&gt; The final text response from the agent workflow is sent to an OpenAI TTS model (e.g., &lt;code&gt;gpt-4o-mini-tts&lt;/code&gt;) via the API to generate audio.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio Output:&lt;/strong&gt; The generated audio data is streamed back to be played to the user.&lt;/li&gt;
&lt;/ol&gt;
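&lt;p&gt;The four-stage turn above can be sketched as three plain functions wired together. The stubs below are hypothetical stand-ins, not SDK calls; in a real pipeline each would hit the corresponding OpenAI API (&lt;code&gt;gpt-4o-transcribe&lt;/code&gt;, the agent workflow, &lt;code&gt;gpt-4o-mini-tts&lt;/code&gt;):&lt;/p&gt;

```python
def stt(audio: bytes) -> str:
    """Stand-in for speech-to-text (step 2); echoes bytes back as text."""
    return audio.decode("utf-8")

def think(text: str) -> str:
    """Stand-in for the agent workflow (step 3): Runner, tools, handoffs."""
    return f"You said: {text}"

def tts(text: str) -> bytes:
    """Stand-in for text-to-speech (step 4)."""
    return text.encode("utf-8")

def voice_turn(audio_in: bytes) -> bytes:
    """One chained turn: audio in, transcript, response text, audio out."""
    return tts(think(stt(audio_in)))

out = voice_turn(b"hello there")
```

&lt;p&gt;The value of the chained shape is exactly this composability: any stage can be swapped or logged because every boundary is plain text.&lt;/p&gt;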

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fxvbdr20c6orgccvawj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fxvbdr20c6orgccvawj.png" alt="VoicePipeline workflow" width="800" height="404"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;(Diagram: Microphone feeds audio to VoicePipeline for STT. Text goes to Agent Workflow (using Runner, Agent, Tools, Handoffs). Text response goes back to VoicePipeline for TTS, then to Speaker.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This contrasts with the LiveKit architecture where WebRTC handles the audio transport layer directly, and the &lt;code&gt;livekit-agents&lt;/code&gt; framework integrates STT/LLM/TTS plugins into that real-time stream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Build: The Multi-Lingual Assistant (Python Example)
&lt;/h2&gt;

&lt;p&gt;Let's break down the key parts of the official OpenAI Agents SDK voice example. (Link to the repository will be at the end).&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8+&lt;/li&gt;
&lt;li&gt;OpenAI API Key.&lt;/li&gt;
&lt;li&gt;Install the SDK with voice extras:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"openai-agents[voice]"&lt;/span&gt; sounddevice numpy python-dotenv textual &lt;span class="c"&gt;# For the demo UI&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Setup (.env file)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env&lt;/span&gt;
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Core Agent Logic (&lt;code&gt;my_workflow.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;This file defines the agents and the workflow logic that runs after speech is transcribed to text and before the response text is sent for synthesis.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Imports:&lt;/strong&gt; Necessary components from &lt;code&gt;agents&lt;/code&gt; SDK (&lt;code&gt;Agent&lt;/code&gt;, &lt;code&gt;Runner&lt;/code&gt;, &lt;code&gt;function_tool&lt;/code&gt;, &lt;code&gt;VoiceWorkflowBase&lt;/code&gt;, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Definition (&lt;code&gt;get_weather&lt;/code&gt;):&lt;/strong&gt; A simple Python function decorated with &lt;code&gt;@function_tool&lt;/code&gt; to make it callable by the &lt;code&gt;agent&lt;/code&gt;. The SDK handles generating the schema for the LLM.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections.abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncIterator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TResponseInputItem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function_tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents.extensions.handoff_prompt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;prompt_with_handoff_instructions&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents.voice&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VoiceWorkflowBase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VoiceWorkflowHelper&lt;/span&gt;

&lt;span class="nd"&gt;@function_tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get the weather for a given city.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[debug] get_weather called with city: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;choices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sunny&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rainy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snowy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The weather in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent Definitions (&lt;code&gt;spanish_agent&lt;/code&gt;, &lt;code&gt;agent&lt;/code&gt;):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Each &lt;code&gt;Agent&lt;/code&gt; is created with a &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;instructions&lt;/code&gt; (using a helper &lt;code&gt;prompt_with_handoff_instructions&lt;/code&gt; to guide its behavior regarding handoffs), a &lt;code&gt;model&lt;/code&gt;, and optionally &lt;code&gt;tools&lt;/code&gt; it can use and other &lt;code&gt;handoffs&lt;/code&gt; it can initiate.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;handoff_description&lt;/code&gt; helps the calling agent decide which agent to hand off to.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spanish_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Spanish&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handoff_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A spanish speaking agent.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;prompt_with_handoff_instructions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re speaking to a human, so be polite and concise. Speak in Spanish.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;prompt_with_handoff_instructions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re speaking to a human, so be polite and concise. If the user speaks in Spanish, handoff to the spanish agent.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handoffs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;spanish_agent&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# List of agents it can hand off to
&lt;/span&gt;    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;      &lt;span class="c1"&gt;# List of tools it can use
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workflow Class (&lt;code&gt;MyWorkflow&lt;/code&gt;):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Inherits from &lt;code&gt;VoiceWorkflowBase&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;__init__&lt;/code&gt;: Stores configuration (like the &lt;code&gt;secret_word&lt;/code&gt; for a simple game logic) and maintains state like conversation history (&lt;code&gt;_input_history&lt;/code&gt;) and the currently active agent (&lt;code&gt;_current_agent&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;run(transcription: str)&lt;/code&gt;: This is the core method called by the &lt;code&gt;VoicePipeline&lt;/code&gt; after STT.&lt;/li&gt;
&lt;li&gt;It receives the user's transcribed text.&lt;/li&gt;
&lt;li&gt;Updates the conversation history.&lt;/li&gt;
&lt;li&gt;Contains custom logic (like checking for the secret word).&lt;/li&gt;
&lt;li&gt;Invokes &lt;code&gt;Runner.run_streamed&lt;/code&gt; with the current agent and history. This handles the interaction with the LLM, tool calls, and potential handoffs based on the agent's configuration.&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;VoiceWorkflowHelper.stream_text_from&lt;/code&gt; to yield text chunks as they are generated by the LLM (enabling faster TTS start).&lt;/li&gt;
&lt;li&gt;Updates the history and potentially the &lt;code&gt;_current_agent&lt;/code&gt; based on the &lt;code&gt;Runner&lt;/code&gt;'s result (if a handoff occurred).
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyWorkflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VoiceWorkflowBase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secret_word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="c1"&gt;# ... (init stores history, current_agent, secret_word, callback) ...
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_input_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TResponseInputItem&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_current_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_secret_word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secret_word&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_on_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;on_start&lt;/span&gt; &lt;span class="c1"&gt;# Callback for UI updates
&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AsyncIterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_on_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Call the UI callback
&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_input_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_secret_word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="c1"&gt;# Custom logic example
&lt;/span&gt;            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You guessed the secret word!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="c1"&gt;# ... (update history) ...
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="c1"&gt;# Run the agent logic using the Runner
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_streamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_current_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_input_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Stream text chunks for faster TTS
&lt;/span&gt;        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;VoiceWorkflowHelper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_text_from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;

        &lt;span class="c1"&gt;# Update state for the next turn
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_input_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_input_list&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_current_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_agent&lt;/span&gt; &lt;span class="c1"&gt;# Agent might have changed via handoff
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Client &amp;amp; Pipeline Setup (&lt;code&gt;main.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;This file sets up a simple Textual-based UI and manages the audio input/output and the &lt;code&gt;VoicePipeline&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It initializes &lt;code&gt;sounddevice&lt;/code&gt; for microphone input and speaker output.&lt;/li&gt;
&lt;li&gt;Creates the &lt;code&gt;VoicePipeline&lt;/code&gt;, passing in the &lt;code&gt;MyWorkflow&lt;/code&gt; instance.&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;StreamedAudioInput&lt;/code&gt; to feed microphone data into the pipeline.&lt;/li&gt;
&lt;li&gt;Starts the pipeline using &lt;code&gt;pipeline.run(self._audio_input)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Asynchronously iterates over &lt;code&gt;result.stream()&lt;/code&gt; to:

&lt;ul&gt;
&lt;li&gt;Play back audio chunks (&lt;code&gt;voice_stream_event_audio&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Display lifecycle events or transcriptions in the UI.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Handles starting/stopping recording based on key presses ('k').
&lt;em&gt;(Note: We won't dive deep into the Textual UI code here, focusing instead on the agent interaction pattern.)&lt;/em&gt;
&lt;/li&gt;

&lt;/ul&gt;
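&lt;p&gt;The push-to-talk behavior is easy to reason about as a tiny state machine. The class below is a hypothetical sketch, not the demo's actual Textual code: each press of the hotkey flips between recording and idle, and audio chunks are only buffered while recording.&lt;/p&gt;

```python
class PushToTalk:
    """Minimal toggle state machine mirroring the demo's 'k'-key behavior."""

    def __init__(self):
        self.recording = False
        self.buffer = []

    def press_key(self):
        """Flip between idle and recording; a new recording clears the buffer."""
        self.recording = not self.recording
        if self.recording:
            self.buffer = []
        return self.recording

    def on_audio_chunk(self, chunk):
        """Buffer microphone data only while recording is active."""
        if self.recording:
            self.buffer.append(chunk)

ptt = PushToTalk()
ptt.on_audio_chunk(b"ignored")   # arrives before the first key press
ptt.press_key()                  # start recording
ptt.on_audio_chunk(b"hello ")
ptt.on_audio_chunk(b"world")
ptt.press_key()                  # stop; buffer is ready for the pipeline
utterance = b"".join(ptt.buffer)
```

&lt;p&gt;The real demo wires the same idea into the Textual event loop, feeding the captured audio into the pipeline via &lt;code&gt;StreamedAudioInput&lt;/code&gt;.&lt;/p&gt;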

&lt;h2&gt;
  
  
  5. Running the Example
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ensure &lt;code&gt;.env&lt;/code&gt; is set up.&lt;/li&gt;
&lt;li&gt;Run the main script: &lt;code&gt;python main.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Press &lt;code&gt;k&lt;/code&gt; to start recording, speak, then press &lt;code&gt;k&lt;/code&gt; again to stop. The agent should respond.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparing Approaches: OpenAI Agents SDK vs. LiveKit Agents
&lt;/h2&gt;

&lt;p&gt;Both frameworks allow building sophisticated voice agents with multiple roles, but they excel in different areas due to their underlying philosophies and technologies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OpenAI Agents SDK (Chained Voice)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;LiveKit Agents Framework (WebRTC)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core Technology&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🐍 Python SDK orchestrating OpenAI APIs (STT, LLM, TTS)&lt;/td&gt;
&lt;td&gt;🌐 Python Framework built on LiveKit &amp;amp; WebRTC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Higher (API calls for STT, LLM, TTS per turn)&lt;/td&gt;
&lt;td&gt;✅ Lower (Direct WebRTC streaming, optimized for voice)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time Audio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Handled by SDK (VoicePipeline), abstracts away the stream&lt;/td&gt;
&lt;td&gt;✅ Core feature via WebRTC, fine-grained control possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Generally Lower (mainly SDK install &amp;amp; API keys)&lt;/td&gt;
&lt;td&gt;⚠️ Higher (Requires LiveKit server setup/cloud account)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STT/TTS Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Primarily uses OpenAI models via API.&lt;/td&gt;
&lt;td&gt;✅ Plugin-based (OpenAI, Deepgram, Google, etc.) easy swap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Uses OpenAI models via API.&lt;/td&gt;
&lt;td&gt;✅ Plugin-based (OpenAI, Anthropic, Local models, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interruption Handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Not built-in for StreamedAudioInput. Requires manual implementation listening to lifecycle events.&lt;/td&gt;
&lt;td&gt;✅ Built-in using VAD plugins (e.g., Silero).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Managed within Python workflow (e.g., list history)&lt;/td&gt;
&lt;td&gt;✅ Explicit userdata on AgentSession, shared state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Agent Handoff&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Declarative (handoffs list in Agent)&lt;/td&gt;
&lt;td&gt;⚠️ Imperative (Agent function returns next agent instance)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ecosystem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Integrated with OpenAI Tracing, Evals, Fine-tuning.&lt;/td&gt;
&lt;td&gt;⚠️ Focused on real-time communication infrastructure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Depends on Python deployment &amp;amp; API limits.&lt;/td&gt;
&lt;td&gt;✅ Built on scalable WebRTC infrastructure (LiveKit).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note on OpenAI Realtime API:&lt;/strong&gt; OpenAI does offer the gpt-4o-realtime-preview model via a separate Realtime API for true speech-to-speech with potentially very low latency. However, this is a different architecture than the Agents SDK VoicePipeline discussed here, uses specific models, and has its own implementation details.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Choose Which?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Choose OpenAI Agents SDK (Chained Voice) When:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;You primarily want to work within the OpenAI ecosystem (Models, Tracing, Evals).&lt;/li&gt;
&lt;li&gt;Your application can tolerate the slightly higher latency inherent in the chained API calls.&lt;/li&gt;
&lt;li&gt;You prefer a simpler initial setup without managing WebRTC infrastructure.&lt;/li&gt;
&lt;li&gt;You need transparency with text transcripts at each stage (STT output, LLM input/output).&lt;/li&gt;
&lt;li&gt;Built-in, low-latency interruption handling is not a critical out-of-the-box requirement.&lt;/li&gt;
&lt;li&gt;Your core logic is already text-based, and you're adding a voice interface.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Choose LiveKit Agents Framework When:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimizing latency&lt;/strong&gt; is paramount for natural turn-taking.&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;robust, built-in interruption handling&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You require &lt;strong&gt;flexibility to choose and easily swap&lt;/strong&gt; different STT, LLM, and TTS providers (including non-OpenAI or self-hosted).&lt;/li&gt;
&lt;li&gt;You need fine-grained control over the real-time audio/video streams (WebRTC).&lt;/li&gt;
&lt;li&gt;You are building applications that inherently benefit from a "room"-based model (e.g., multiple users, agent joining calls).&lt;/li&gt;
&lt;li&gt;Scalability for many concurrent real-time connections is a primary concern.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenAI's introduction of the Agents SDK, especially with its voice capabilities, provides a compelling and relatively straightforward path for developers already invested in their ecosystem to build voice agents. The VoicePipeline abstracts away some of the complexities of the STT → LLM → TTS chain. Its strengths lie in integration with OpenAI's tools (like tracing) and the declarative nature of defining agents, tools, and handoffs.&lt;/p&gt;

&lt;p&gt;However, for applications demanding the absolute lowest latency, seamless interruption handling, and maximum flexibility in choosing underlying AI models, the WebRTC-based approach offered by frameworks like LiveKit Agents remains a very strong contender. It requires more infrastructure setup but provides unparalleled control over the real-time aspects of the conversation.&lt;/p&gt;

&lt;p&gt;The choice depends heavily on your specific project requirements, tolerance for latency, need for flexibility, and existing technology stack. Both approaches offer powerful ways to move beyond simple bots and create truly interactive voice AI experiences.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explore the &lt;a href="https://openai.github.io/openai-agents-python/" rel="noopener noreferrer"&gt;OpenAI Agents SDK Documentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Check out the &lt;a href="https://github.com/openai/openai-agents-python" rel="noopener noreferrer"&gt;OpenAI Agents GitHub Repository&lt;/a&gt; and the &lt;a href="https://github.com/openai/openai-agents-python/tree/main/examples/voice" rel="noopener noreferrer"&gt;voice example&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Learn about the &lt;a href="https://platform.openai.com/docs/guides/realtime" rel="noopener noreferrer"&gt;OpenAI Realtime API&lt;/a&gt; for speech-to-speech.&lt;/li&gt;
&lt;li&gt;Revisit the &lt;a href="https://docs.livekit.io/agents/" rel="noopener noreferrer"&gt;LiveKit Agents Documentation&lt;/a&gt; for comparison.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What are your thoughts on these different approaches to building voice agents? Let me know in the comments!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>openai</category>
      <category>python</category>
    </item>
    <item>
      <title>Building Multi-Agent Conversations with WebRTC &amp; LiveKit</title>
      <dc:creator>Roman Piacquadio</dc:creator>
      <pubDate>Thu, 10 Apr 2025 12:53:13 +0000</pubDate>
      <link>https://dev.to/cloudx/building-multi-agent-conversations-with-webrtc-livekit-48f1</link>
      <guid>https://dev.to/cloudx/building-multi-agent-conversations-with-webrtc-livekit-48f1</guid>
      <description>&lt;h2&gt;
  
  
  From Simple Bots to Dynamic Conversations
&lt;/h2&gt;

&lt;p&gt;We've all seen the basic voice AI demos – ask a question, get an answer. But real-world interactions often involve multiple stages, different roles, or specialized knowledge. How do you build a voice AI system that can gracefully handle introductions, gather information, perform a core task, and then provide a conclusion, potentially using different AI "personalities" or models along the way?&lt;/p&gt;

&lt;p&gt;Chaining traditional REST API calls for STT, LLM, and TTS already introduces latency and state management headaches for a &lt;em&gt;single&lt;/em&gt; agent turn. Trying to orchestrate &lt;em&gt;multiple&lt;/em&gt; logical agents or conversational stages this way becomes exponentially more complex, laggy, and brittle.&lt;/p&gt;

&lt;p&gt;This article explores a powerful solution: building &lt;strong&gt;multi-agent voice AI sessions&lt;/strong&gt; using &lt;strong&gt;WebRTC&lt;/strong&gt; for real-time communication and the &lt;strong&gt;LiveKit Agents framework&lt;/strong&gt; for orchestration. We'll look at a practical Python example of a "storyteller" agent that first gathers user info and then hands off to a specialized story-generating agent, all within a single, low-latency voice call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does the Standard API Approach Fall Short (Especially for Multi-Agent)?
&lt;/h2&gt;

&lt;p&gt;The typical STT → LLM → TTS cycle via separate API calls suffers from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🐌 &lt;strong&gt;High Latency:&lt;/strong&gt; Each step adds delay, making turn-taking slow.&lt;/li&gt;
&lt;li&gt;💸 &lt;strong&gt;Potential High Cost:&lt;/strong&gt; Multiple API calls per user turn can get expensive.&lt;/li&gt;
&lt;li&gt;🧠 &lt;strong&gt;State Management Hell:&lt;/strong&gt; Keeping track of conversation history and shared data &lt;em&gt;across different logical stages or agents&lt;/em&gt; is difficult with stateless APIs.&lt;/li&gt;
&lt;li&gt;📉 &lt;strong&gt;Reliability &amp;amp; Scalability Issues:&lt;/strong&gt; A single backend process trying to juggle multiple users &lt;em&gt;and&lt;/em&gt; complex state logic becomes a bottleneck and point of failure.&lt;/li&gt;
&lt;li&gt;😠 &lt;strong&gt;Robotic Interaction:&lt;/strong&gt; Difficulty handling interruptions or smoothly transitioning between conversational goals.&lt;/li&gt;
&lt;/ul&gt;
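&lt;p&gt;To see why this adds up, sketch a per-turn latency budget. The figures below are rough assumptions for illustration, not measurements of any particular provider:&lt;/p&gt;

```python
# Illustrative per-turn latency budget for a chained STT -> LLM -> TTS pipeline.
# All figures are assumed round-trip times in milliseconds, not measurements.
STAGE_LATENCY_MS = {
    "stt_request": 400,       # upload audio chunk + receive transcript
    "llm_response": 900,      # chat completion round trip
    "tts_request": 500,       # synthesize + download audio
    "network_overhead": 150,  # extra hops between your backend and each API
}

total_ms = sum(STAGE_LATENCY_MS.values())
print(f"Estimated silence per turn: {total_ms} ms")
```

&lt;p&gt;Roughly two seconds of dead air per turn is enough to make a conversation feel robotic, which is exactly what streaming transports are designed to avoid.&lt;/p&gt;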

&lt;h2&gt;
  
  
  The Solution: Real-Time Foundations with WebRTC &amp;amp; LiveKit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;WebRTC (Web Real-Time Communication):&lt;/strong&gt; This browser and mobile standard allows direct, low-latency audio/video/data streaming between participants. It's the foundation for seamless voice calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiveKit:&lt;/strong&gt; An open-source infrastructure layer that makes building scalable, reliable WebRTC applications &lt;em&gt;much&lt;/em&gt; easier. For voice AI, it provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Signaling &amp;amp; Room Management:&lt;/strong&gt; Handles participant connections, discovery, and the state of the "room" where the conversation happens. This is our "virtual conference room."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized Media Streaming:&lt;/strong&gt; Ensures audio flows efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Designed for many concurrent users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Framework (&lt;code&gt;livekit-agents&lt;/code&gt;):&lt;/strong&gt; A specific Python library (and SDKs for other languages) built on LiveKit, designed explicitly for creating voice (and text) AI agents. It manages the complexities of:

&lt;ul&gt;
&lt;li&gt;Real-time audio stream processing.&lt;/li&gt;
&lt;li&gt;Integrating STT, LLM, TTS plugins.&lt;/li&gt;
&lt;li&gt;Handling interruptions (using VAD - Voice Activity Detection).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Crucially for this article: Managing multiple agents within a single session and facilitating handoffs and shared state.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Multi-Agent Architecture with LiveKit Agents
&lt;/h2&gt;

&lt;p&gt;The provided example demonstrates a pattern where agents collaborate within a single &lt;code&gt;AgentSession&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;User &amp;amp; Session Start:&lt;/strong&gt; The user connects to a LiveKit room. An &lt;code&gt;AgentSession&lt;/code&gt; is created, managing the overall interaction and holding shared &lt;code&gt;userdata&lt;/code&gt;. The &lt;em&gt;initial&lt;/em&gt; agent (&lt;code&gt;IntroAgent&lt;/code&gt;) is added to the session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agent 1 (&lt;code&gt;IntroAgent&lt;/code&gt;) Execution:&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* Receives the user's audio stream.
* Uses its configured STT, LLM (e.g., GPT-4o-mini), and TTS to interact according to its specific `instructions` (gather name and location).
* The LLM identifies when the required information is gathered.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Agent Handoff:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* The LLM triggers a specific function call defined on `IntroAgent` (e.g., `information_gathered`).
* This function receives the gathered data (`name`, `location`).
* It **updates the shared `userdata`** within the `AgentSession`.
* It **creates an instance of the *next* agent** (`StoryAgent`), potentially passing it the gathered data or the existing chat context.
* It **returns the `StoryAgent` instance** to the `AgentSession`.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Agent 2 (&lt;code&gt;StoryAgent&lt;/code&gt;) Activation:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* The `AgentSession` replaces `IntroAgent` with `StoryAgent`.
* `StoryAgent`'s `on_enter` method might be called to kick off its part of the conversation.
* It now handles the user's audio stream, using its *own* potentially different configuration (e.g., real-time OpenAI LLM with built-in voice) and `instructions` (tell a story using the name/location from `userdata`).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="5"&gt;
&lt;li&gt;&lt;strong&gt;Further Interaction &amp;amp; Termination:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* The `StoryAgent` interacts with the user.
* When the story concludes (potentially triggered by another function call like `story_finished`), the agent can generate a final message and even initiate disconnecting the user or closing the LiveKit room.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feibx71borgo1xvoagvwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feibx71borgo1xvoagvwj.png" alt="Simple diagram showing User -&amp;gt; AgentSession -&amp;gt; Agent 1 &amp;lt;-&amp;gt; Agent 2" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;(Diagram: User interacts with the AgentSession, which activates Agent 1 or Agent 2. Agents can access/modify shared userdata and trigger handoffs.)&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Let's Build: The Multi-Agent Storyteller (Python)
&lt;/h2&gt;

&lt;p&gt;Let's break down the key parts of the provided code. (A link to the official LiveKit repository appears at the end.)&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Prerequisites:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8+&lt;/li&gt;
&lt;li&gt;LiveKit Cloud account (&lt;a href="https://cloud.livekit.io/" rel="noopener noreferrer"&gt;free tier&lt;/a&gt;) or self-hosted server (URL, API Key, API Secret).&lt;/li&gt;
&lt;li&gt;OpenAI API Key.&lt;/li&gt;
&lt;li&gt;Deepgram API Key.&lt;/li&gt;
&lt;li&gt;(Optional) Silero VAD model (downloaded automatically by the plugin).&lt;/li&gt;
&lt;li&gt;Install libraries:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"livekit-agents[openai,silero,deepgram]~=1.0rc"&lt;/span&gt; python-dotenv &lt;span class="c"&gt;# Add any other plugins you might use&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  2. Setup (&lt;code&gt;.env&lt;/code&gt; file):
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env&lt;/span&gt;
&lt;span class="nv"&gt;LIVEKIT_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"wss://YOUR_PROJECT_URL.livekit.cloud"&lt;/span&gt;
&lt;span class="nv"&gt;LIVEKIT_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"YOUR_LK_API_KEY"&lt;/span&gt;
&lt;span class="nv"&gt;LIVEKIT_API_SECRET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"YOUR_LK_API_SECRET"&lt;/span&gt;

&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-..."&lt;/span&gt;
&lt;span class="nv"&gt;DEEPGRAM_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;
&lt;span class="c"&gt;# Optional webhook for StoryAgent's finish function if needed&lt;/span&gt;
&lt;span class="c"&gt;# CRM_WEBHOOK_URL="..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Core Agent Code Breakdown (agent.py):
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Imports:&lt;/strong&gt; Loads essential Python standard libraries (&lt;code&gt;logging&lt;/code&gt;, &lt;code&gt;dataclasses&lt;/code&gt;, &lt;code&gt;typing&lt;/code&gt;, &lt;code&gt;dotenv&lt;/code&gt;) and core components from the &lt;code&gt;livekit-agents&lt;/code&gt; SDK:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Agent&lt;/code&gt;, &lt;code&gt;AgentSession&lt;/code&gt;, and &lt;code&gt;RunContext&lt;/code&gt;: Power the lifecycle and logic of agents.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RoomInputOptions&lt;/code&gt; and &lt;code&gt;RoomOutputOptions&lt;/code&gt;: Configure how audio is handled in the LiveKit room.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;function_tool&lt;/code&gt;: Exposes Python functions as tools the LLM can call as part of the agent logic.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deepgram&lt;/code&gt;, &lt;code&gt;openai&lt;/code&gt;, &lt;code&gt;silero&lt;/code&gt;: Pre-built plugin integrations for STT (speech-to-text), LLM (language model), TTS (text-to-speech), and VAD (voice activity detection).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup gives you full control and &lt;strong&gt;modularity&lt;/strong&gt; over what models and services each agent uses for voice and language tasks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;livekit&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;livekit.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;JobProcess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RoomInputOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RoomOutputOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RunContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;WorkerOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cli&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;livekit.agents.job&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_current_job_context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;livekit.agents.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;function_tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;livekit.agents.voice&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MetricsCollectedEvent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;livekit.plugins&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deepgram&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;silero&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logger &amp;amp; Environment Setup:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;logger = logging.getLogger("multi-agent")&lt;/code&gt;: Sets up a logger instance named &lt;code&gt;"multi-agent"&lt;/code&gt; for outputting logs throughout the agent lifecycle. Helpful for debugging and usage tracking.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load_dotenv()&lt;/code&gt;: Loads environment variables from a &lt;code&gt;.env&lt;/code&gt; file, allowing credentials or config values (e.g., API keys for OpenAI or Deepgram) to be securely managed outside the code.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instruction Prompt:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;common_instructions&lt;/code&gt;: A base prompt string used to define the persona and behavior of both agents. In this case, the agent is introduced as &lt;strong&gt;"Echo"&lt;/strong&gt;, a friendly and curious storyteller who interacts via voice.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;common_instructions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your name is Echo. You are a story teller that interacts with the user via voice.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;StoryData&lt;/code&gt; &lt;strong&gt;Dataclass:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StoryData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple structure holds the shared state (user's name and location) that needs to persist between agents. It's attached to the &lt;code&gt;AgentSession&lt;/code&gt;.&lt;/p&gt;
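&lt;p&gt;Since both fields default to &lt;code&gt;None&lt;/code&gt;, the session can create the state object before anything has been gathered. A minimal sketch of the pattern, shown outside the framework:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoryData:
    name: Optional[str] = None
    location: Optional[str] = None

data = StoryData()        # created empty when the session starts
assert data.name is None  # nothing gathered yet

# IntroAgent's function tool fills these in mid-conversation;
# StoryAgent reads them later from the same shared instance.
data.name = "Ada"
data.location = "London"
print(data)  # StoryData(name='Ada', location='London')
```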

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IntroAgent&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;__init__&lt;/code&gt;: Sets specific &lt;code&gt;instructions&lt;/code&gt; to gather name and location. Uses default models configured later in the &lt;code&gt;AgentSession&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_enter&lt;/code&gt;: Called when this agent becomes active. It immediately prompts the LLM to generate a reply based on its instructions (the introduction).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;information_gathered&lt;/code&gt; &lt;strong&gt;Function Tool&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IntroAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;common_instructions&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; Your goal is to gather a few pieces of &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;information from the user to make the story personalized and engaging.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You should ask the user for their name and where they are from.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Start the conversation with a short introduction.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_enter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# when the agent is added to the session, it'll generate a reply
&lt;/span&gt;        &lt;span class="c1"&gt;# according to its instructions
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_reply&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@function_tool&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;information_gathered&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RunContext&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;StoryData&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Called when the user has provided the information needed to make the story
        personalized and engaging.

        Args:
            name: The name of the user
            location: The location of the user
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;

        &lt;span class="n"&gt;story_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StoryAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# by default, StoryAgent will start with a new context, to carry through the current
&lt;/span&gt;        &lt;span class="c1"&gt;# chat history, pass in the chat_ctx
&lt;/span&gt;        &lt;span class="c1"&gt;# story_agent = StoryAgent(name, location, chat_ctx=context.chat_ctx)
&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;story_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Let&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s start the story!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the key handoff mechanism: the LLM calls this function, which updates the shared &lt;code&gt;userdata&lt;/code&gt;, creates a &lt;code&gt;StoryAgent&lt;/code&gt;, and returns it to the session.&lt;/p&gt;
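&lt;p&gt;Stripped of the LiveKit machinery, the handoff boils down to a tool function that mutates shared state and returns the next agent. A simplified pure-Python sketch (the class and method names mirror the example above, but this is not the framework's API):&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoryData:
    name: Optional[str] = None
    location: Optional[str] = None

class StoryAgent:
    def on_enter(self, userdata: StoryData) -> str:
        # The second agent reads the state the first agent wrote.
        return f"Once upon a time, {userdata.name} from {userdata.location}..."

class IntroAgent:
    def information_gathered(self, userdata: StoryData, name: str, location: str):
        # Mutate shared state, then hand off by returning the next agent.
        userdata.name = name
        userdata.location = location
        return StoryAgent(), "Let's start the story!"

# The session owns one userdata object and swaps the active agent whenever
# a tool call returns a new agent instance.
userdata = StoryData()
active_agent, reply = IntroAgent().information_gathered(userdata, "Ada", "London")
print(reply)
print(active_agent.on_enter(userdata))
```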

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;StoryAgent&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;__init__&lt;/code&gt;: Takes &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;location&lt;/code&gt; (retrieved from &lt;code&gt;userdata&lt;/code&gt; by the caller). Sets its own &lt;code&gt;instructions&lt;/code&gt; incorporating this data.&lt;/li&gt;
&lt;li&gt;Crucially, it overrides the LLM to use &lt;code&gt;openai.realtime.RealtimeModel&lt;/code&gt; (which includes voice output) and sets &lt;code&gt;tts=None&lt;/code&gt;. This shows agent-specific model configuration.&lt;/li&gt;
&lt;li&gt;It can optionally receive the &lt;code&gt;chat_ctx&lt;/code&gt; to continue the history.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_enter&lt;/code&gt;: Similar to &lt;code&gt;IntroAgent&lt;/code&gt;, starts the interaction.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;story_finished&lt;/code&gt; &lt;strong&gt;Function Tool&lt;/strong&gt;: Allows the LLM to signal the end of the story, generate a goodbye, and terminate the room via the LiveKit API.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StoryAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;common_instructions&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. You should use the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s information in &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order to make the story personalized.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create the entire story, weaving in elements of their information, and make it &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interactive, occasionally interating with the user.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;do not end on a statement, where the user is not expected to respond.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;when interrupted, ask if the user would like to continue or end.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s name is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# each agent could override any of the model services, including mixing
&lt;/span&gt;            &lt;span class="c1"&gt;# realtime and non-realtime models
&lt;/span&gt;            &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;realtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RealtimeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;echo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;chat_ctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chat_ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_enter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# when the agent is added to the session, we'll initiate the conversation by
&lt;/span&gt;        &lt;span class="c1"&gt;# using the LLM to generate a reply
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_reply&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@function_tool&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;story_finished&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RunContext&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;StoryData&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;When you are fininshed telling the story (and the user confirms they don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t
        want anymore), call this function to end the conversation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# interrupt any existing generation
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;interrupt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# generate a goodbye message and hang up
&lt;/span&gt;        &lt;span class="c1"&gt;# awaiting it will ensure the message is played out before returning
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_reply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;say goodbye to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allow_interruptions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;job_ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_current_job_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;lkapi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;job_ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;lkapi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_room&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DeleteRoomRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;prewarm&lt;/strong&gt; &lt;strong&gt;Function&lt;/strong&gt;: Loads the Silero VAD (Voice Activity Detection) model once when the worker starts, avoiding redundant loading for each session.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prewarm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JobProcess&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vad&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;silero&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VAD&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;entrypoint&lt;/strong&gt; &lt;strong&gt;Function&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Connects to the LiveKit room.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creates the &lt;code&gt;AgentSession&lt;/code&gt;&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Passes the prewarmed VAD.&lt;/li&gt;
&lt;li&gt;Sets the default STT, LLM, and TTS plugins (agents can override these).&lt;/li&gt;
&lt;li&gt;Initializes the shared &lt;code&gt;userdata&lt;/code&gt; with an empty &lt;code&gt;StoryData&lt;/code&gt; instance.&lt;/li&gt;
&lt;li&gt;Sets up metrics collection (good practice!).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starts the session&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Crucially, passes the initial agent (&lt;code&gt;IntroAgent()&lt;/code&gt;) to the &lt;code&gt;start&lt;/code&gt; method.&lt;/li&gt;
&lt;li&gt;Configures room input/output options (like noise cancellation or transcription).&lt;/li&gt;
&lt;li&gt;The worker then keeps the agent process alive for the duration of the session.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;entrypoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;StoryData&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;
        &lt;span class="n"&gt;vad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vad&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="c1"&gt;# any combination of STT, LLM, TTS, or realtime API can be used
&lt;/span&gt;        &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;deepgram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;STT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nova-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TTS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;echo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini-tts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StoryData&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# log metrics as they are emitted, and total usage after session is over
&lt;/span&gt;    &lt;span class="n"&gt;usage_collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;UsageCollector&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@session.on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics_collected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_on_metrics_collected&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MetricsCollectedEvent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;usage_collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_usage&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;usage_collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Usage: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_shutdown_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_usage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;IntroAgent&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;room_input_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RoomInputOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="c1"&gt;# uncomment to enable Krisp BVC noise cancellation
&lt;/span&gt;            &lt;span class="c1"&gt;# noise_cancellation=noise_cancellation.BVC(),
&lt;/span&gt;        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;room_output_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RoomOutputOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcription_enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cli&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;WorkerOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entrypoint_fnc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prewarm_fnc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prewarm&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Running &amp;amp; Connecting:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Run the agent: &lt;code&gt;python agent.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Connect using the Agent Playground (link below) or your own client. Point it at your LiveKit instance and make sure you join the room the agent is listening on (usually determined by how the agent job is launched or configured).&lt;/li&gt;
&lt;/ul&gt;
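&lt;p&gt;A minimal launch sketch (assuming a standard LiveKit Agents setup; the environment variable names and CLI subcommands below are the usual defaults exposed by &lt;code&gt;cli.run_app&lt;/code&gt;, so verify them against your framework version):&lt;/p&gt;

```shell
# Credentials for your LiveKit instance (Cloud or self-hosted)
export LIVEKIT_URL="wss://your-project.livekit.cloud"
export LIVEKIT_API_KEY="your-api-key"
export LIVEKIT_API_SECRET="your-api-secret"

# Keys for the plugins used in this example
export OPENAI_API_KEY="your-openai-key"
export DEEPGRAM_API_KEY="your-deepgram-key"

# Hot-reloading development mode; use `start` for production workers
python agent.py dev
```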




&lt;h3&gt;
  
  
  Why This Multi-Agent Approach Rocks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🧩 &lt;strong&gt;Modular Roles&lt;/strong&gt;: Each agent focuses on a specific task with its own instructions and even different AI models.&lt;/li&gt;
&lt;li&gt;🧼 &lt;strong&gt;Clean State Management&lt;/strong&gt;: &lt;code&gt;userdata&lt;/code&gt; provides a clear way to share necessary information between agents.&lt;/li&gt;
&lt;li&gt;🔁 &lt;strong&gt;Seamless Handoffs&lt;/strong&gt;: Function calls provide a natural mechanism for transitioning conversational stages.&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Low Latency&lt;/strong&gt;: Still benefits from WebRTC’s real-time streaming.&lt;/li&gt;
&lt;li&gt;🧠 &lt;strong&gt;Flexibility&lt;/strong&gt;: Mix and match standard and real-time models, different STT/TTS providers per agent.&lt;/li&gt;
&lt;li&gt;🏗️ &lt;strong&gt;Scalable&lt;/strong&gt;: Built on the robust LiveKit infrastructure.&lt;/li&gt;
&lt;/ul&gt;
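&lt;p&gt;To see the handoff-plus-shared-state pattern in isolation, here is a framework-free Python sketch (all names are illustrative; none of this is the LiveKit API): a tool-style handler writes to a shared dataclass and returns the next agent, mirroring how the intro agent hands off to &lt;code&gt;StoryAgent&lt;/code&gt;:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

# Shared state, analogous to the session's userdata.
@dataclass
class StoryData:
    name: Optional[str] = None
    location: Optional[str] = None

class StoryAgent:
    def opening_line(self, userdata):
        # Reads the state written by the previous agent.
        return f"Once upon a time, {userdata.name} of {userdata.location} set out on an adventure."

class IntroAgent:
    def information_gathered(self, userdata, name, location):
        # Store what the user told us, then hand off by returning the next agent.
        userdata.name = name
        userdata.location = location
        return StoryAgent()

data = StoryData()
agent = IntroAgent().information_gathered(data, "Ada", "London")
print(agent.opening_line(data))
# → Once upon a time, Ada of London set out on an adventure.
```

&lt;p&gt;Returning the next agent from the handler keeps the transition logic next to the data that triggered it, which is roughly the shape the framework's function tools follow.&lt;/p&gt;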




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Managing complex, multi-stage voice conversations requires moving beyond simple request-response cycles. The LiveKit Agents framework, built on the real-time foundation of WebRTC, provides elegant solutions for orchestrating multiple agents, managing shared state, and facilitating smooth handoffs – all while maintaining low latency.&lt;/p&gt;

&lt;p&gt;This storyteller example showcases the power of this approach, allowing different AI "personalities" or specialists to collaborate within a single, natural-feeling voice session.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dive deeper into the &lt;a href="https://docs.livekit.io/agents" rel="noopener noreferrer"&gt;LiveKit Agents Documentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Explore the full &lt;a href="https://github.com/livekit/agents/blob/dev-1.0/examples/voice_agents/multi_agent.py" rel="noopener noreferrer"&gt;multi-agent example code&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For a ready-made client, explore the &lt;a href="https://github.com/livekit/agents-playground/tree/main" rel="noopener noreferrer"&gt;agent-playground example code&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Sign up for &lt;a href="https://cloud.livekit.io" rel="noopener noreferrer"&gt;LiveKit Cloud&lt;/a&gt; to get started quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What kind of multi-agent voice interactions would you build with this? Share your ideas in the comments!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>voiceai</category>
      <category>agents</category>
      <category>openai</category>
    </item>
  </channel>
</rss>
