<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Roman Piacquadio</title>
    <description>The latest articles on DEV Community by Roman Piacquadio (@roman_piacquadio).</description>
    <link>https://dev.to/roman_piacquadio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2981858%2F0d97658c-b5e9-4145-941f-81e97b82fa1e.jpg</url>
      <title>DEV Community: Roman Piacquadio</title>
      <link>https://dev.to/roman_piacquadio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/roman_piacquadio"/>
    <language>en</language>
    <item>
      <title>How Much Does It Really Cost to Run a Voice-AI Agent at Scale?</title>
      <dc:creator>Roman Piacquadio</dc:creator>
      <pubDate>Tue, 20 May 2025 17:00:13 +0000</pubDate>
      <link>https://dev.to/cloudx/how-much-does-it-really-cost-to-run-a-voice-ai-agent-at-scale-8en</link>
      <guid>https://dev.to/cloudx/how-much-does-it-really-cost-to-run-a-voice-ai-agent-at-scale-8en</guid>
      <description>&lt;h2&gt;
  
  
  1) Why Voice Automation Is Worth Investigating (Even If You’re Not Replacing Humans)
&lt;/h2&gt;

&lt;p&gt;Voice automation has made significant progress in recent years, powered by improvements in transcription, real-time audio routing, and large language models. What was once a clunky, robotic experience is now capable of holding fluent, natural-sounding conversations with real people—across a variety of use cases.&lt;/p&gt;

&lt;p&gt;This doesn’t mean we’re replacing human agents. Quite the opposite: automation lets us offload the &lt;em&gt;repetitive, high-frequency, low-complexity tasks&lt;/em&gt; that tend to burn out human teams, and instead reserve human attention for edge cases, escalations, and creative problem solving.&lt;/p&gt;

&lt;p&gt;Whether you're handling inbound customer service, qualifying outbound leads, or proactively following up on time-sensitive actions, a well-orchestrated voice-AI pipeline can free up valuable resources—&lt;em&gt;if the economics make sense&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That last part is key. Many developers assume that using AI to automate voice interactions is inherently cost-effective. But is it really cheaper than staffing a team? How much do you actually pay per call once you add up every component: speech recognition, synthesis, telephony, model inference, and orchestration?&lt;/p&gt;

&lt;p&gt;This article takes a grounded approach to that question. We'll break down a full-stack voice-AI system—from SIP trunk to final response—and price each piece out based on real-world usage. To make it concrete, we'll walk through a common use case: outbound phone calls of around 3 minutes in duration. But the same framework applies to inbound routing, reminders, surveys, or any other automated voice interaction.&lt;/p&gt;

&lt;p&gt;Let’s dig into how it all connects—and what it really costs to make it work at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  2) System Architecture: How a Voice-AI Pipeline Works End-to-End
&lt;/h2&gt;

&lt;p&gt;Before jumping into costs, it’s important to understand how the system is architected. At a high level, a voice-based AI interaction consists of several components working together in real time to process speech, generate responses, and keep the conversation flowing naturally.&lt;/p&gt;

&lt;p&gt;Here’s a simplified view of the architecture:&lt;/p&gt;

&lt;p&gt;Caller (PSTN or SIP)&lt;br&gt;
⬇️&lt;br&gt;
SIP Trunk (Twilio or Telnyx)&lt;br&gt;
⬇️&lt;br&gt;
LiveKit (Voice orchestration &amp;amp; media routing)&lt;br&gt;
⬇️&lt;br&gt;
Speech-to-Text (Deepgram)&lt;br&gt;
⬇️&lt;br&gt;
Language Model (OpenAI GPT-4.1 mini)&lt;br&gt;
⬇️&lt;br&gt;
Text-to-Speech (ElevenLabs or Cartesia)&lt;br&gt;
⬇️&lt;br&gt;
LiveKit (returns audio back)&lt;br&gt;
⬇️&lt;br&gt;
Caller&lt;/p&gt;

&lt;p&gt;Each component has a distinct role:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SIP Trunk (Twilio or Telnyx):&lt;/strong&gt; Bridges public phone lines to our backend system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiveKit:&lt;/strong&gt; Manages the real-time audio streams and orchestrates audio I/O between services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deepgram:&lt;/strong&gt; Transcribes what the user says into text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4.1 mini:&lt;/strong&gt; Interprets the message and generates a textual response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text-to-Speech (TTS):&lt;/strong&gt; Converts the AI response into natural-sounding speech.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiveKit (again):&lt;/strong&gt; Sends the generated audio back to the caller, closing the loop.&lt;/li&gt;
&lt;/ul&gt;
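
&lt;p&gt;The loop these components form can be sketched in a few lines of Python. The function names below are &lt;strong&gt;illustrative stubs only&lt;/strong&gt; (the real providers expose streaming SDKs with different interfaces), but the data flow they model is the STT → LLM → TTS cycle described above.&lt;/p&gt;

```python
# Sketch of one conversational turn: STT, then LLM, then TTS.
# All three helpers are hypothetical stand-ins, NOT real SDK calls.

def transcribe(audio_chunk: bytes) -> str:
    """Stand-in for Deepgram streaming speech-to-text."""
    return "I'd like to confirm my appointment."

def generate_reply(system_prompt: str, history: list, user_text: str) -> str:
    """Stand-in for a GPT-4.1 mini chat completion."""
    return "You're confirmed for tomorrow at 10 AM."

def synthesize(text: str) -> bytes:
    """Stand-in for ElevenLabs / Cartesia text-to-speech."""
    return text.encode("utf-8")

def handle_turn(audio_chunk: bytes, history: list, system_prompt: str) -> bytes:
    """One turn of the pipeline: transcribe, respond, speak."""
    user_text = transcribe(audio_chunk)
    reply = generate_reply(system_prompt, history, user_text)
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": reply})
    return synthesize(reply)

history = []
audio_out = handle_turn(b"\x00\x01", history, "You are a scheduling assistant.")
```

&lt;p&gt;In production each stage runs as a stream rather than a blocking call, which is precisely what LiveKit orchestrates.&lt;/p&gt;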

&lt;p&gt;If you're interested in building this kind of setup yourself, I’ve written a few posts on setting up each component individually (from SIP trunking to LiveKit integration). You can check those out here: &lt;a href="https://dev.to/roman_piacquadio/series/31126"&gt;https://dev.to/roman_piacquadio/series/31126&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the next section, we’ll define the assumptions behind our usage model so we can start assigning real-world costs to each of these layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  3) Baseline Assumptions for Cost Estimation
&lt;/h2&gt;

&lt;p&gt;To make the cost analysis meaningful, we need a realistic usage scenario. We’ll base our calculations on the following assumptions, which simulate a small team operating at steady scale:&lt;/p&gt;

&lt;h3&gt;
  
  
  📞 Call Volume
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each automated call lasts &lt;strong&gt;3 minutes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The AI system handles &lt;strong&gt;100 calls per day&lt;/strong&gt;, simulating the workload of one full-time agent.&lt;/li&gt;
&lt;li&gt;We simulate &lt;strong&gt;10 AI agents&lt;/strong&gt;, running &lt;strong&gt;22 business days per month&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;That results in &lt;strong&gt;22,000 calls per month&lt;/strong&gt;, totaling &lt;strong&gt;66,000 minutes of audio&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
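
&lt;p&gt;Since every cost in the sections that follow is derived from these two numbers, it’s worth writing the multiplication down once:&lt;/p&gt;

```python
# Usage model: 10 simulated agents, each handling 100 three-minute
# calls per day, over 22 business days per month.
CALL_MINUTES = 3
CALLS_PER_AGENT_PER_DAY = 100
AGENTS = 10
BUSINESS_DAYS = 22

calls_per_month = CALLS_PER_AGENT_PER_DAY * AGENTS * BUSINESS_DAYS  # 22,000
audio_minutes = calls_per_month * CALL_MINUTES                      # 66,000
```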

&lt;h3&gt;
  
  
  🗣️ Talk Time Distribution
&lt;/h3&gt;

&lt;p&gt;In a natural two-way conversation, only one party speaks at a time. For simplicity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;AI speaks for ~50%&lt;/strong&gt; of the time → &lt;strong&gt;33,000 minutes of TTS&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;caller (human)&lt;/strong&gt; speaks for the other 50% → &lt;strong&gt;33,000 minutes of STT&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This split is a reasonable average for structured interactions such as confirmations, reminders, qualification flows, or outbound follow-ups.&lt;/p&gt;

&lt;h3&gt;
  
  
  🤖 Model and Service Selection
&lt;/h3&gt;

&lt;p&gt;We picked tools that balance &lt;strong&gt;quality&lt;/strong&gt;, &lt;strong&gt;latency&lt;/strong&gt;, and &lt;strong&gt;cost-effectiveness&lt;/strong&gt; for production-level use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LiveKit&lt;/strong&gt;: to orchestrate real-time audio and SIP integration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deepgram Nova-2 (Enterprise)&lt;/strong&gt;: for fast, accurate streaming transcription with low per-minute cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4.1 mini&lt;/strong&gt;: a lightweight, affordable OpenAI model that still delivers strong reasoning and fluency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text-to-Speech&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ElevenLabs (Business plan)&lt;/strong&gt;: premium voices with emotional range and expressiveness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cartesia (Scale plan)&lt;/strong&gt;: lower-cost alternative for simpler use cases.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;SIP Trunking&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Twilio&lt;/strong&gt;: simple and widely used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telnyx&lt;/strong&gt;: cost-competitive with flexible routing.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This setup lets us explore the full range of pricing scenarios—from economy stacks to premium voice experiences—while keeping the core system consistent.&lt;/p&gt;

&lt;h2&gt;
  
  
  4) LiveKit — Orchestrating the Audio Layer
&lt;/h2&gt;

&lt;p&gt;At the center of our voice pipeline is &lt;strong&gt;LiveKit&lt;/strong&gt;, which handles real-time audio routing between the SIP trunk, transcription, TTS, and back to the caller. It’s the glue that makes low-latency, bidirectional communication possible.&lt;/p&gt;

&lt;p&gt;LiveKit offers a generous &lt;strong&gt;Scale Plan&lt;/strong&gt; designed for production workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base cost:&lt;/strong&gt; $500/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Includes:&lt;/strong&gt; 45,000 minutes of usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overage rate:&lt;/strong&gt; $0.003 per additional minute&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔢 Usage Breakdown
&lt;/h3&gt;

&lt;p&gt;With 22,000 calls per month at 3 minutes each, we consume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;66,000 total minutes&lt;/strong&gt; of LiveKit usage (audio flowing in and out).&lt;/li&gt;
&lt;li&gt;We exceed the plan’s included minutes by &lt;strong&gt;21,000 minutes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Overage cost: 21,000 × $0.003 = &lt;strong&gt;$63&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  💰 Total Monthly Cost for LiveKit
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Base plan: $500&lt;/li&gt;
&lt;li&gt;Overage: $63&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: $563/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🧮 Cost per Unit
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$563 ÷ 22,000 = &lt;strong&gt;$0.0256&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per hour&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20 calls/hour → 20 × $0.0256 = &lt;strong&gt;$0.512&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiveKit’s pricing scales linearly with usage and is well-suited for handling concurrent calls without needing to manage media servers manually.&lt;/p&gt;
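
&lt;p&gt;The figures above follow directly from the plan terms and can be reproduced with a short calculation:&lt;/p&gt;

```python
# LiveKit Scale plan: $500/month including 45,000 minutes,
# then $0.003 per additional minute.
BASE_COST = 500.00
INCLUDED_MINUTES = 45_000
OVERAGE_RATE = 0.003

total_minutes = 66_000
calls = 22_000

overage = max(0, total_minutes - INCLUDED_MINUTES) * OVERAGE_RATE  # $63
monthly = BASE_COST + overage                                      # $563
per_call = monthly / calls                                         # ~$0.0256
per_hour = per_call * 20                                           # ~$0.512
```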

&lt;blockquote&gt;
&lt;p&gt;Pricing based on public rates as of &lt;strong&gt;May 2025&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  5) Speech-to-Text — Transcribing the Human Side with Deepgram
&lt;/h2&gt;

&lt;p&gt;To understand what the caller says, we need fast and accurate transcription. For this, we use &lt;strong&gt;Deepgram&lt;/strong&gt;, a popular speech-to-text (STT) provider known for real-time streaming, multilingual support, and competitive enterprise pricing.&lt;/p&gt;

&lt;p&gt;We selected the &lt;strong&gt;Nova-2 model (Enterprise tier)&lt;/strong&gt; for its balance of speed, accuracy, and affordability.&lt;/p&gt;

&lt;h3&gt;
  
  
  🎧 Why Streaming Matters
&lt;/h3&gt;

&lt;p&gt;In a voice AI pipeline, latency is critical. Transcription must happen as the user speaks—not after they’ve finished—so the AI can respond naturally in near real-time.&lt;/p&gt;

&lt;p&gt;Streaming STT allows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower response times (smoother dialogue)&lt;/li&gt;
&lt;li&gt;Efficient token handling for the language model&lt;/li&gt;
&lt;li&gt;Better support for interruptions and barge-in behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deepgram’s Nova-2 model supports all of this out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  💰 Enterprise Pricing (Nova-2)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rate:&lt;/strong&gt; $0.0047 per minute (Enterprise tier, streaming)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Applicable usage:&lt;/strong&gt; Only transcribing the &lt;strong&gt;human side&lt;/strong&gt; of the call (50% of total time)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monthly usage:&lt;/strong&gt; 22,000 calls × 1.5 min (caller talk time) = &lt;strong&gt;33,000 minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🧮 Cost Breakdown
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1.5 min × $0.0047 = &lt;strong&gt;$0.00705&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per hour&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20 calls/hour → 20 × $0.00705 = &lt;strong&gt;$0.141&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total per month (10 agents)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;33,000 min × $0.0047 = &lt;strong&gt;$155.10&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This STT layer is one of the more affordable components of the pipeline, thanks to Deepgram’s usage-based pricing and efficient streaming infrastructure.&lt;/p&gt;
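
&lt;p&gt;Because only the caller’s half of each call is transcribed, the STT bill scales with caller talk time rather than total call time:&lt;/p&gt;

```python
# Deepgram Nova-2, Enterprise streaming rate.
RATE_PER_MIN = 0.0047
CALLER_MINUTES_PER_CALL = 1.5    # caller speaks ~50% of a 3-minute call
CALLS_PER_MONTH = 22_000

per_call = CALLER_MINUTES_PER_CALL * RATE_PER_MIN   # ~$0.00705
per_hour = per_call * 20                            # ~$0.141
monthly = CALLS_PER_MONTH * per_call                # ~$155.10
```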

&lt;blockquote&gt;
&lt;p&gt;Pricing based on public Enterprise rates as of &lt;strong&gt;May 2025&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  6) Language Model — Token Costs with GPT-4.1 mini
&lt;/h2&gt;

&lt;p&gt;The core of any conversational AI system is the language model that generates responses. In our setup, we use &lt;strong&gt;OpenAI’s GPT-4.1 mini&lt;/strong&gt;, which offers a great trade-off between intelligence, latency, and price.&lt;/p&gt;

&lt;p&gt;Unlike flat-rate pricing, token billing in GPT models varies depending on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input tokens&lt;/strong&gt; (the prompt + conversation history)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output tokens&lt;/strong&gt; (the generated reply)&lt;/li&gt;
&lt;li&gt;Whether input tokens are &lt;strong&gt;cached&lt;/strong&gt; (like a static system prompt) or &lt;strong&gt;non-cached&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s break that down.&lt;/p&gt;




&lt;h3&gt;
  
  
  📥 Input Tokens
&lt;/h3&gt;

&lt;p&gt;Each user message builds on previous context. So as the conversation progresses, the number of input tokens increases with every new request.&lt;/p&gt;

&lt;p&gt;For a 3-minute call with &lt;strong&gt;6 back-and-forth exchanges&lt;/strong&gt;, we estimate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompt:&lt;/strong&gt; ~2,000 tokens (sent with every request, but cacheable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation context:&lt;/strong&gt; grows with each turn (~975 non-cached tokens in total across the call)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total input tokens per call:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cached input:&lt;/strong&gt; 6 × 2,000 = 12,000 tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-cached input:&lt;/strong&gt; ~975 tokens&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h3&gt;
  
  
  📤 Output Tokens
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The model responds 6 times (one per turn), with ~35 tokens per reply&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total output tokens per call:&lt;/strong&gt; ~210 tokens&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  💰 GPT-4.1 mini Pricing (May 2025)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Token Type&lt;/th&gt;
&lt;th&gt;Rate per million&lt;/th&gt;
&lt;th&gt;Per-token cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cached input&lt;/td&gt;
&lt;td&gt;$0.10 / 1M&lt;/td&gt;
&lt;td&gt;$0.00000010&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regular input&lt;/td&gt;
&lt;td&gt;$0.40 / 1M&lt;/td&gt;
&lt;td&gt;$0.00000040&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;$1.60 / 1M&lt;/td&gt;
&lt;td&gt;$0.00000160&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  🧮 Cost Breakdown (Per Call)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Token Type&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cached input&lt;/td&gt;
&lt;td&gt;12,000&lt;/td&gt;
&lt;td&gt;$0.10 / 1M&lt;/td&gt;
&lt;td&gt;$0.00120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-cached input&lt;/td&gt;
&lt;td&gt;975&lt;/td&gt;
&lt;td&gt;$0.40 / 1M&lt;/td&gt;
&lt;td&gt;$0.00039&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output&lt;/td&gt;
&lt;td&gt;210&lt;/td&gt;
&lt;td&gt;$1.60 / 1M&lt;/td&gt;
&lt;td&gt;$0.000336&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.001926&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  📊 Aggregate Costs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.001926&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per hour&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20 × $0.001926 = &lt;strong&gt;$0.0385&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total per month (10 agents)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;22,000 × $0.001926 = &lt;strong&gt;$42.37&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
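
&lt;p&gt;The per-call and monthly figures above can be reproduced from the token counts and rates:&lt;/p&gt;

```python
# GPT-4.1 mini token model for one 3-minute, 6-turn call.
TURNS = 6
CACHED_INPUT = TURNS * 2_000   # ~2,000-token system prompt, resent each turn
REGULAR_INPUT = 975            # conversation context, total across the call
OUTPUT = TURNS * 35            # ~35 tokens per reply

RATE_CACHED = 0.10 / 1_000_000
RATE_INPUT = 0.40 / 1_000_000
RATE_OUTPUT = 1.60 / 1_000_000

per_call = (CACHED_INPUT * RATE_CACHED
            + REGULAR_INPUT * RATE_INPUT
            + OUTPUT * RATE_OUTPUT)      # ~$0.001926
monthly = per_call * 22_000              # ~$42.37
```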




&lt;p&gt;GPT-4.1 mini’s pricing structure rewards careful prompt engineering and context management. While the per-call cost is low, the input token growth curve makes it important to minimize unnecessary repetition in multi-turn conversations (e.g. trimming old exchanges or summarizing history).&lt;/p&gt;
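
&lt;p&gt;One common way to flatten that input-token growth curve is to cap the history sent with each request. This is a minimal sketch; a production system would count tokens with a tokenizer and summarize dropped turns rather than simply discard them:&lt;/p&gt;

```python
# Keep the system prompt plus only the most recent exchanges,
# bounding input tokens per request in long conversations.

def trim_history(messages, max_turns=3):
    """Keep the system message and the last `max_turns` user/assistant pairs."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * max_turns:]

msgs = [{"role": "system", "content": "You are a scheduling assistant."}]
for i in range(6):
    msgs.append({"role": "user", "content": f"user turn {i}"})
    msgs.append({"role": "assistant", "content": f"reply {i}"})

trimmed = trim_history(msgs, max_turns=3)
```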

&lt;blockquote&gt;
&lt;p&gt;Pricing based on OpenAI’s GPT-4.1 mini public rates as of &lt;strong&gt;May 2025&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  7) Text-to-Speech — Choosing Between ElevenLabs and Cartesia
&lt;/h2&gt;

&lt;p&gt;For the AI to respond naturally, we need to convert the model’s text output into high-quality speech. In our analysis, we compared two leading &lt;strong&gt;Text-to-Speech (TTS)&lt;/strong&gt; providers: &lt;strong&gt;ElevenLabs&lt;/strong&gt; and &lt;strong&gt;Cartesia&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Both platforms deliver excellent results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🗣️ &lt;strong&gt;Multilingual support&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🎭 &lt;strong&gt;Voice cloning capabilities&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Optimized for low-latency streaming&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key differences lie in pricing and voice variety.&lt;/p&gt;

&lt;h3&gt;
  
  
  🧬 ElevenLabs (Business Plan)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extensive voice library&lt;/strong&gt;, including highly expressive and emotional voices.&lt;/li&gt;
&lt;li&gt;Well-suited for customer-facing conversations where tone and nuance matter.&lt;/li&gt;
&lt;li&gt;Plan includes &lt;strong&gt;22,000 minutes&lt;/strong&gt; for &lt;strong&gt;$1,320/month&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overage minutes&lt;/strong&gt; cost $0.06/min.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We need &lt;strong&gt;33,000 minutes&lt;/strong&gt; per month (AI speaks ~50% of every call):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Base: $1,320
&lt;/li&gt;
&lt;li&gt;Overage: 11,000 × $0.06 = $660
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: $1,980/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🧪 Cartesia (Scale Plan)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Smaller voice library, but &lt;strong&gt;still multilingual and highly intelligible&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;More cost-effective for less expressive use cases.&lt;/li&gt;
&lt;li&gt;Estimated cost: &lt;strong&gt;$0.0299/min&lt;/strong&gt; under the Scale plan.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Monthly cost for 33,000 minutes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;33,000 × $0.0299 ≈ &lt;strong&gt;$986.70/month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🧮 Cost Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Per Call&lt;/th&gt;
&lt;th&gt;Per Hour&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElevenLabs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.06/min effective (blended)&lt;/td&gt;
&lt;td&gt;$1,980&lt;/td&gt;
&lt;td&gt;$0.09&lt;/td&gt;
&lt;td&gt;$1.80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cartesia&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.0299/min&lt;/td&gt;
&lt;td&gt;$986.70&lt;/td&gt;
&lt;td&gt;$0.0449&lt;/td&gt;
&lt;td&gt;$0.897&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
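
&lt;p&gt;Both monthly figures follow from the AI’s 33,000 minutes of speech per month:&lt;/p&gt;

```python
TTS_MINUTES = 33_000
CALLS = 22_000

# ElevenLabs Business: $1,320/month including 22,000 min, then $0.06/min.
elevenlabs = 1_320 + max(0, TTS_MINUTES - 22_000) * 0.06   # $1,980
# Cartesia Scale: estimated flat rate of $0.0299/min.
cartesia = TTS_MINUTES * 0.0299                            # ~$986.70

per_call_elevenlabs = elevenlabs / CALLS                   # $0.09
per_call_cartesia = cartesia / CALLS                       # ~$0.0449
```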




&lt;h3&gt;
  
  
  🎯 Which One Should You Use?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Choose &lt;strong&gt;ElevenLabs&lt;/strong&gt; if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want &lt;strong&gt;high voice fidelity&lt;/strong&gt;, emotional range, or public-facing use.&lt;/li&gt;
&lt;li&gt;You care about building brand consistency with &lt;strong&gt;custom voices&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Choose &lt;strong&gt;Cartesia&lt;/strong&gt; if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You’re optimizing for &lt;strong&gt;cost and speed&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Expressiveness is less critical (e.g. follow-up reminders, routing flows).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Both providers are strong technically, with &lt;strong&gt;low-latency streaming&lt;/strong&gt;, voice cloning, and multilingual support out of the box.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Pricing based on public rates as of &lt;strong&gt;May 2025&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  8) SIP Trunking — Connecting to the Phone Network (Twilio vs Telnyx)
&lt;/h2&gt;

&lt;p&gt;To make and receive real phone calls, we need to connect our voice-AI system to the &lt;strong&gt;PSTN (Public Switched Telephone Network)&lt;/strong&gt;. This is where &lt;strong&gt;SIP trunking&lt;/strong&gt; comes in. It acts as the bridge between the internet and traditional phone numbers.&lt;/p&gt;

&lt;p&gt;In our setup, we evaluated two leading providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Twilio&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Telnyx&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both integrate seamlessly with &lt;strong&gt;LiveKit&lt;/strong&gt;, enabling bi-directional SIP call routing with support for outbound and inbound audio streams.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔁 Understanding the Billing: Origination vs Termination
&lt;/h3&gt;

&lt;p&gt;SIP trunking costs are typically split into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Termination&lt;/strong&gt; — outbound calls (your AI calls a user)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Origination&lt;/strong&gt; — inbound calls (users call your AI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phone number rental&lt;/strong&gt; — flat monthly rate per number&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this analysis, we assume &lt;strong&gt;outbound calling to U.S. local numbers&lt;/strong&gt; (the AI initiates the conversation). To keep the estimate conservative, the comparison below sums the termination and origination rates into a single blended per-minute cost.&lt;/p&gt;




&lt;h3&gt;
  
  
  💰 Cost Comparison: Twilio vs Telnyx
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Twilio&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Telnyx&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Termination (outbound)&lt;/td&gt;
&lt;td&gt;$0.0011/min&lt;/td&gt;
&lt;td&gt;$0.0050/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Origination (inbound)&lt;/td&gt;
&lt;td&gt;$0.0034/min&lt;/td&gt;
&lt;td&gt;$0.0035/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total per minute&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0045/min&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0085/min&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost per 3-min call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.0135&lt;/td&gt;
&lt;td&gt;$0.0255&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost per hour (20 calls)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.27&lt;/td&gt;
&lt;td&gt;$0.51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monthly cost (22,000 calls)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$297.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$561.00&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
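
&lt;p&gt;The per-minute totals in the table combine the termination and origination legs; the downstream figures are straight multiplication:&lt;/p&gt;

```python
# Per-minute SIP rates (termination + origination), May 2025 public pricing.
RATES = {
    "Twilio": 0.0011 + 0.0034,   # $0.0045/min blended
    "Telnyx": 0.0050 + 0.0035,   # $0.0085/min blended
}

CALL_MINUTES = 3
CALLS_PER_MONTH = 22_000

per_call = {p: r * CALL_MINUTES for p, r in RATES.items()}
monthly = {p: c * CALLS_PER_MONTH for p, c in per_call.items()}
```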

&lt;blockquote&gt;
&lt;p&gt;Note: Phone number rental (e.g. $1.15/month for a local number) is a small fixed cost and not included here, since it’s negligible at volume.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  📌 Summary
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Twilio&lt;/strong&gt; is more cost-effective at lower scale, with highly transparent pricing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telnyx&lt;/strong&gt; offers flexibility, more control over routing, and competitive rates at higher volumes, especially for international calls.&lt;/li&gt;
&lt;li&gt;Both are easy to integrate with &lt;strong&gt;LiveKit SIP features&lt;/strong&gt;, making them suitable choices depending on your cost or feature preferences.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Pricing based on public SIP trunking rates as of &lt;strong&gt;May 2025&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  9) Putting It All Together — Full Stack Cost Comparison
&lt;/h2&gt;

&lt;p&gt;Now that we’ve broken down each component, let’s summarize the total cost of running a fully orchestrated voice AI system. We'll compare two realistic deployment stacks:&lt;/p&gt;

&lt;h3&gt;
  
  
  🟢 Economy Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TTS:&lt;/strong&gt; Cartesia (Scale plan)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SIP Trunking:&lt;/strong&gt; Twilio&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;STT:&lt;/strong&gt; Deepgram (Nova-2 Enterprise)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM:&lt;/strong&gt; GPT-4.1 mini&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio Orchestration:&lt;/strong&gt; LiveKit (Scale plan)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔵 Premium Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TTS:&lt;/strong&gt; ElevenLabs (Business plan + overage)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SIP Trunking:&lt;/strong&gt; Telnyx&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;STT:&lt;/strong&gt; Deepgram (Nova-2 Enterprise)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM:&lt;/strong&gt; GPT-4.1 mini&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio Orchestration:&lt;/strong&gt; LiveKit (Scale plan)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  💵 Cost Comparison Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Economy Stack&lt;/th&gt;
&lt;th&gt;Premium Stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LiveKit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$563.00&lt;/td&gt;
&lt;td&gt;$563.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STT (Deepgram)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$155.10&lt;/td&gt;
&lt;td&gt;$155.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM (GPT-4.1 mini)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$42.37&lt;/td&gt;
&lt;td&gt;$42.37&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TTS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$986.70 (Cartesia)&lt;/td&gt;
&lt;td&gt;$1,980.00 (11Labs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SIP Trunking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$297.00 (Twilio)&lt;/td&gt;
&lt;td&gt;$561.00 (Telnyx)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TOTAL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$2,044.17&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$3,301.47&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
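
&lt;p&gt;Summing the component costs confirms the totals and the headline savings figure:&lt;/p&gt;

```python
CALLS = 22_000
shared = 563.00 + 155.10 + 42.37      # LiveKit + STT + LLM (same in both stacks)

economy = shared + 986.70 + 297.00    # + Cartesia + Twilio   = $2,044.17
premium = shared + 1_980.00 + 561.00  # + ElevenLabs + Telnyx = $3,301.47

per_call_economy = economy / CALLS                  # ~$0.0929
per_call_premium = premium / CALLS                  # ~$0.1500
savings = 1 - per_call_economy / per_call_premium   # ~38%
```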




&lt;h3&gt;
  
  
  🧮 Unit Economics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Economy Stack&lt;/th&gt;
&lt;th&gt;Premium Stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per call&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.0929&lt;/td&gt;
&lt;td&gt;$0.1500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Per hour (20 calls)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1.86&lt;/td&gt;
&lt;td&gt;$3.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  🏆 Which Stack Wins?
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Economy Stack&lt;/strong&gt; clearly offers &lt;strong&gt;substantial savings&lt;/strong&gt;, making it a great choice for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-volume, low-complexity workflows&lt;/li&gt;
&lt;li&gt;Prototypes or early-stage deployments&lt;/li&gt;
&lt;li&gt;Use cases where expressive TTS is not critical&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Meanwhile, the &lt;strong&gt;Premium Stack&lt;/strong&gt; is ideal when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caller experience and vocal quality are top priorities&lt;/li&gt;
&lt;li&gt;You need branded voices or enhanced emotional range&lt;/li&gt;
&lt;li&gt;You're targeting sensitive, trust-critical interactions (e.g., healthcare, finance)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both stacks are production-ready, but the &lt;strong&gt;Economy Stack costs ~38% less per call&lt;/strong&gt;, making it the winner in terms of operational efficiency.&lt;/p&gt;




&lt;h3&gt;
  
  
  📊 Visual Overview - Cost Comparison Bar Chart (Monthly Total and Per Call)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Total Monthly Cost (USD)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Cost (USD)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Economy Stack&lt;/td&gt;
&lt;td&gt;$2,044&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Premium Stack&lt;/td&gt;
&lt;td&gt;$3,301&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Cost Per Call (USD)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Cost Per Call&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Economy Stack&lt;/td&gt;
&lt;td&gt;$0.093&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Premium Stack&lt;/td&gt;
&lt;td&gt;$0.150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: All prices reflect public rates as of &lt;strong&gt;May 2025&lt;/strong&gt;. Our usage of every component exceeds its highest publicly listed pricing tier, so &lt;strong&gt;enterprise-level negotiation is likely to yield 30–50% discounts&lt;/strong&gt; at this scale.&lt;br&gt;
With those discounts, the Economy Stack could drop below &lt;strong&gt;$1,500/month&lt;/strong&gt;, and the Premium Stack below &lt;strong&gt;$2,300/month&lt;/strong&gt;, making large-scale deployment increasingly feasible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  10) Negotiating Beyond Public Pricing Tiers
&lt;/h2&gt;

&lt;p&gt;At the scale we’re modeling—&lt;strong&gt;22,000 calls per month&lt;/strong&gt;, totaling &lt;strong&gt;66,000 minutes of voice&lt;/strong&gt;, &lt;strong&gt;33,000 minutes of TTS&lt;/strong&gt;, and &lt;strong&gt;33,000 minutes of transcription&lt;/strong&gt;—&lt;strong&gt;every major component of the stack exceeds the highest publicly available pricing tier&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LiveKit&lt;/strong&gt; (Scale plan: 45,000 min included → we use 66,000)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deepgram&lt;/strong&gt; (Enterprise pricing already applies)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ElevenLabs&lt;/strong&gt; (Business plan includes 22,000 min → we use 33,000)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cartesia&lt;/strong&gt; (Scale plan rates exceeded)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Twilio / Telnyx&lt;/strong&gt; (volume usage beyond typical pay-as-you-go)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI GPT-4.1 mini&lt;/strong&gt; (high token volume, consistent monthly usage)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🧾 Why Enterprise Negotiation Matters
&lt;/h3&gt;

&lt;p&gt;When your usage becomes predictable and high-volume, vendors are often open to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Committed-use discounts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Volume-based pricing tiers&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bundled service contracts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom SLAs and support&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Discounts in the &lt;strong&gt;30%–50% range&lt;/strong&gt; are realistic, especially when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You negotiate multi-month or annual commitments&lt;/li&gt;
&lt;li&gt;You consolidate services under a single provider&lt;/li&gt;
&lt;li&gt;You become a reference customer or provide product feedback&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  💸 Recalculated Costs with ~40% Discount
&lt;/h3&gt;

&lt;p&gt;Applying a &lt;strong&gt;conservative 40% discount&lt;/strong&gt; across the stack:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack Type&lt;/th&gt;
&lt;th&gt;Full Price (Monthly)&lt;/th&gt;
&lt;th&gt;After Discount (–40%)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Economy Stack&lt;/td&gt;
&lt;td&gt;$2,044.17&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,226.50&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Premium Stack&lt;/td&gt;
&lt;td&gt;$3,301.47&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$1,980.88&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These adjusted prices bring the &lt;strong&gt;cost per call&lt;/strong&gt; down to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Economy Stack:&lt;/strong&gt; ~$0.056&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premium Stack:&lt;/strong&gt; ~$0.090&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And &lt;strong&gt;cost per hour&lt;/strong&gt; down to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Economy Stack:&lt;/strong&gt; ~$1.12&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Premium Stack:&lt;/strong&gt; ~$1.80&lt;/li&gt;
&lt;/ul&gt;
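
&lt;p&gt;For anyone adapting these assumptions, the per-call and per-hour figures follow directly from the volumes modeled above (22,000 calls and 66,000 voice minutes per month). A quick sketch to reproduce them with your own negotiated total:&lt;/p&gt;

```python
# Unit-cost helper for the article's volume assumptions:
# 22,000 calls and 66,000 voice minutes (1,100 hours) per month.
CALLS_PER_MONTH = 22_000
VOICE_MINUTES_PER_MONTH = 66_000

def unit_costs(monthly_total_usd: float) -> tuple[float, float]:
    """Return (cost per call, cost per hour) for a monthly stack total."""
    per_call = monthly_total_usd / CALLS_PER_MONTH
    per_hour = monthly_total_usd / (VOICE_MINUTES_PER_MONTH / 60)
    return per_call, per_hour

# Discounted Economy Stack ($1,226.50): ~$0.056/call, ~$1.12/hour
# Discounted Premium Stack ($1,980.88): ~$0.090/call, ~$1.80/hour
```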




&lt;h3&gt;
  
  
  ✅ Final Takeaway
&lt;/h3&gt;

&lt;p&gt;If you’re planning to scale voice-AI automation beyond a few thousand calls per month, don’t rely solely on self-serve pricing pages. Reach out to each vendor’s enterprise sales team—you may unlock significant savings that make production-scale deployment much more cost-effective than it initially appears.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;All cost assumptions based on publicly available pricing as of &lt;strong&gt;May 2025&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  11) Operational Tips &amp;amp; Optimizations
&lt;/h2&gt;

&lt;p&gt;Once your voice-AI system is up and running, there are several strategies you can apply to reduce costs, improve performance, and make the whole experience smoother—without sacrificing quality.&lt;/p&gt;

&lt;p&gt;Here are some of the most effective optimizations:&lt;/p&gt;




&lt;h3&gt;
  
  
  🧠 1. Trim the Token Window
&lt;/h3&gt;

&lt;p&gt;Language model input costs scale with conversation history. Instead of sending the full transcript on every turn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Summarize earlier turns&lt;/strong&gt; into compact memory chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove low-value exchanges&lt;/strong&gt; like “OK,” “Sure,” or greetings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use windowing strategies&lt;/strong&gt; (e.g., keep the last 3–4 turns only).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This helps reduce input token usage, especially in longer conversations.&lt;/p&gt;
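
&lt;p&gt;A minimal sketch of the windowing idea, assuming the common chat-message list format (the turn cap and filler list are illustrative values, not tuned ones):&lt;/p&gt;

```python
# Keep the system prompt, drop filler turns, and cap the remaining
# history at the last few turns before each LLM request.
FILLERS = {"ok", "okay", "sure", "thanks", "hello", "hi"}

def trim_window(messages: list[dict], max_turns: int = 6) -> list[dict]:
    """Return the system prompt plus the last `max_turns` non-filler turns."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [
        m for m in messages
        if m["role"] != "system"
        and m["content"].strip().lower().rstrip(".!?") not in FILLERS
    ]
    return system + dialogue[-max_turns:]
```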




&lt;h3&gt;
  
  
  🔇 2. Silence Trimming &amp;amp; Voice Activity Detection (VAD)
&lt;/h3&gt;

&lt;p&gt;Avoid processing and transcribing empty audio:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;Voice Activity Detection&lt;/strong&gt; to skip long silences or background noise.&lt;/li&gt;
&lt;li&gt;Trim pauses before sending audio to STT or TTS services.&lt;/li&gt;
&lt;li&gt;Detect &lt;strong&gt;barge-ins&lt;/strong&gt; (caller interrupts the bot) to pause TTS playback early.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduces billed minutes on both STT and TTS sides.&lt;/p&gt;
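
&lt;p&gt;Production systems would use a trained detector (WebRTC VAD, Silero, or the provider's built-in endpointing), but a simple energy gate shows the shape of the idea:&lt;/p&gt;

```python
import math
import struct

def is_speech(frame: bytes, threshold_rms: float = 500.0) -> bool:
    """Energy gate over a frame of 16-bit mono PCM (illustrative only)."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    rms = math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))
    return rms >= threshold_rms

def trim_silence(frames: list[bytes], threshold_rms: float = 500.0) -> list[bytes]:
    """Drop leading/trailing silent frames before they become billed minutes."""
    voiced = [is_speech(f, threshold_rms) for f in frames]
    if not any(voiced):
        return []
    start = voiced.index(True)
    end = len(voiced) - voiced[::-1].index(True)
    return frames[start:end]
```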




&lt;h3&gt;
  
  
  🧾 3. Cache the System Prompt
&lt;/h3&gt;

&lt;p&gt;OpenAI allows &lt;strong&gt;cached input tokens&lt;/strong&gt; (like a static system prompt) at a much lower rate. Make sure you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep your &lt;strong&gt;system prompt constant across requests&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Use API options that take advantage of caching when possible.&lt;/li&gt;
&lt;li&gt;Avoid resending unchanged instructions as raw text.&lt;/li&gt;
&lt;/ul&gt;
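
&lt;p&gt;In practice this mostly means ordering your prompt so the static part is a stable prefix—OpenAI's automatic caching matches on identical prompt prefixes. A hedged sketch (the message layout and prompt text are assumptions, not an official recipe):&lt;/p&gt;

```python
# Keep the static system prompt as an unchanging prefix so prefix-based
# prompt caching can apply; put volatile details (time, caller data)
# at the end, never inside the static block.
STATIC_SYSTEM_PROMPT = (
    "You are a polite phone assistant for Acme Dental. "  # hypothetical prompt
    "Confirm appointments, answer hours, and escalate anything else."
)

def build_messages(history: list[dict], caller_context: str) -> list[dict]:
    """Static prefix first, dynamic context last."""
    return (
        [{"role": "system", "content": STATIC_SYSTEM_PROMPT}]
        + list(history)
        + ([{"role": "system", "content": f"Caller context: {caller_context}"}]
           if caller_context else [])
    )
```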




&lt;h3&gt;
  
  
  💬 4. Pre-generate Common Replies
&lt;/h3&gt;

&lt;p&gt;For deterministic workflows (like confirming an appointment or collecting a yes/no), you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;pre-written text responses&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Skip the language model entirely for predictable branches&lt;/li&gt;
&lt;li&gt;Cut latency and token cost to zero on those turns&lt;/li&gt;
&lt;/ul&gt;
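
&lt;p&gt;One way to wire this up is a canned-response table checked before the model is ever called (the phrasing and the fallback hook here are hypothetical):&lt;/p&gt;

```python
# Canned replies for deterministic branches of the call flow.
CANNED = {
    "yes": "Great, you're confirmed. Anything else I can help with?",
    "no": "No problem, I've cancelled that. Anything else?",
}

def reply(user_text: str, llm_fallback) -> str:
    """Answer from the table when possible; otherwise defer to the LLM."""
    key = user_text.strip().lower().rstrip(".!?")
    if key in CANNED:
        return CANNED[key]          # zero tokens, near-zero latency
    return llm_fallback(user_text)  # unpredictable turns still hit the LLM
```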




&lt;h3&gt;
  
  
  📉 5. Committed-Use Agreements
&lt;/h3&gt;

&lt;p&gt;Once your usage stabilizes, talk to each vendor about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Volume discounts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Annual billing options&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom usage tiers&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vendors are often willing to negotiate lower prices when you commit to consistent usage or bundle multiple services.&lt;/p&gt;




&lt;h3&gt;
  
  
  🛠️ Bonus: Monitor &amp;amp; Adapt in Real Time
&lt;/h3&gt;

&lt;p&gt;Use analytics and observability tools (like SIP Insights, LiveKit metrics, or transcription confidence scores) to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spot anomalies (long silences, error spikes, dropped calls)&lt;/li&gt;
&lt;li&gt;Optimize system behavior dynamically&lt;/li&gt;
&lt;li&gt;Choose which interactions need human handoff&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;By applying even a few of these strategies, you can significantly reduce operational costs, improve response times, and deliver a more professional and polished AI voice experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  12) Conclusion — When the Numbers Make Sense, and When the Voice Matters
&lt;/h2&gt;

&lt;p&gt;Automating voice workflows isn’t about replacing people—it's about taking the repetitive, high-frequency interactions off their plates so they can focus on more meaningful work. With the right architecture and cost controls in place, voice-AI agents can handle thousands of predictable conversations efficiently and affordably.&lt;/p&gt;

&lt;h3&gt;
  
  
  📊 The Break-Even Point
&lt;/h3&gt;

&lt;p&gt;At roughly &lt;strong&gt;$0.056–$0.09 per call&lt;/strong&gt; (with enterprise pricing), you can simulate the monthly output of &lt;strong&gt;10 full-time agents&lt;/strong&gt; for &lt;strong&gt;$1,200–$2,000/month&lt;/strong&gt;. Depending on your geography, staffing model, and call volume, that’s often below the cost of a single human operator.&lt;/p&gt;

&lt;p&gt;This makes voice automation compelling for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lead qualification&lt;/li&gt;
&lt;li&gt;Appointment reminders&lt;/li&gt;
&lt;li&gt;Customer surveys&lt;/li&gt;
&lt;li&gt;Payment follow-ups&lt;/li&gt;
&lt;li&gt;Routine inbound routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Especially when those interactions follow predictable patterns or scripted flows.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔬 Where to Experiment Next
&lt;/h3&gt;

&lt;p&gt;If you're considering deploying your own voice AI assistant, the next logical steps might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Testing real customer calls with different TTS providers&lt;/li&gt;
&lt;li&gt;Measuring drop-off rates and call completion times&lt;/li&gt;
&lt;li&gt;A/B testing voice styles or model temperatures&lt;/li&gt;
&lt;li&gt;Monitoring cost per resolved interaction over time&lt;/li&gt;
&lt;li&gt;Integrating fallback routes for complex queries (human transfer, async follow-up)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Voice automation is no longer experimental—it's becoming operational. With the right balance of cost, quality, and control, you can build something that not only saves time but feels genuinely helpful to the people on the other end of the line.&lt;/p&gt;

&lt;h2&gt;
  
  
  13) Resources &amp;amp; Links
&lt;/h2&gt;

&lt;p&gt;Here’s a list of all the official pricing and documentation pages for the tools and platforms referenced throughout this article. You can refer to these for the latest rates, usage limits, and API capabilities:&lt;/p&gt;

&lt;h3&gt;
  
  
  🔷 LiveKit
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [LiveKit Pricing](https://livekit.io/pricing)
- [LiveKit Docs](https://docs.livekit.io/home/)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  🔷 Deepgram (Speech-to-Text)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [Deepgram Pricing](https://deepgram.com/pricing)  
- [Deepgram API Docs](https://developers.deepgram.com/home/introduction)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  🔷 OpenAI (GPT-4.1 mini)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [OpenAI Pricing](https://openai.com/api/pricing/)  
- [OpenAI API Docs](https://platform.openai.com/docs/overview)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  🔷 ElevenLabs (Text-to-Speech)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [ElevenLabs Pricing](https://elevenlabs.io/pricing/api)  
- [ElevenLabs Docs](https://elevenlabs.io/docs/overview)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  🔷 Cartesia
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [Cartesia Pricing](https://cartesia.ai/pricing)  
- [Cartesia API Docs](https://docs.cartesia.ai/2024-11-13/get-started/overview)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  🔷 Twilio (SIP Trunking)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [Twilio SIP Pricing](https://www.twilio.com/en-us/sip-trunking/pricing/us)  
- [Twilio Docs](https://www.twilio.com/docs/sip-trunking)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  🔷 Telnyx (SIP Trunking)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- [Telnyx SIP Pricing](https://telnyx.com/pricing/elastic-sip)  
- [Telnyx Docs](https://developers.telnyx.com/)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>ai</category>
      <category>voice</category>
      <category>cost</category>
      <category>analysis</category>
    </item>
    <item>
      <title>Cracking the &lt; 1-second Voice Loop: What We Learned After 30+ Stack Benchmarks</title>
      <dc:creator>Roman Piacquadio</dc:creator>
      <pubDate>Mon, 19 May 2025 15:09:52 +0000</pubDate>
      <link>https://dev.to/cloudx/cracking-the-1-second-voice-loop-what-we-learned-after-30-stack-benchmarks-427</link>
      <guid>https://dev.to/cloudx/cracking-the-1-second-voice-loop-what-we-learned-after-30-stack-benchmarks-427</guid>
      <description>&lt;h2&gt;
  
  
  Introduction — Why We’re Racing for &lt;em&gt;Sub-Second&lt;/em&gt; Voice Loops
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;October 2024&lt;/strong&gt; OpenAI unveiled its &lt;strong&gt;Realtime API&lt;/strong&gt;, the first end-to-end &lt;strong&gt;multimodal model&lt;/strong&gt; able to convert speech → text → reasoning → speech fast enough to feel &lt;em&gt;human&lt;/em&gt;.&lt;br&gt;&lt;br&gt;
That launch set the &lt;strong&gt;hype machine&lt;/strong&gt; spinning: “Why bother wiring three engines together when a single neural giant can do voice-to-voice in one shot?”&lt;/p&gt;

&lt;p&gt;Reality check:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pain Point&lt;/th&gt;
&lt;th&gt;Real-time Voice API&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~&lt;strong&gt;$20/hour&lt;/strong&gt; of two-way conversation — rough for contact-center scale.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Voices&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Locked to a handful of OpenAI-curated timbres; no custom cloning or branded voices.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Swapability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You wait for &lt;em&gt;their&lt;/em&gt; next model drop — can’t plug in a brand-new STT or TTS that shipped yesterday.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Meanwhile, the open-source and vendor ecosystem didn’t sit still. By mid-2025 we could stitch together &lt;strong&gt;Deepgram STT + GPT-4 Nano/Mini + Cartesia Sonic (or ElevenLabs)&lt;/strong&gt; and hit &lt;em&gt;similar&lt;/em&gt; latency &lt;strong&gt;for a fraction of the cost&lt;/strong&gt; — while choosing any voice we like.&lt;/p&gt;

&lt;p&gt;The trick is to keep every stage &lt;strong&gt;modular&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-Text (STT)&lt;/strong&gt; — use whatever recognizer is fastest or cheapest today.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large Language Model (LLM)&lt;/strong&gt; — swap Mini ↔ Nano ↔ Flash checkpoints as they evolve.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text-to-Speech (TTS)&lt;/strong&gt; — pick the voice library that matches your brand.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Enter &lt;strong&gt;&lt;a href="https://livekit.io" rel="noopener noreferrer"&gt;LiveKit&lt;/a&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The glue that lets us shuffle those &lt;strong&gt;building blocks&lt;/strong&gt; in real time is &lt;strong&gt;LiveKit&lt;/strong&gt; — a WebRTC orchestration layer with an SDK that can fan-out telephone legs, browser streams, and AI workers on the same SFU.&lt;/p&gt;

&lt;p&gt;New STT, LLM, or TTS drops on a Friday?&lt;br&gt;&lt;br&gt;
We just &lt;strong&gt;swap the block&lt;/strong&gt;, &lt;strong&gt;restart the worker&lt;/strong&gt;, and it's live by lunch.&lt;/p&gt;

&lt;p&gt;No retraining. No monolithic rebuilds. Just composable parts evolving at their own pace.&lt;/p&gt;
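
&lt;p&gt;Stripped of vendor SDKs, the modularity argument is just three interchangeable callables wired in series (these component stubs are placeholders, not real LiveKit classes):&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoicePipeline:
    """Three interchangeable stages run strictly in series (toy illustration)."""
    stt: Callable[[bytes], str]   # audio in -> transcript
    llm: Callable[[str], str]     # transcript -> reply text
    tts: Callable[[str], bytes]   # reply text -> audio out

    def run_turn(self, audio_in: bytes) -> bytes:
        # Swapping a vendor is just rebinding one of the three fields.
        return self.tts(self.llm(self.stt(audio_in)))
```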




&lt;h2&gt;
  
  
  What “Latency” Really Means (and Why It Hurts)
&lt;/h2&gt;

&lt;p&gt;Human turn-taking is &lt;em&gt;fast&lt;/em&gt;. Large-scale multilingual studies show that the &lt;strong&gt;median inter-turn gap is ≈ 200 ms&lt;/strong&gt;, but the range spans from as low as &lt;strong&gt;7 ms&lt;/strong&gt; (in Japanese) to over &lt;strong&gt;440 ms&lt;/strong&gt; (in Danish), depending on the language, sentence structure, and context of the exchange &lt;a href="https://arxiv.org/pdf/2404.16053v1" rel="noopener noreferrer"&gt;[1]&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
A replication focused on English measured an average gap of &lt;strong&gt;236 ms ± 520 ms SD&lt;/strong&gt;, confirming that even within a single language, there’s wide variance depending on interaction type and formality.&lt;/p&gt;

&lt;p&gt;When the silence between turns stretches, our perception shifts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;One-way gap&lt;/th&gt;
&lt;th&gt;How it feels&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; ≈ 400 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Still “natural”, but you notice a beat.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;gt; ≈ 400 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ITU-T G.114 flags this as &lt;em&gt;unacceptable&lt;/em&gt; for conversational quality.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;gt; ≈ 600–700 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Most people label the call “robotic” or “satellite-delayed”.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These reference points form the benchmark we’re chasing:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;get the bot’s first syllable inside the ~400 ms comfort zone&lt;/strong&gt;—or, at the very least, close enough that the pause doesn’t break the conversational rhythm.&lt;/p&gt;




&lt;h2&gt;
  
  
  Anatomy of a Voice Pipeline
&lt;/h2&gt;

&lt;p&gt;A real-time loop has &lt;strong&gt;three streaming stages&lt;/strong&gt; that run strictly in series:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Latency metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STT – Speech-to-Text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Turns audio frames into text tokens.&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Final transcript time&lt;/em&gt; (but with proper streaming this is ≈ 0 ms relative to the next stage).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM – Large Language Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Crafts the reply.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TTFT (Time to First Token):&lt;/strong&gt; delay between sending the prompt and receiving the &lt;em&gt;first&lt;/em&gt; generated token.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TTS – Text-to-Speech&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Voices the reply.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;TTFB (Time to First Byte):&lt;/strong&gt; delay between sending the text and receiving the first playable PCM chunk.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key observation:&lt;/strong&gt; in every stack we measured, &lt;strong&gt;LLM TTFT + TTS TTFB account for 90 %+ of total loop time&lt;/strong&gt;; with streaming recognizers, STT is effectively negligible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All three stages run in streaming inference — we start passing tokens or audio frames downstream the moment we see them.&lt;/p&gt;
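
&lt;p&gt;With streaming STT contributing roughly nothing, the first audible byte is approximately the sum of the two remaining metrics—handy for budgeting a candidate stack before benchmarking it:&lt;/p&gt;

```python
# First-byte budget: with streaming STT (≈ 0 ms relative to the LLM),
# the caller hears the first syllable after about TTFT + TTFB.
def first_byte_latency(llm_ttft_s: float, tts_ttfb_s: float,
                       stt_s: float = 0.0) -> float:
    """Approximate one-way loop latency in seconds."""
    return stt_s + llm_ttft_s + tts_ttfb_s

# e.g. a 0.30 s TTFT model with a 0.43 s TTFB voice ≈ 0.73 s to first byte
```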




&lt;h2&gt;
  
  
  The Latency / Quality / Cost Triangle
&lt;/h2&gt;

&lt;p&gt;Push one corner, the others move:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lower latency ⇢&lt;/strong&gt; smaller / quantized models, “good-enough” neural voices.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher quality ⇢&lt;/strong&gt; bigger LLMs, premium TTS; usually slower.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower cost ⇢&lt;/strong&gt; open-source or micro-models; may ding both speed &lt;em&gt;and&lt;/em&gt; fidelity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our job is to find the &lt;em&gt;quickest&lt;/em&gt; loop that still sounds customer-ready and doesn’t torch the budget.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6m969l4eot7v7un154x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6m969l4eot7v7un154x.png" alt="Latency / Quality / Cost Triangle" width="307" height="307"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How We Benchmarked
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Same system prompt in &lt;strong&gt;English&lt;/strong&gt; &lt;em&gt;and&lt;/em&gt; &lt;strong&gt;Spanish&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Dozens of &lt;strong&gt;STT + LLM + TTS combinations&lt;/strong&gt; (cloud &amp;amp; OSS); the table below shows the top performers.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiveKit&lt;/strong&gt; measured STT duration, TTFT, TTFB on every turn.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A few things we learned fast
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;LLMs &amp;amp; TTS slow down outside English.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A long system prompt only punishes the &lt;strong&gt;first&lt;/strong&gt; turn (~ +300 ms); later turns ride the KV-cache.
&lt;/li&gt;
&lt;li&gt;The newest “nano” LLMs plus an ultra-fast TTS can get that &lt;strong&gt;first syllable under 800 ms&lt;/strong&gt;, scraping the human comfort ceiling.&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;STT&lt;/th&gt;
&lt;th&gt;LLM (version)&lt;/th&gt;
&lt;th&gt;TTS&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;TTFT (1st / next)&lt;/th&gt;
&lt;th&gt;TTS TTFB&lt;/th&gt;
&lt;th&gt;First Byte Latency*&lt;/th&gt;
&lt;th&gt;Tokens/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Whisper-1 (no stream)&lt;/td&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;0.34 / 0.34 s&lt;/td&gt;
&lt;td&gt;0.42–0.47 s&lt;/td&gt;
&lt;td&gt;3.1–3.9 s&lt;/td&gt;
&lt;td&gt;19–48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;0.31–1.63 / 0.31–0.45 s&lt;/td&gt;
&lt;td&gt;0.35–0.46 s&lt;/td&gt;
&lt;td&gt;0.7–2.1 s&lt;/td&gt;
&lt;td&gt;9–23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;0.31–0.44 / 0.31–0.40 s&lt;/td&gt;
&lt;td&gt;0.40–0.59 s&lt;/td&gt;
&lt;td&gt;0.71–1.03 s&lt;/td&gt;
&lt;td&gt;13–67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;ES&lt;/td&gt;
&lt;td&gt;0.77–1.33 / 0.75–0.95 s&lt;/td&gt;
&lt;td&gt;0.56–0.69 s&lt;/td&gt;
&lt;td&gt;1.33–2.02 s&lt;/td&gt;
&lt;td&gt;29–38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;Gemini 1.5 Flash&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;0.45–0.76 / 0.35–0.55 s&lt;/td&gt;
&lt;td&gt;0.45–0.70 s&lt;/td&gt;
&lt;td&gt;1.2–1.5 s&lt;/td&gt;
&lt;td&gt;40–85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;Gemini 1.5 Flash&lt;/td&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;ES&lt;/td&gt;
&lt;td&gt;1.30–2.37 / 1.10–1.40 s&lt;/td&gt;
&lt;td&gt;0.46–0.69 s&lt;/td&gt;
&lt;td&gt;1.8–3.0 s&lt;/td&gt;
&lt;td&gt;25–58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-2&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;1.22–1.41 / 0.42–0.90 s&lt;/td&gt;
&lt;td&gt;0.43–0.45 s&lt;/td&gt;
&lt;td&gt;1.65–1.86 s&lt;/td&gt;
&lt;td&gt;23–46&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-2&lt;/td&gt;
&lt;td&gt;ES&lt;/td&gt;
&lt;td&gt;0.74–1.38 / 0.70–0.90 s&lt;/td&gt;
&lt;td&gt;0.48–0.52 s&lt;/td&gt;
&lt;td&gt;1.22–1.90 s&lt;/td&gt;
&lt;td&gt;22–42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-Turbo&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;1.15–1.24 / 0.44–0.65 s&lt;/td&gt;
&lt;td&gt;0.38–0.41 s&lt;/td&gt;
&lt;td&gt;1.53–1.65 s&lt;/td&gt;
&lt;td&gt;17–45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-mini&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-Turbo&lt;/td&gt;
&lt;td&gt;ES&lt;/td&gt;
&lt;td&gt;0.75–1.11 / 0.30–0.40 s&lt;/td&gt;
&lt;td&gt;0.43–0.46 s&lt;/td&gt;
&lt;td&gt;1.18–1.57 s&lt;/td&gt;
&lt;td&gt;31–51&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;Gemini 1.5 Flash&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-Turbo&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;1.19–1.27 / 1.19–1.27 s&lt;/td&gt;
&lt;td&gt;0.40–0.43 s&lt;/td&gt;
&lt;td&gt;1.59–1.70 s&lt;/td&gt;
&lt;td&gt;12–44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;Gemini 1.5 Flash&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-Turbo&lt;/td&gt;
&lt;td&gt;ES&lt;/td&gt;
&lt;td&gt;1.28–1.39 / 1.00–1.10 s&lt;/td&gt;
&lt;td&gt;0.42–0.44 s&lt;/td&gt;
&lt;td&gt;1.70–1.83 s&lt;/td&gt;
&lt;td&gt;40–56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-nano&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-Turbo&lt;/td&gt;
&lt;td&gt;EN&lt;/td&gt;
&lt;td&gt;0.90–0.97 / 0.30–0.40 s&lt;/td&gt;
&lt;td&gt;0.42–0.52 s&lt;/td&gt;
&lt;td&gt;0.73–1.45 s&lt;/td&gt;
&lt;td&gt;40–105&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;GPT-4.1-nano&lt;/td&gt;
&lt;td&gt;Cartesia Sonic-Turbo&lt;/td&gt;
&lt;td&gt;ES&lt;/td&gt;
&lt;td&gt;1.00–1.07 / 0.26–0.40 s&lt;/td&gt;
&lt;td&gt;0.43–0.50 s&lt;/td&gt;
&lt;td&gt;0.75–1.53 s&lt;/td&gt;
&lt;td&gt;70–116&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What the Numbers Tell Us
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. First-Turn Overhead Is Real
&lt;/h3&gt;

&lt;p&gt;Every stack shows a &lt;strong&gt;heavier first turn&lt;/strong&gt; because the LLM must ingest the entire system prompt before it can cache the KV-state.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Example:&lt;/em&gt; in the &lt;strong&gt;GPT-4 Mini + Sonic-2 (EN)&lt;/strong&gt; stack the first TTFT clocks at &lt;strong&gt;≈ 1.22 s&lt;/strong&gt;, but subsequent turns fall to &lt;strong&gt;≈ 0.42–0.90 s&lt;/strong&gt;. The “prompt tax” is ~300–800 ms, and it vanishes after turn 2 because the model re-uses its internal cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. We’re Getting Closer to Human Latency — But Not Quite There Yet
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human comfort band:&lt;/strong&gt; ~0.1–0.4 s one-way; anything above &lt;strong&gt;0.6–0.7 s&lt;/strong&gt; starts to feel "robotic."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best first syllable today:&lt;/strong&gt; &lt;strong&gt;0.73 s&lt;/strong&gt; (GPT-4 Nano + Sonic-Turbo, EN) and &lt;strong&gt;0.75 s&lt;/strong&gt; (same stack, ES).
&lt;em&gt;That’s still above the ≈ 400 ms ceiling ITU-T G.114 sets for conversational quality—and several times a natural inter-turn gap—but the distance is closing fast.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Second turn latency:&lt;/strong&gt; Since TTFT drops to &lt;strong&gt;0.26–0.40 s&lt;/strong&gt; and TTFB remains around &lt;strong&gt;0.43 s&lt;/strong&gt;, many loops land &lt;strong&gt;just under 0.7–0.8 s&lt;/strong&gt;—close enough that most users don’t perceive a delay.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. English Still Wins the Speed Race
&lt;/h3&gt;

&lt;p&gt;Across the board, Spanish incurs an extra &lt;strong&gt;+300–500 ms&lt;/strong&gt; in TTFT, and often a few additional milliseconds in TTFB.&lt;br&gt;&lt;br&gt;
This isn't surprising: most language models are trained on English-dominant datasets, and their tokenizers are optimized for English morphology. That means fewer tokens per word, higher-confidence predictions, and faster decoding paths.  &lt;/p&gt;

&lt;p&gt;In contrast, other languages often lead to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More tokens per sentence (due to suboptimal tokenization),&lt;/li&gt;
&lt;li&gt;Less frequent vocabulary (slower logits resolution),&lt;/li&gt;
&lt;li&gt;Slightly longer prompts (higher input load),&lt;/li&gt;
&lt;li&gt;And more uncertainty during generation (costlier decoding).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model providers are still actively optimizing multilingual performance—but for now, English remains the latency benchmark.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. STT Is a Non-Issue (When Streamed)
&lt;/h3&gt;

&lt;p&gt;Deepgram’s streaming mode continually emits tokens, so by the time the user finishes speaking the transcript is already done. &lt;strong&gt;&amp;lt; 5 ms&lt;/strong&gt; in our logs—effectively zero.&lt;/p&gt;




&lt;h2&gt;
  
  
  Which Stack for Whom?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;Stack to Watch&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ultra-low latency (&amp;lt; 0.8 s first byte)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;GPT-4 Nano + Cartesia Sonic-Turbo&lt;/strong&gt; (rows 13–14)&lt;/td&gt;
&lt;td&gt;Fastest TTFT (&amp;lt; 1 s first turn, &amp;lt; 0.40 s thereafter). Great for IVRs, live game NPCs, or any app where “snappiness” beats eloquence. Expect slightly terser, less nuanced language.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Balanced latency &amp;amp; quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;GPT-4 Mini + Cartesia Sonic-2 / Sonic-Turbo&lt;/strong&gt; (rows 7–10)&lt;/td&gt;
&lt;td&gt;Adds ~150-250 ms but yields noticeably richer wording and better reasoning. Sweet spot for customer support or sales calls where tone matters.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language coverage beyond English&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Mini or Nano stacks + Sonic-Turbo (ES)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Spanish numbers are catching up; Sonic voices remain natural and the Nano drop still delivers TTFT &amp;lt; 1.1 s.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Premium voice fidelity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;ElevenLabs + Mini stacks&lt;/strong&gt; (rows 1–4)&lt;/td&gt;
&lt;td&gt;Neural voices lead the market in prosody; latency penalty is ~0.05–0.1 s vs. Sonic-Turbo—fine for podcasts, high-touch brand experiences.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(Quality judgments are subjective; we used blind A/B tests on 30 clips per stack.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions &amp;amp; Near-Term Outlook
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Composable beats monolithic—today.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Because STT, LLM, and TTS evolve on different cadences, a modular pipeline lets you upgrade components the moment something faster drops—unlike monolithic models, where you must wait for the next provider release.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sub-second voice loops are already viable&lt;/strong&gt; for English and edging in for Spanish. With smarter caching, phoneme-level streaming, and incremental TTS we expect &lt;strong&gt;&amp;lt; 500 ms&lt;/strong&gt; within a year.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model shrinkage will continue.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
“Nano” and “flash” checkpoints show that aggressive distillation + quantization can keep quality “good enough” while halving latency every generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Edge deployment is accelerating.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Thanks to aggressive quantization (8-bit and even 4-bit), large language and speech models are now deployable on local hardware—consumer GPUs, mobile NPUs, and even embedded systems. This allows parts of the voice loop to run &lt;strong&gt;on-device&lt;/strong&gt;, cutting out network delays and shaving &lt;strong&gt;50–150 ms&lt;/strong&gt; off total latency.&lt;br&gt;&lt;br&gt;
&lt;a href="https://substack.com/home/post/p-160808933?source=queue&amp;amp;utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Source: “AI Voice Inference at the Edge is Finally Here,” &lt;em&gt;VoiceTech Insights, 2025&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Joint LLM-TTS training is emerging.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A new generation of end-to-end speech models is beginning to bypass traditional TTS stages entirely. These models, like &lt;strong&gt;VITA-Audio&lt;/strong&gt; (2025), predict multiple audio tokens in a single step, generating speech directly from text while drastically reducing inference time. Once stable in streaming mode, these architectures could cut TTS latency to &lt;strong&gt;mere milliseconds&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
&lt;a href="https://arxiv.org/abs/2505.03739" rel="noopener noreferrer"&gt;Source: “VITA-Audio: Parallel Token-to-Audio Generation with Context-Aware Semantic Guidance,” &lt;em&gt;arXiv, May 2025&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
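&lt;p&gt;To make the quantization point concrete, here is a toy 8-bit affine quantizer in plain Python. It illustrates only the storage math (no production toolchain works this simply): each float32 weight collapses to one byte, a 4x size cut, at the cost of a rounding error bounded by the quantization step:&lt;/p&gt;

```python
def quantize_int8(weights):
    """Map floats onto the integer range 0..255 with a scale and offset.

    Toy affine quantization: stores each weight in 1 byte instead of the
    4 bytes of float32, at the cost of a bounded rounding error.
    """
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / 255 or 1.0  # guard against all-equal weights
    q = [round((w - w_min) / scale) for w in weights]
    restored = [v * scale + w_min for v in q]
    return q, scale, restored

weights = [-0.42, 0.07, 0.31, -0.18, 0.25]
q, scale, restored = quantize_int8(weights)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

&lt;p&gt;Real 4-bit schemes push the same trade further, which is part of why smaller checkpoints decode faster on-device: less memory traffic per token.&lt;/p&gt;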

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; We’re only a few iteration cycles away from voice agents that &lt;em&gt;consistently&lt;/em&gt; reply in the same temporal rhythm as humans. If you build with LiveKit-style modular pipelines today, you can ride that curve with an overnight adjustment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stay tuned—the sub-half-second voice loop is closer than most teams think.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>livekit</category>
      <category>voice</category>
    </item>
    <item>
      <title>Building Voice AI Agents with the OpenAI Agents SDK</title>
      <dc:creator>Roman Piacquadio</dc:creator>
      <pubDate>Mon, 05 May 2025 12:49:15 +0000</pubDate>
      <link>https://dev.to/cloudx/building-voice-ai-agents-with-the-openai-agents-sdk-2aog</link>
      <guid>https://dev.to/cloudx/building-voice-ai-agents-with-the-openai-agents-sdk-2aog</guid>
      <description>&lt;h2&gt;
  
  
  Beyond Single Turns: OpenAI Enters the Voice Agent Arena
&lt;/h2&gt;

&lt;p&gt;In our previous post, &lt;a href="https://dev.to/cloudx/building-multi-agent-conversations-with-webrtc-livekit-48f1"&gt;Building Multi-Agent Conversations with WebRTC &amp;amp; LiveKit&lt;/a&gt;, we explored how to create complex, multi-stage voice interactions using the real-time power of WebRTC and the orchestration capabilities of the LiveKit Agents framework. We saw how crucial low latency and effective state management are for natural conversations, especially when handing off between different agent roles.&lt;/p&gt;

&lt;p&gt;Recently, OpenAI has significantly enhanced its offerings for building agentic systems, including dedicated tools and SDKs for creating voice agents. While the core concept of chaining Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS) remains, OpenAI now provides more integrated primitives and an SDK designed to simplify this process, particularly within their ecosystem.&lt;/p&gt;

&lt;p&gt;This article dives into building voice agents using the OpenAI Agents SDK. We'll examine its architecture, walk through a Python example, and critically compare this approach with the LiveKit method discussed previously, highlighting the strengths, weaknesses, and ideal use cases for each.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenAI's Vision for Agents: Primitives and Orchestration
&lt;/h2&gt;

&lt;p&gt;OpenAI positions its platform as a set of composable primitives for building agents, covering domains like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Models:&lt;/strong&gt; Core intelligence (GPT-4o, the latest GPT-4.1 and GPT-4.1-mini, etc.) capable of reasoning and handling multimodality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; Interfaces to the outside world, including developer-defined function calling, built-in web search, file search, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge &amp;amp; Memory:&lt;/strong&gt; Using Vector Stores and Embeddings for context and persistence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio &amp;amp; Speech:&lt;/strong&gt; Primitives for understanding and generating voice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails:&lt;/strong&gt; Moderation and instruction hierarchy for safety and control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration:&lt;/strong&gt; The Agents SDK, Tracing, Evaluations, and Fine-tuning to manage the agent lifecycle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;strong&gt;Voice Agents&lt;/strong&gt;, OpenAI presents two main architectural paths:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-Speech (Multimodal - Realtime API):&lt;/strong&gt; Uses models like &lt;code&gt;gpt-4o-realtime-preview&lt;/code&gt; that process audio input directly and generate audio output, aiming for the lowest latency and understanding vocal nuances. This uses a specific Realtime API separate from the main Chat Completions API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chained (Agents SDK + Voice):&lt;/strong&gt; The more traditional STT → LLM → TTS flow, but orchestrated using the &lt;code&gt;openai-agents&lt;/code&gt; SDK with its &lt;code&gt;[voice]&lt;/code&gt; extension. This provides more transparency (text transcripts at each stage) and control, making it easier to integrate into existing text-based agent workflows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;This post will focus on the Chained architecture using the OpenAI Agents SDK, as it aligns more closely with common agent development patterns and provides a clearer comparison point to the plugin-based approach of LiveKit.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The OpenAI Agents SDK: Simplifying Agent Logic
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;openai-agents&lt;/code&gt; Python SDK aims to provide a lightweight way to build agents with a few core concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; An LLM equipped with instructions, tools, and potentially knowledge about when to hand off tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handoffs:&lt;/strong&gt; A mechanism allowing one agent to delegate tasks to another, more specialized agent. Agents are configured with a list of potential agents they can hand off to.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools (&lt;code&gt;@function_tool&lt;/code&gt;):&lt;/strong&gt; Decorator to easily expose Python functions to the agent, similar to standard OpenAI function calling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails:&lt;/strong&gt; Functions to validate inputs or outputs and enforce constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runner:&lt;/strong&gt; Executes the agent logic, handling the loop of calling the LLM, executing tools, and managing handoffs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VoicePipeline (with [voice] extra):&lt;/strong&gt; Wraps an agent workflow (like one using Runner) to handle the STT and TTS parts of a voice interaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The philosophy is "Python-first," relying on Python's built-in features for orchestration rather than introducing many complex abstractions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture with OpenAI Agents SDK (Chained Voice)
&lt;/h2&gt;

&lt;p&gt;When using the &lt;code&gt;VoicePipeline&lt;/code&gt; from the SDK, the typical flow for a voice turn looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audio Input:&lt;/strong&gt; Raw audio data (e.g., from a microphone) is captured.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VoicePipeline (STT):&lt;/strong&gt; The pipeline receives audio chunks. It uses an OpenAI STT model (like &lt;code&gt;gpt-4o-transcribe&lt;/code&gt; via the API) to transcribe the user's speech into text once speech ends (or via push-to-talk).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Workflow Execution (&lt;code&gt;MyWorkflow.run&lt;/code&gt; in the example):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The transcribed text is passed to your defined workflow (e.g., a class inheriting from &lt;code&gt;VoiceWorkflowBase&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Inside the workflow, the &lt;code&gt;Runner&lt;/code&gt; is invoked with the current &lt;code&gt;Agent&lt;/code&gt;, conversation history, and the new user text.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Agent&lt;/code&gt; (LLM) decides whether to respond directly, call a &lt;code&gt;Tool&lt;/code&gt; (function), or &lt;code&gt;Handoff&lt;/code&gt; to another agent based on its instructions and the user input.&lt;/li&gt;
&lt;li&gt;If a tool is called, the &lt;code&gt;Runner&lt;/code&gt; executes the Python function and sends the result back to the LLM.&lt;/li&gt;
&lt;li&gt;If a handoff occurs, the &lt;code&gt;Runner&lt;/code&gt; switches context to the new agent.&lt;/li&gt;
&lt;li&gt;The LLM generates the text response.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VoicePipeline (TTS):&lt;/strong&gt; The final text response from the agent workflow is sent to an OpenAI TTS model (e.g., &lt;code&gt;gpt-4o-mini-tts&lt;/code&gt;) via the API to generate audio.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio Output:&lt;/strong&gt; The generated audio data is streamed back to be played to the user.&lt;/li&gt;
&lt;/ol&gt;
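&lt;p&gt;The four-stage turn above can be sketched as three plain functions wired together. The stubs below are hypothetical stand-ins, not SDK calls; in a real pipeline each would hit the corresponding OpenAI API (&lt;code&gt;gpt-4o-transcribe&lt;/code&gt;, the agent workflow, &lt;code&gt;gpt-4o-mini-tts&lt;/code&gt;):&lt;/p&gt;

```python
def stt(audio: bytes) -> str:
    """Stand-in for speech-to-text (step 2); echoes bytes back as text."""
    return audio.decode("utf-8")

def think(text: str) -> str:
    """Stand-in for the agent workflow (step 3): Runner, tools, handoffs."""
    return f"You said: {text}"

def tts(text: str) -> bytes:
    """Stand-in for text-to-speech (step 4)."""
    return text.encode("utf-8")

def voice_turn(audio_in: bytes) -> bytes:
    """One chained turn: audio in, transcript, response text, audio out."""
    return tts(think(stt(audio_in)))

out = voice_turn(b"hello there")
```

&lt;p&gt;The value of the chained shape is exactly this composability: any stage can be swapped or logged because every boundary is plain text.&lt;/p&gt;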

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fxvbdr20c6orgccvawj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fxvbdr20c6orgccvawj.png" alt="VoicePipeline workflow" width="800" height="404"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;(Diagram: Microphone feeds audio to VoicePipeline for STT. Text goes to Agent Workflow (using Runner, Agent, Tools, Handoffs). Text response goes back to VoicePipeline for TTS, then to Speaker.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This contrasts with the LiveKit architecture where WebRTC handles the audio transport layer directly, and the &lt;code&gt;livekit-agents&lt;/code&gt; framework integrates STT/LLM/TTS plugins into that real-time stream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Build: The Multi-Lingual Assistant (Python Example)
&lt;/h2&gt;

&lt;p&gt;Let's break down the key parts of the official OpenAI Agents SDK voice example. (Link to the repository will be at the end).&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8+&lt;/li&gt;
&lt;li&gt;OpenAI API Key.&lt;/li&gt;
&lt;li&gt;Install the SDK with voice extras:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"openai-agents[voice]"&lt;/span&gt; sounddevice numpy python-dotenv textual &lt;span class="c"&gt;# For the demo UI&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Setup (.env file)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env&lt;/span&gt;
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. Core Agent Logic (&lt;code&gt;my_workflow.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;This file defines the agents and the workflow logic that runs after speech is transcribed to text and before the response text is sent for synthesis.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Imports:&lt;/strong&gt; Necessary components from &lt;code&gt;agents&lt;/code&gt; SDK (&lt;code&gt;Agent&lt;/code&gt;, &lt;code&gt;Runner&lt;/code&gt;, &lt;code&gt;function_tool&lt;/code&gt;, &lt;code&gt;VoiceWorkflowBase&lt;/code&gt;, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Definition (&lt;code&gt;get_weather&lt;/code&gt;):&lt;/strong&gt; A simple Python function decorated with &lt;code&gt;@function_tool&lt;/code&gt; to make it callable by the &lt;code&gt;agent&lt;/code&gt;. The SDK handles generating the schema for the LLM.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections.abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncIterator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TResponseInputItem&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;function_tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents.extensions.handoff_prompt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;prompt_with_handoff_instructions&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents.voice&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VoiceWorkflowBase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VoiceWorkflowHelper&lt;/span&gt;

&lt;span class="nd"&gt;@function_tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get the weather for a given city.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[debug] get_weather called with city: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;choices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sunny&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rainy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;snowy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The weather in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent Definitions (&lt;code&gt;spanish_agent&lt;/code&gt;, &lt;code&gt;agent&lt;/code&gt;):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Each &lt;code&gt;Agent&lt;/code&gt; is created with a &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;instructions&lt;/code&gt; (using a helper &lt;code&gt;prompt_with_handoff_instructions&lt;/code&gt; to guide its behavior regarding handoffs), a &lt;code&gt;model&lt;/code&gt;, and optionally &lt;code&gt;tools&lt;/code&gt; it can use and other &lt;code&gt;handoffs&lt;/code&gt; it can initiate.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;handoff_description&lt;/code&gt; helps the calling agent decide which agent to hand off to.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spanish_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Spanish&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handoff_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A spanish speaking agent.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;prompt_with_handoff_instructions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re speaking to a human, so be polite and concise. Speak in Spanish.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;prompt_with_handoff_instructions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re speaking to a human, so be polite and concise. If the user speaks in Spanish, handoff to the spanish agent.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;handoffs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;spanish_agent&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;# List of agents it can hand off to
&lt;/span&gt;    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;      &lt;span class="c1"&gt;# List of tools it can use
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workflow Class (&lt;code&gt;MyWorkflow&lt;/code&gt;):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Inherits from &lt;code&gt;VoiceWorkflowBase&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;__init__&lt;/code&gt;: Stores configuration (like the &lt;code&gt;secret_word&lt;/code&gt; for a simple game logic) and maintains state like conversation history (&lt;code&gt;_input_history&lt;/code&gt;) and the currently active agent (&lt;code&gt;_current_agent&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;run(transcription: str)&lt;/code&gt;: This is the core method called by the &lt;code&gt;VoicePipeline&lt;/code&gt; after STT.&lt;/li&gt;
&lt;li&gt;It receives the user's transcribed text.&lt;/li&gt;
&lt;li&gt;Updates the conversation history.&lt;/li&gt;
&lt;li&gt;Contains custom logic (like checking for the secret word).&lt;/li&gt;
&lt;li&gt;Invokes &lt;code&gt;Runner.run_streamed&lt;/code&gt; with the current agent and history. This handles the interaction with the LLM, tool calls, and potential handoffs based on the agent's configuration.&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;VoiceWorkflowHelper.stream_text_from&lt;/code&gt; to yield text chunks as they are generated by the LLM (enabling faster TTS start).&lt;/li&gt;
&lt;li&gt;Updates the history and potentially the &lt;code&gt;_current_agent&lt;/code&gt; based on the &lt;code&gt;Runner&lt;/code&gt;'s result (if a handoff occurred).
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyWorkflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VoiceWorkflowBase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secret_word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="c1"&gt;# ... (init stores history, current_agent, secret_word, callback) ...
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_input_history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TResponseInputItem&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_current_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_secret_word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secret_word&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_on_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;on_start&lt;/span&gt; &lt;span class="c1"&gt;# Callback for UI updates
&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AsyncIterator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_on_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Call the UI callback
&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_input_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_secret_word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="c1"&gt;# Custom logic example
&lt;/span&gt;            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You guessed the secret word!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="c1"&gt;# ... (update history) ...
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="c1"&gt;# Run the agent logic using the Runner
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_streamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_current_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_input_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Stream text chunks for faster TTS
&lt;/span&gt;        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;VoiceWorkflowHelper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream_text_from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;

        &lt;span class="c1"&gt;# Update state for the next turn
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_input_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_input_list&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_current_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_agent&lt;/span&gt; &lt;span class="c1"&gt;# Agent might have changed via handoff
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Client &amp;amp; Pipeline Setup (&lt;code&gt;main.py&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;This file sets up a simple Textual-based UI and manages the audio input/output and the &lt;code&gt;VoicePipeline&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It initializes &lt;code&gt;sounddevice&lt;/code&gt; for microphone input and speaker output.&lt;/li&gt;
&lt;li&gt;Creates the &lt;code&gt;VoicePipeline&lt;/code&gt;, passing in the &lt;code&gt;MyWorkflow&lt;/code&gt; instance.&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;StreamedAudioInput&lt;/code&gt; to feed microphone data into the pipeline.&lt;/li&gt;
&lt;li&gt;Starts the pipeline using &lt;code&gt;pipeline.run(self._audio_input)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Asynchronously iterates over &lt;code&gt;result.stream()&lt;/code&gt; to:

&lt;ul&gt;
&lt;li&gt;Play back audio chunks (&lt;code&gt;voice_stream_event_audio&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Display lifecycle events or transcriptions in the UI.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Handles starting/stopping recording based on key presses ('k').
&lt;em&gt;(Note: We won't dive deep into the Textual UI code here, focusing instead on the agent interaction pattern.)&lt;/em&gt;
&lt;/li&gt;

&lt;/ul&gt;
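&lt;p&gt;The push-to-talk behavior is easy to reason about as a tiny state machine. The class below is a hypothetical sketch, not the demo's actual Textual code: each press of the hotkey flips between recording and idle, and audio chunks are only buffered while recording.&lt;/p&gt;

```python
class PushToTalk:
    """Minimal toggle state machine mirroring the demo's 'k'-key behavior."""

    def __init__(self):
        self.recording = False
        self.buffer = []

    def press_key(self):
        """Flip between idle and recording; a new recording clears the buffer."""
        self.recording = not self.recording
        if self.recording:
            self.buffer = []
        return self.recording

    def on_audio_chunk(self, chunk):
        """Buffer microphone data only while recording is active."""
        if self.recording:
            self.buffer.append(chunk)

ptt = PushToTalk()
ptt.on_audio_chunk(b"ignored")   # arrives before the first key press
ptt.press_key()                  # start recording
ptt.on_audio_chunk(b"hello ")
ptt.on_audio_chunk(b"world")
ptt.press_key()                  # stop; buffer is ready for the pipeline
utterance = b"".join(ptt.buffer)
```

&lt;p&gt;The real demo wires the same idea into the Textual event loop, feeding the captured audio into the pipeline via &lt;code&gt;StreamedAudioInput&lt;/code&gt;.&lt;/p&gt;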

&lt;h2&gt;
  
  
  5. Running the Example
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ensure &lt;code&gt;.env&lt;/code&gt; is set up.&lt;/li&gt;
&lt;li&gt;Run the main script: &lt;code&gt;python main.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Press &lt;code&gt;k&lt;/code&gt; to start recording, speak, then press &lt;code&gt;k&lt;/code&gt; again to stop. The agent should respond.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Comparing Approaches: OpenAI Agents SDK vs. LiveKit Agents
&lt;/h2&gt;

&lt;p&gt;Both frameworks allow building sophisticated voice agents with multiple roles, but they excel in different areas due to their underlying philosophies and technologies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OpenAI Agents SDK (Chained Voice)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;LiveKit Agents Framework (WebRTC)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core Technology&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🐍 Python SDK orchestrating OpenAI APIs (STT, LLM, TTS)&lt;/td&gt;
&lt;td&gt;🌐 Python Framework built on LiveKit &amp;amp; WebRTC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Higher (API calls for STT, LLM, TTS per turn)&lt;/td&gt;
&lt;td&gt;✅ Lower (Direct WebRTC streaming, optimized for voice)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time Audio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Handled by SDK (VoicePipeline), abstracts away the stream&lt;/td&gt;
&lt;td&gt;✅ Core feature via WebRTC, fine-grained control possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Setup Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Generally Lower (mainly SDK install &amp;amp; API keys)&lt;/td&gt;
&lt;td&gt;⚠️ Higher (Requires LiveKit server setup/cloud account)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STT/TTS Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Primarily uses OpenAI models via API.&lt;/td&gt;
&lt;td&gt;✅ Plugin-based (OpenAI, Deepgram, Google, etc.) easy swap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Uses OpenAI models via API.&lt;/td&gt;
&lt;td&gt;✅ Plugin-based (OpenAI, Anthropic, Local models, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interruption Handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Not built-in for StreamedAudioInput. Requires manual implementation listening to lifecycle events.&lt;/td&gt;
&lt;td&gt;✅ Built-in using VAD plugins (e.g., Silero).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Managed within Python workflow (e.g., list history)&lt;/td&gt;
&lt;td&gt;✅ Explicit userdata on AgentSession, shared state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-Agent Handoff&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Declarative (handoffs list in Agent)&lt;/td&gt;
&lt;td&gt;⚠️ Imperative (Agent function returns next agent instance)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ecosystem&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Integrated with OpenAI Tracing, Evals, Fine-tuning.&lt;/td&gt;
&lt;td&gt;⚠️ Focused on real-time communication infrastructure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️ Depends on Python deployment &amp;amp; API limits.&lt;/td&gt;
&lt;td&gt;✅ Built on scalable WebRTC infrastructure (LiveKit).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Note on OpenAI Realtime API:&lt;/strong&gt; OpenAI does offer the gpt-4o-realtime-preview model via a separate Realtime API for true speech-to-speech with potentially very low latency. However, this is a different architecture than the Agents SDK VoicePipeline discussed here, uses specific models, and has its own implementation details.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Choose Which?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Choose OpenAI Agents SDK (Chained Voice) When:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;You primarily want to work within the OpenAI ecosystem (Models, Tracing, Evals).&lt;/li&gt;
&lt;li&gt;Your application can tolerate the slightly higher latency inherent in the chained API calls.&lt;/li&gt;
&lt;li&gt;You prefer a simpler initial setup without managing WebRTC infrastructure.&lt;/li&gt;
&lt;li&gt;You need transparency with text transcripts at each stage (STT output, LLM input/output).&lt;/li&gt;
&lt;li&gt;Built-in, low-latency interruption handling is not a critical out-of-the-box requirement.&lt;/li&gt;
&lt;li&gt;Your core logic is already text-based, and you're adding a voice interface.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Choose LiveKit Agents Framework When:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimizing latency&lt;/strong&gt; is paramount for natural turn-taking.&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;robust, built-in interruption handling&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You require &lt;strong&gt;flexibility to choose and easily swap&lt;/strong&gt; different STT, LLM, and TTS providers (including non-OpenAI or self-hosted).&lt;/li&gt;
&lt;li&gt;You need fine-grained control over the real-time audio/video streams (WebRTC).&lt;/li&gt;
&lt;li&gt;You are building applications that inherently benefit from a "room"-based model (e.g., multiple users, agent joining calls).&lt;/li&gt;
&lt;li&gt;Scalability for many concurrent real-time connections is a primary concern.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenAI's introduction of the Agents SDK, especially with its voice capabilities, provides a compelling and relatively straightforward path for developers already invested in their ecosystem to build voice agents. The VoicePipeline abstracts away some of the complexities of the STT → LLM → TTS chain. Its strengths lie in integration with OpenAI's tools (like tracing) and the declarative nature of defining agents, tools, and handoffs.&lt;/p&gt;

&lt;p&gt;However, for applications demanding the absolute lowest latency, seamless interruption handling, and maximum flexibility in choosing underlying AI models, the WebRTC-based approach offered by frameworks like LiveKit Agents remains a very strong contender. It requires more infrastructure setup but provides unparalleled control over the real-time aspects of the conversation.&lt;/p&gt;

&lt;p&gt;The choice depends heavily on your specific project requirements, tolerance for latency, need for flexibility, and existing technology stack. Both approaches offer powerful ways to move beyond simple bots and create truly interactive voice AI experiences.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explore the &lt;a href="https://openai.github.io/openai-agents-python/" rel="noopener noreferrer"&gt;OpenAI Agents SDK Documentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Check out the &lt;a href="https://github.com/openai/openai-agents-python" rel="noopener noreferrer"&gt;OpenAI Agents GitHub Repository&lt;/a&gt; and the &lt;a href="https://github.com/openai/openai-agents-python/tree/main/examples/voice" rel="noopener noreferrer"&gt;voice example&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Learn about the &lt;a href="https://platform.openai.com/docs/guides/realtime" rel="noopener noreferrer"&gt;OpenAI Realtime API&lt;/a&gt; for speech-to-speech.&lt;/li&gt;
&lt;li&gt;Revisit the &lt;a href="https://docs.livekit.io/agents/" rel="noopener noreferrer"&gt;LiveKit Agents Documentation&lt;/a&gt; for comparison.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What are your thoughts on these different approaches to building voice agents? Let me know in the comments!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>openai</category>
      <category>python</category>
    </item>
    <item>
      <title>Building Multi-Agent Conversations with WebRTC &amp; LiveKit</title>
      <dc:creator>Roman Piacquadio</dc:creator>
      <pubDate>Thu, 10 Apr 2025 12:53:13 +0000</pubDate>
      <link>https://dev.to/cloudx/building-multi-agent-conversations-with-webrtc-livekit-48f1</link>
      <guid>https://dev.to/cloudx/building-multi-agent-conversations-with-webrtc-livekit-48f1</guid>
      <description>&lt;h2&gt;
  
  
  From Simple Bots to Dynamic Conversations
&lt;/h2&gt;

&lt;p&gt;We've all seen the basic voice AI demos – ask a question, get an answer. But real-world interactions often involve multiple stages, different roles, or specialized knowledge. How do you build a voice AI system that can gracefully handle introductions, gather information, perform a core task, and then provide a conclusion, potentially using different AI "personalities" or models along the way?&lt;/p&gt;

&lt;p&gt;Chaining traditional REST API calls for STT, LLM, and TTS already introduces latency and state management headaches for a &lt;em&gt;single&lt;/em&gt; agent turn. Trying to orchestrate &lt;em&gt;multiple&lt;/em&gt; logical agents or conversational stages this way becomes exponentially more complex, laggy, and brittle.&lt;/p&gt;

&lt;p&gt;This article explores a powerful solution: building &lt;strong&gt;multi-agent voice AI sessions&lt;/strong&gt; using &lt;strong&gt;WebRTC&lt;/strong&gt; for real-time communication and the &lt;strong&gt;LiveKit Agents framework&lt;/strong&gt; for orchestration. We'll look at a practical Python example of a "storyteller" agent that first gathers user info and then hands off to a specialized story-generating agent, all within a single, low-latency voice call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Does the Standard API Approach Fall Short (Especially for Multi-Agent)?
&lt;/h2&gt;

&lt;p&gt;The typical STT → LLM → TTS cycle via separate API calls suffers from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🐌 &lt;strong&gt;High Latency:&lt;/strong&gt; Each step adds delay, making turn-taking slow.&lt;/li&gt;
&lt;li&gt;💸 &lt;strong&gt;Potential High Cost:&lt;/strong&gt; Multiple API calls per user turn can get expensive.&lt;/li&gt;
&lt;li&gt;🧠 &lt;strong&gt;State Management Hell:&lt;/strong&gt; Keeping track of conversation history and shared data &lt;em&gt;across different logical stages or agents&lt;/em&gt; is difficult with stateless APIs.&lt;/li&gt;
&lt;li&gt;📉 &lt;strong&gt;Reliability &amp;amp; Scalability Issues:&lt;/strong&gt; A single backend process trying to juggle multiple users &lt;em&gt;and&lt;/em&gt; complex state logic becomes a bottleneck and point of failure.&lt;/li&gt;
&lt;li&gt;😠 &lt;strong&gt;Robotic Interaction:&lt;/strong&gt; Difficulty handling interruptions or smoothly transitioning between conversational goals.&lt;/li&gt;
&lt;/ul&gt;
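&lt;p&gt;To see why this adds up, sketch a per-turn latency budget. The figures below are rough assumptions for illustration, not measurements of any particular provider:&lt;/p&gt;

```python
# Illustrative per-turn latency budget for a chained STT -> LLM -> TTS pipeline.
# All figures are assumed round-trip times in milliseconds, not measurements.
STAGE_LATENCY_MS = {
    "stt_request": 400,       # upload audio chunk + receive transcript
    "llm_response": 900,      # chat completion round trip
    "tts_request": 500,       # synthesize + download audio
    "network_overhead": 150,  # extra hops between your backend and each API
}

total_ms = sum(STAGE_LATENCY_MS.values())
print(f"Estimated silence per turn: {total_ms} ms")
```

&lt;p&gt;Roughly two seconds of dead air per turn is enough to make a conversation feel robotic, which is exactly what streaming transports are designed to avoid.&lt;/p&gt;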

&lt;h2&gt;
  
  
  The Solution: Real-Time Foundations with WebRTC &amp;amp; LiveKit
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;WebRTC (Web Real-Time Communication):&lt;/strong&gt; This browser and mobile standard allows direct, low-latency audio/video/data streaming between participants. It's the foundation for seamless voice calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LiveKit:&lt;/strong&gt; An open-source infrastructure layer that makes building scalable, reliable WebRTC applications &lt;em&gt;much&lt;/em&gt; easier. For voice AI, it provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Signaling &amp;amp; Room Management:&lt;/strong&gt; Handles participant connections, discovery, and the state of the "room" where the conversation happens. This is our "virtual conference room."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized Media Streaming:&lt;/strong&gt; Ensures audio flows efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Designed for many concurrent users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Framework (&lt;code&gt;livekit-agents&lt;/code&gt;):&lt;/strong&gt; A specific Python library (and SDKs for other languages) built on LiveKit, designed explicitly for creating voice (and text) AI agents. It manages the complexities of:

&lt;ul&gt;
&lt;li&gt;Real-time audio stream processing.&lt;/li&gt;
&lt;li&gt;Integrating STT, LLM, TTS plugins.&lt;/li&gt;
&lt;li&gt;Handling interruptions (using VAD - Voice Activity Detection).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Crucially for this article: Managing multiple agents within a single session and facilitating handoffs and shared state.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Multi-Agent Architecture with LiveKit Agents
&lt;/h2&gt;

&lt;p&gt;The provided example demonstrates a pattern where agents collaborate within a single &lt;code&gt;AgentSession&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;User &amp;amp; Session Start:&lt;/strong&gt; The user connects to a LiveKit room. An &lt;code&gt;AgentSession&lt;/code&gt; is created, managing the overall interaction and holding shared &lt;code&gt;userdata&lt;/code&gt;. The &lt;em&gt;initial&lt;/em&gt; agent (&lt;code&gt;IntroAgent&lt;/code&gt;) is added to the session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agent 1 (&lt;code&gt;IntroAgent&lt;/code&gt;) Execution:&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* Receives the user's audio stream.
* Uses its configured STT, LLM (e.g., GPT-4o-mini), and TTS to interact according to its specific `instructions` (gather name and location).
* The LLM identifies when the required information is gathered.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Agent Handoff:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* The LLM triggers a specific function call defined on `IntroAgent` (e.g., `information_gathered`).
* This function receives the gathered data (`name`, `location`).
* It **updates the shared `userdata`** within the `AgentSession`.
* It **creates an instance of the *next* agent** (`StoryAgent`), potentially passing it the gathered data or the existing chat context.
* It **returns the `StoryAgent` instance** to the `AgentSession`.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Agent 2 (&lt;code&gt;StoryAgent&lt;/code&gt;) Activation:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* The `AgentSession` replaces `IntroAgent` with `StoryAgent`.
* `StoryAgent`'s `on_enter` method might be called to kick off its part of the conversation.
* It now handles the user's audio stream, using its *own* potentially different configuration (e.g., real-time OpenAI LLM with built-in voice) and `instructions` (tell a story using the name/location from `userdata`).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="5"&gt;
&lt;li&gt;&lt;strong&gt;Further Interaction &amp;amp; Termination:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;* The `StoryAgent` interacts with the user.
* When the story concludes (potentially triggered by another function call like `story_finished`), the agent can generate a final message and even initiate disconnecting the user or closing the LiveKit room.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feibx71borgo1xvoagvwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feibx71borgo1xvoagvwj.png" alt="Simple diagram showing User -&amp;gt; AgentSession -&amp;gt; Agent 1 &amp;lt;-&amp;gt; Agent 2" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;(Diagram: User interacts with the AgentSession, which activates Agent 1 or Agent 2. Agents can access/modify shared userdata and trigger handoffs.)&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Let's Build: The Multi-Agent Storyteller (Python)
&lt;/h2&gt;

&lt;p&gt;Let's break down the key parts of the provided code. (A link to the official LiveKit repository appears at the end.)&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Prerequisites:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8+&lt;/li&gt;
&lt;li&gt;LiveKit Cloud account (&lt;a href="https://cloud.livekit.io/" rel="noopener noreferrer"&gt;free tier&lt;/a&gt;) or self-hosted server (URL, API Key, API Secret).&lt;/li&gt;
&lt;li&gt;OpenAI API Key.&lt;/li&gt;
&lt;li&gt;Deepgram API Key.&lt;/li&gt;
&lt;li&gt;(Optional) Silero VAD model (downloaded automatically by the plugin).&lt;/li&gt;
&lt;li&gt;Install libraries:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"livekit-agents[openai,silero,deepgram]~=1.0rc"&lt;/span&gt; python-dotenv &lt;span class="c"&gt;# Add any other plugins you might use&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  2. Setup (&lt;code&gt;.env&lt;/code&gt; file):
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env&lt;/span&gt;
&lt;span class="nv"&gt;LIVEKIT_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"wss://YOUR_PROJECT_URL.livekit.cloud"&lt;/span&gt;
&lt;span class="nv"&gt;LIVEKIT_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"YOUR_LK_API_KEY"&lt;/span&gt;
&lt;span class="nv"&gt;LIVEKIT_API_SECRET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"YOUR_LK_API_SECRET"&lt;/span&gt;

&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-..."&lt;/span&gt;
&lt;span class="nv"&gt;DEEPGRAM_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;
&lt;span class="c"&gt;# Optional webhook for StoryAgent's finish function if needed&lt;/span&gt;
&lt;span class="c"&gt;# CRM_WEBHOOK_URL="..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  3. Core Agent Code Breakdown (agent.py):
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Imports:&lt;/strong&gt; Loads essential Python standard libraries (&lt;code&gt;logging&lt;/code&gt;, &lt;code&gt;dataclasses&lt;/code&gt;, &lt;code&gt;typing&lt;/code&gt;, &lt;code&gt;dotenv&lt;/code&gt;) and core components from the &lt;code&gt;livekit-agents&lt;/code&gt; SDK:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Agent&lt;/code&gt;, &lt;code&gt;AgentSession&lt;/code&gt;, and &lt;code&gt;RunContext&lt;/code&gt;: Power the lifecycle and logic of agents.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RoomInputOptions&lt;/code&gt; and &lt;code&gt;RoomOutputOptions&lt;/code&gt;: Configure how audio is handled in the LiveKit room.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;function_tool&lt;/code&gt;: Exposes Python functions as tools the LLM can call as part of the agent logic.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deepgram&lt;/code&gt;, &lt;code&gt;openai&lt;/code&gt;, &lt;code&gt;silero&lt;/code&gt;: Pre-built plugin integrations for STT (speech-to-text), LLM (language model), TTS (text-to-speech), and VAD (voice activity detection).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup gives you full control and &lt;strong&gt;modularity&lt;/strong&gt; over what models and services each agent uses for voice and language tasks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;livekit&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;api&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;livekit.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;JobProcess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RoomInputOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RoomOutputOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RunContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;WorkerOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cli&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;livekit.agents.job&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_current_job_context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;livekit.agents.llm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;function_tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;livekit.agents.voice&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MetricsCollectedEvent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;livekit.plugins&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deepgram&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;silero&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logger &amp;amp; Environment Setup:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;logger = logging.getLogger("multi-agent")&lt;/code&gt;: Sets up a logger instance named &lt;code&gt;"multi-agent"&lt;/code&gt; for outputting logs throughout the agent lifecycle. Helpful for debugging and usage tracking.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;load_dotenv()&lt;/code&gt;: Loads environment variables from a &lt;code&gt;.env&lt;/code&gt; file, allowing credentials or config values (e.g., API keys for OpenAI or Deepgram) to be securely managed outside the code.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multi-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instruction Prompt:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;common_instructions&lt;/code&gt;: A base prompt string used to define the persona and behavior of both agents. In this case, the agent is introduced as &lt;strong&gt;"Echo"&lt;/strong&gt;, a friendly and curious storyteller who interacts via voice.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;common_instructions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your name is Echo. You are a story teller that interacts with the user via voice.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;StoryData&lt;/code&gt; &lt;strong&gt;Dataclass:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StoryData&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This simple structure holds the shared state (user's name and location) that needs to persist between agents. It's attached to the &lt;code&gt;AgentSession&lt;/code&gt;.&lt;/p&gt;
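&lt;p&gt;Since both fields default to &lt;code&gt;None&lt;/code&gt;, the session can create the state object before anything has been gathered. A minimal sketch of the pattern, shown outside the framework:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoryData:
    name: Optional[str] = None
    location: Optional[str] = None

data = StoryData()        # created empty when the session starts
assert data.name is None  # nothing gathered yet

# IntroAgent's function tool fills these in mid-conversation;
# StoryAgent reads them later from the same shared instance.
data.name = "Ada"
data.location = "London"
print(data)  # StoryData(name='Ada', location='London')
```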

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IntroAgent&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;__init__&lt;/code&gt;: Sets specific &lt;code&gt;instructions&lt;/code&gt; to gather name and location. Uses default models configured later in the &lt;code&gt;AgentSession&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_enter&lt;/code&gt;: Called when this agent becomes active. It immediately prompts the LLM to generate a reply based on its instructions (the introduction).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;information_gathered&lt;/code&gt; &lt;strong&gt;Function Tool&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IntroAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;common_instructions&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; Your goal is to gather a few pieces of &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;information from the user to make the story personalized and engaging.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You should ask the user for their name and where they are from.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Start the conversation with a short introduction.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_enter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# when the agent is added to the session, it'll generate a reply
&lt;/span&gt;        &lt;span class="c1"&gt;# according to its instructions
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_reply&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@function_tool&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;information_gathered&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RunContext&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;StoryData&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Called when the user has provided the information needed to make the story
        personalized and engaging.

        Args:
            name: The name of the user
            location: The location of the user
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;

        &lt;span class="n"&gt;story_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StoryAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# by default, StoryAgent will start with a new context, to carry through the current
&lt;/span&gt;        &lt;span class="c1"&gt;# chat history, pass in the chat_ctx
&lt;/span&gt;        &lt;span class="c1"&gt;# story_agent = StoryAgent(name, location, chat_ctx=context.chat_ctx)
&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;story_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Let&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s start the story!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the key handoff mechanism: the LLM calls this function, which updates the shared &lt;code&gt;userdata&lt;/code&gt;, creates a &lt;code&gt;StoryAgent&lt;/code&gt;, and returns it to the session.&lt;/p&gt;
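&lt;p&gt;Stripped of the LiveKit machinery, the handoff boils down to a tool function that mutates shared state and returns the next agent. A simplified pure-Python sketch (the class and method names mirror the example above, but this is not the framework's API):&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoryData:
    name: Optional[str] = None
    location: Optional[str] = None

class StoryAgent:
    def on_enter(self, userdata: StoryData) -> str:
        # The second agent reads the state the first agent wrote.
        return f"Once upon a time, {userdata.name} from {userdata.location}..."

class IntroAgent:
    def information_gathered(self, userdata: StoryData, name: str, location: str):
        # Mutate shared state, then hand off by returning the next agent.
        userdata.name = name
        userdata.location = location
        return StoryAgent(), "Let's start the story!"

# The session owns one userdata object and swaps the active agent whenever
# a tool call returns a new agent instance.
userdata = StoryData()
active_agent, reply = IntroAgent().information_gathered(userdata, "Ada", "London")
print(reply)
print(active_agent.on_enter(userdata))
```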

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;StoryAgent&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;__init__&lt;/code&gt;: Takes &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;location&lt;/code&gt; (retrieved from &lt;code&gt;userdata&lt;/code&gt; by the caller). Sets its own &lt;code&gt;instructions&lt;/code&gt; incorporating this data.&lt;/li&gt;
&lt;li&gt;Crucially, it overrides the LLM to use &lt;code&gt;openai.realtime.RealtimeModel&lt;/code&gt; (which includes voice output) and sets &lt;code&gt;tts=None&lt;/code&gt;. This shows agent-specific model configuration.&lt;/li&gt;
&lt;li&gt;It can optionally receive the &lt;code&gt;chat_ctx&lt;/code&gt; to continue the history.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_enter&lt;/code&gt;: Similar to &lt;code&gt;IntroAgent&lt;/code&gt;, starts the interaction.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;story_finished&lt;/code&gt; &lt;strong&gt;Function Tool&lt;/strong&gt;: Allows the LLM to signal the end of the story, generate a goodbye, and terminate the room via the LiveKit API.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StoryAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;common_instructions&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. You should use the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s information in &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order to make the story personalized.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create the entire story, weaving in elements of their information, and make it &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;interactive, occasionally interating with the user.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;do not end on a statement, where the user is not expected to respond.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;when interrupted, ask if the user would like to continue or end.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s name is &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# each agent could override any of the model services, including mixing
&lt;/span&gt;            &lt;span class="c1"&gt;# realtime and non-realtime models
&lt;/span&gt;            &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;realtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RealtimeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;echo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;chat_ctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chat_ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_enter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# when the agent is added to the session, we'll initiate the conversation by
&lt;/span&gt;        &lt;span class="c1"&gt;# using the LLM to generate a reply
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_reply&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@function_tool&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;story_finished&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;RunContext&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;StoryData&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;When you are fininshed telling the story (and the user confirms they don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t
        want anymore), call this function to end the conversation.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# interrupt any existing generation
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;interrupt&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# generate a goodbye message and hang up
&lt;/span&gt;        &lt;span class="c1"&gt;# awaiting it will ensure the message is played out before returning
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_reply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;say goodbye to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allow_interruptions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;job_ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_current_job_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;lkapi&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;job_ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;lkapi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_room&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DeleteRoomRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;prewarm&lt;/strong&gt; &lt;strong&gt;Function&lt;/strong&gt;: Loads the Silero VAD (Voice Activity Detection) model once when the worker starts, avoiding redundant loading for each session.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prewarm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JobProcess&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vad&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;silero&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;VAD&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;entrypoint&lt;/strong&gt; &lt;strong&gt;Function&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Connects to the LiveKit room.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creates the &lt;code&gt;AgentSession&lt;/code&gt;&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Passes the prewarmed VAD.&lt;/li&gt;
&lt;li&gt;Sets the default STT, LLM, and TTS plugins (agents can override these).&lt;/li&gt;
&lt;li&gt;Initializes the shared &lt;code&gt;userdata&lt;/code&gt; with an empty &lt;code&gt;StoryData&lt;/code&gt; instance.&lt;/li&gt;
&lt;li&gt;Sets up metrics collection (good practice!).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starts the session&lt;/strong&gt;:
&lt;ul&gt;
&lt;li&gt;Crucially, passes the initial agent (&lt;code&gt;IntroAgent()&lt;/code&gt;) to the &lt;code&gt;start&lt;/code&gt; method.&lt;/li&gt;
&lt;li&gt;Configures room input/output options (like noise cancellation or transcription).&lt;/li&gt;
&lt;li&gt;The worker then keeps the agent process alive for the duration of the session.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;entrypoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;JobContext&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AgentSession&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;StoryData&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;
        &lt;span class="n"&gt;vad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vad&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="c1"&gt;# any combination of STT, LLM, TTS, or realtime API can be used
&lt;/span&gt;        &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;deepgram&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;STT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nova-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;tts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TTS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;echo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini-tts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;userdata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StoryData&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# log metrics as they are emitted, and total usage after session is over
&lt;/span&gt;    &lt;span class="n"&gt;usage_collector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;UsageCollector&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@session.on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics_collected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_on_metrics_collected&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MetricsCollectedEvent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;usage_collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_usage&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;usage_collector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Usage: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_shutdown_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_usage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;IntroAgent&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;room&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;room_input_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RoomInputOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="c1"&gt;# uncomment to enable Krisp BVC noise cancellation
&lt;/span&gt;            &lt;span class="c1"&gt;# noise_cancellation=noise_cancellation.BVC(),
&lt;/span&gt;        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;room_output_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RoomOutputOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcription_enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cli&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_app&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;WorkerOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entrypoint_fnc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;entrypoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prewarm_fnc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prewarm&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Running &amp;amp; Connecting:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Run the agent: &lt;code&gt;python agent.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Connect using the Agent Playground (link below) or your own client. Point it at your LiveKit instance and make sure you join the room the agent is listening on (usually determined by how the agent job is launched or configured).&lt;/li&gt;
&lt;/ul&gt;
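&lt;p&gt;A minimal launch sketch (assuming a standard LiveKit Agents setup; the environment variable names and CLI subcommands below are the usual defaults exposed by &lt;code&gt;cli.run_app&lt;/code&gt;, so verify them against your framework version):&lt;/p&gt;

```shell
# Credentials for your LiveKit instance (Cloud or self-hosted)
export LIVEKIT_URL="wss://your-project.livekit.cloud"
export LIVEKIT_API_KEY="your-api-key"
export LIVEKIT_API_SECRET="your-api-secret"

# Keys for the plugins used in this example
export OPENAI_API_KEY="your-openai-key"
export DEEPGRAM_API_KEY="your-deepgram-key"

# Hot-reloading development mode; use `start` for production workers
python agent.py dev
```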




&lt;h3&gt;
  
  
  Why This Multi-Agent Approach Rocks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;🧩 &lt;strong&gt;Modular Roles&lt;/strong&gt;: Each agent focuses on a specific task with its own instructions and even different AI models.&lt;/li&gt;
&lt;li&gt;🧼 &lt;strong&gt;Clean State Management&lt;/strong&gt;: &lt;code&gt;userdata&lt;/code&gt; provides a clear way to share necessary information between agents.&lt;/li&gt;
&lt;li&gt;🔁 &lt;strong&gt;Seamless Handoffs&lt;/strong&gt;: Function calls provide a natural mechanism for transitioning conversational stages.&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Low Latency&lt;/strong&gt;: Still benefits from WebRTC’s real-time streaming.&lt;/li&gt;
&lt;li&gt;🧠 &lt;strong&gt;Flexibility&lt;/strong&gt;: Mix and match standard and real-time models, different STT/TTS providers per agent.&lt;/li&gt;
&lt;li&gt;🏗️ &lt;strong&gt;Scalable&lt;/strong&gt;: Built on the robust LiveKit infrastructure.&lt;/li&gt;
&lt;/ul&gt;
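&lt;p&gt;To see the handoff-plus-shared-state pattern in isolation, here is a framework-free Python sketch (all names are illustrative; none of this is the LiveKit API): a tool-style handler writes to a shared dataclass and returns the next agent, mirroring how the intro agent hands off to &lt;code&gt;StoryAgent&lt;/code&gt;:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

# Shared state, analogous to the session's userdata.
@dataclass
class StoryData:
    name: Optional[str] = None
    location: Optional[str] = None

class StoryAgent:
    def opening_line(self, userdata):
        # Reads the state written by the previous agent.
        return f"Once upon a time, {userdata.name} of {userdata.location} set out on an adventure."

class IntroAgent:
    def information_gathered(self, userdata, name, location):
        # Store what the user told us, then hand off by returning the next agent.
        userdata.name = name
        userdata.location = location
        return StoryAgent()

data = StoryData()
agent = IntroAgent().information_gathered(data, "Ada", "London")
print(agent.opening_line(data))
# → Once upon a time, Ada of London set out on an adventure.
```

&lt;p&gt;Returning the next agent from the handler keeps the transition logic next to the data that triggered it, which is roughly the shape the framework's function tools follow.&lt;/p&gt;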




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Managing complex, multi-stage voice conversations requires moving beyond simple request-response cycles. The LiveKit Agents framework, built on the real-time foundation of WebRTC, provides elegant solutions for orchestrating multiple agents, managing shared state, and facilitating smooth handoffs – all while maintaining low latency.&lt;/p&gt;

&lt;p&gt;This storyteller example showcases the power of this approach, allowing different AI "personalities" or specialists to collaborate within a single, natural-feeling voice session.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dive deeper into the &lt;a href="https://docs.livekit.io/agents" rel="noopener noreferrer"&gt;LiveKit Agents Documentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Explore the full &lt;a href="https://github.com/livekit/agents/blob/dev-1.0/examples/voice_agents/multi_agent.py" rel="noopener noreferrer"&gt;multi-agent example code&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For a ready-made client, explore the &lt;a href="https://github.com/livekit/agents-playground/tree/main" rel="noopener noreferrer"&gt;agent-playground example code&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Sign up for &lt;a href="https://cloud.livekit.io" rel="noopener noreferrer"&gt;LiveKit Cloud&lt;/a&gt; to get started quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What kind of multi-agent voice interactions would you build with this? Share your ideas in the comments!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>voiceai</category>
      <category>agents</category>
      <category>openai</category>
    </item>
  </channel>
</rss>
