<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mart Schweiger</title>
    <description>The latest articles on DEV Community by Mart Schweiger (@martschweiger).</description>
    <link>https://dev.to/martschweiger</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3802221%2Fcdb4c7a2-d4f4-444d-908e-30d6ea3bd1a7.png</url>
      <title>DEV Community: Mart Schweiger</title>
      <link>https://dev.to/martschweiger</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/martschweiger"/>
    <language>en</language>
    <item>
      <title>How accurate are AI transcripts for technical or medical terms?</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 09 Jun 2026 16:48:25 +0000</pubDate>
      <link>https://dev.to/martschweiger/how-accurate-are-ai-transcripts-for-technical-or-medical-terms-2igb</link>
      <guid>https://dev.to/martschweiger/how-accurate-are-ai-transcripts-for-technical-or-medical-terms-2igb</guid>
      <description>&lt;p&gt;A cardiologist dictates "Start metoprolol 25 mg twice daily" into an &lt;a href="https://www.assemblyai.com/blog/ambient-ai-scribe" rel="noopener noreferrer"&gt;ambient scribe&lt;/a&gt;. The transcript reads "Start metoclopramide 250 mg twice daily." One's a beta-blocker for heart failure. The other's an anti-nausea drug at ten times the intended dose. Two words changed, and you've got a completely different medication at a dangerous dosage sitting in a patient's chart.&lt;/p&gt;

&lt;p&gt;This isn't hypothetical. Medication errors are the most frequent and avoidable source of patient harm in &lt;a href="https://www.assemblyai.com/solutions/medical" rel="noopener noreferrer"&gt;healthcare&lt;/a&gt;, and transcription mistakes are a direct contributor. But the problem isn't limited to medicine. Legal teams deal with mangled case citations. Engineers watch product names get butchered in meeting transcripts. &lt;a href="https://www.assemblyai.com/solutions/contact-centers" rel="noopener noreferrer"&gt;Contact centers&lt;/a&gt; lose critical account numbers to misrecognition.&lt;/p&gt;

&lt;p&gt;The accuracy of AI transcripts on technical and domain-specific terminology is what separates a useful tool from a liability. This article breaks down why specialized terms are so hard for &lt;a href="https://www.assemblyai.com/products/speech-to-text" rel="noopener noreferrer"&gt;speech-to-text&lt;/a&gt; models, how to actually measure accuracy in ways that matter, and the specific tools and techniques you can use to get transcripts right on the terminology that counts.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why technical and medical terms are hard for speech-to-text&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Standard speech-to-text models are trained on general language — conversations, podcasts, meetings, news broadcasts. They're optimized for the words people say most often. But technical and medical vocabulary lives in a completely different distribution. These terms are rare in general training data, phonetically complex, and full of ambiguity that only domain context can resolve.&lt;/p&gt;

&lt;p&gt;Here's what makes them so challenging.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Similar-sounding terminology&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Medical vocabulary is packed with near-homophones that mean completely different things. "Metoprolol" (a beta-blocker for heart conditions) and "metoclopramide" (an anti-nausea medication) sound almost identical when spoken quickly. "Celebrex" (an anti-inflammatory) and "Celexa" (an antidepressant) are one syllable apart. "Lamictal" (for seizures) and "Lamisil" (for fungal infections) could easily be swapped by a model that hasn't learned the clinical context.&lt;/p&gt;

&lt;p&gt;Drug names are particularly treacherous because every medication has at least two names — brand and generic — and many have common abbreviations on top of that. A single medication might be referred to as "acetaminophen," "Tylenol," or "APAP" in the same conversation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Abbreviations with multiple meanings&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;"PT" means physical therapy in an orthopedic note and prothrombin time in a lab report. "MS" could be multiple sclerosis, morphine sulfate, or mitral stenosis. "CA" might refer to cancer, calcium, or cardiac arrest depending on the specialty. Without understanding the clinical context, an AI model has no way to expand these abbreviations correctly — or even know whether to expand them at all.&lt;/p&gt;

&lt;p&gt;This problem extends well beyond medicine. In software engineering, "GC" could mean garbage collection or Google Cloud. In legal proceedings, "motion" has a specific procedural meaning that general models might not preserve. Domain abbreviations are everywhere, and they're inherently ambiguous without context.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Rapid dictation and environmental noise&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Clinicians don't dictate like news anchors. They speak fast, often while multitasking — walking between patient rooms, reviewing charts, or performing procedures. The speech patterns are clipped, full of self-corrections, and loaded with jargon that comes out in rapid bursts. Add to that the background noise of a busy clinical environment — beeping monitors, overlapping conversations, equipment sounds — and you've got audio conditions that push any model to its limits.&lt;/p&gt;

&lt;p&gt;Technical meetings have their own version of this problem. Engineers talking over each other about system architecture, rattling off API names and version numbers, switching between code terminology and plain English mid-sentence. The combination of speed, noise, and vocabulary complexity creates a perfect storm for transcription errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;General models aren't built for this&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The thing is, most ASR models are optimized to minimize overall Word Error Rate across general-purpose audio. They learn statistical patterns from large datasets of conversational speech. A word like "metformin" might appear thousands of times less frequently than "met for men" in general training data, so the model defaults to the more statistically likely interpretation. The model isn't wrong from a probability standpoint — it just doesn't have the domain knowledge to know that "met for men" makes no sense in a clinical context.&lt;/p&gt;

&lt;p&gt;This is why specialized approaches matter. General accuracy and domain accuracy are fundamentally different problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to measure accuracy on specialized terms&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before you can improve accuracy on technical vocabulary, you need to know how to measure it properly. And the standard metric most vendors advertise — Word Error Rate — doesn't tell you what you actually need to know.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The problem with Word Error Rate&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Word Error Rate (WER) measures how many word-level edits separate a transcript from a reference: (substitutions + deletions + insertions) / total words in the reference × 100. A transcript with 5% WER sounds impressive: roughly 95 out of every 100 words are correct. But WER treats every word equally. Missing the word "um" gets the same penalty as changing "15 mg" to "50 mg."&lt;/p&gt;

&lt;p&gt;Consider this: a transcript could reach 98% word accuracy (a 2% WER) while containing a single error that changes "no known allergies" to "known allergies." That's one deleted word out of hundreds, barely a blip in the WER calculation, but it's a potentially fatal mistake in an emergency room.&lt;/p&gt;
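
&lt;p&gt;To make the formula concrete, here is a minimal Python sketch that computes WER with a word-level edit distance. The function name and the sample sentences are illustrative assumptions, not part of any vendor's evaluation tooling, and the normalization (lowercasing and whitespace splitting) is deliberately simple.&lt;/p&gt;

```python
def word_error_rate(reference: str, hypothesis: str):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dist[i][j] = fewest word edits turning the first i reference words
    # into the first j hypothesis words (a standard Levenshtein table).
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,  # deletion
                dist[i][j - 1] + 1,  # insertion
                dist[i - 1][j - 1] + substitution,
            )
    return dist[len(ref)][len(hyp)] / len(ref)


# Illustrative sentences (assumed for this example, not real clinical data).
reference = "patient has no known allergies and takes metoprolol 25 mg twice daily"
hypothesis = "patient has known allergies and takes metoprolol 25 mg twice daily"

# One deleted word out of 12 reference words.
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")  # prints "WER: 8.3%"
```

&lt;p&gt;A single deleted "no" costs about 8% WER in this short sentence and far less in a full visit transcript, which is why a low WER on its own says little about clinical safety.&lt;/p&gt;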

&lt;p&gt;WER is useful for comparing overall model quality, but it's a poor proxy for clinical or domain accuracy. You need metrics that weight errors by their actual impact.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Missed Entity Rate: a better measure for domain accuracy&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Missed Entity Rate (MER) specifically measures how often a model fails to correctly transcribe named entities — drug names, dosages, proper nouns, technical terms, and other domain-critical vocabulary. This metric focuses on the words that actually matter for downstream decision-making.&lt;/p&gt;

&lt;p&gt;So when you're evaluating &lt;a href="https://www.assemblyai.com/blog/best-medical-speech-to-text" rel="noopener noreferrer"&gt;medical speech-to-text software&lt;/a&gt; for technical or medical use cases, MER gives you a much clearer picture than WER alone. A model with slightly higher WER but significantly lower MER is almost always the better choice for domain applications.&lt;/p&gt;
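
&lt;p&gt;One way to approximate this kind of measurement on your own test set is sketched below. It assumes you have a hand-labeled list of critical entities per recording and counts an entity as missed when it does not appear word-for-word in the transcript. That matching rule, the function names, and the sample data are assumptions for illustration; published benchmarks may define and score entities differently.&lt;/p&gt;

```python
import re


def _normalize(text: str):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def missed_entity_rate(reference_entities, hypothesis: str):
    """Return (rate, missed): the share of reference entities absent from the hypothesis."""
    if not reference_entities:
        raise ValueError("need at least one reference entity")
    # Pad with spaces so that "25 mg" cannot match inside "250 mg".
    padded = f" {_normalize(hypothesis)} "
    missed = [e for e in reference_entities if f" {_normalize(e)} " not in padded]
    return len(missed) / len(reference_entities), missed


# Hand-labeled entities for one utterance (assumed sample data).
entities = ["metoprolol", "25 mg", "twice daily"]
hypothesis = "Start metoclopramide 250 mg twice daily."

rate, missed = missed_entity_rate(entities, hypothesis)
print(f"Missed Entity Rate: {rate:.0%}, missed: {missed}")
```

&lt;p&gt;Here the drug and the dose are both missed while the frequency survives, so two of the three entities are lost even though most of the words match.&lt;/p&gt;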

&lt;h3&gt;
  
  
  &lt;strong&gt;How the models actually compare&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's &lt;a href="https://www.assemblyai.com/universal-3-pro" rel="noopener noreferrer"&gt;Universal-3 Pro&lt;/a&gt; with Medical Mode delivers a 3.2% Missed Entity Rate on medical terminology, compared to Deepgram Nova-3 Medical at 8.7% and AWS Transcribe Medical at 24.4%. On Word Error Rate for medical audio, the numbers are 5.3% for AssemblyAI versus 5.9% for Deepgram and 12.9% for AWS.&lt;/p&gt;

&lt;p&gt;These benchmarks matter because they're measured on real clinical audio, not cherry-picked samples. The gap between providers on &lt;a href="https://www.assemblyai.com/blog/medical-transcription-accuracy" rel="noopener noreferrer"&gt;medical terminology accuracy&lt;/a&gt; is substantial: nearly 3x between AssemblyAI and Deepgram on MER, and more than 7x between AssemblyAI and AWS. When you're building a product where medication names and dosages need to be right, that difference is the entire product.&lt;/p&gt;

&lt;p&gt;For non-medical technical domains, the picture is similar. Universal-3 Pro achieves the lowest missed entity rates across categories including names, locations, organizations, emails, URLs, and phone numbers when compared against Amazon Transcribe, Deepgram Nova 3, ElevenLabs Scribe 2, Microsoft Azure, and OpenAI GPT-4o Transcribe.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Tools for improving accuracy on domain-specific terms&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Knowing the problem exists is step one. Here's how to actually fix it. AssemblyAI provides several features specifically designed to improve transcription accuracy on technical and domain-specific vocabulary. Each one targets a different aspect of the problem, and they can be combined for maximum effect.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Medical Mode&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/medical-mode" rel="noopener noreferrer"&gt;Medical Mode&lt;/a&gt; is a purpose-built add-on that enhances transcription accuracy for medical terminology — medication names, procedures, conditions, and dosages. It reduces missed medical entities by over 20% compared to Universal-3 Pro alone, and it's optimized specifically for medical entity recognition to correct terms that general models frequently get wrong.&lt;/p&gt;

&lt;p&gt;You enable it by setting a single parameter: domain="medical-v1". No changes to your existing pipeline are required.&lt;/p&gt;

&lt;p&gt;Medical Mode supports English, Spanish, German, and French, and works with all of AssemblyAI's pre-recorded and &lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;streaming speech-to-text&lt;/a&gt; models. It's billed as a separate add-on at $0.15/hr.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # replace with your AssemblyAI API key

audio_file = "https://assembly.ai/lispro"

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    language_detection=True,
    domain="medical-v1",
)

transcript = aai.Transcriber().transcribe(audio_file, config)

print(transcript.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The difference is immediately visible. Here's a real before-and-after example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without Medical Mode:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I have here insulin to be used for both prandial mealtime and sliding scale is — insulin lisprohumalog subcutaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With Medical Mode:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I have here insulin to be used for both prandial mealtime and sliding scale is — insulin Lispro (Humalog) subcutaneously.&lt;/p&gt;

&lt;p&gt;Medical Mode correctly formats the output following standard medical convention — generic name first, brand name in parentheses. That's not just cosmetic. It's the format clinicians expect and the format that reduces downstream errors in &lt;a href="https://www.assemblyai.com/blog/medical-transcription" rel="noopener noreferrer"&gt;clinical documentation&lt;/a&gt; systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Keyterms prompting&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Keyterms prompting lets you provide up to 1,000 words or phrases (maximum 6 words per phrase) to improve transcription accuracy for those specific terms and related variations. This is your go-to tool when you know which domain-specific words matter most for your use case.&lt;/p&gt;

&lt;p&gt;The key insight: you don't need to list every possible term. Keyterms prompting doesn't just match exact strings — it helps the model understand the semantic context around those terms, improving recognition of related terminology and contextually similar phrases as well.&lt;/p&gt;

&lt;p&gt;Start with no keyterms and add terms based on words you consistently see the model struggle with. Including too many common terms that are already well-represented in the training data can lead to overcorrections.&lt;/p&gt;
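
&lt;p&gt;If your candidate terms come from a spreadsheet or a formulary export, a small pre-processing step can keep the list within the limits described above. The helper below is a hypothetical sketch, not part of the AssemblyAI SDK: it trims whitespace, drops case-insensitive duplicates, and skips phrases longer than six words.&lt;/p&gt;

```python
MAX_KEYTERMS = 1000        # limit on phrases per request, as stated above
MAX_WORDS_PER_KEYTERM = 6  # limit on words per phrase, as stated above


def clean_keyterms(candidates):
    """Trim, deduplicate, and drop phrases that exceed the per-phrase word limit."""
    seen = set()
    kept = []
    for term in candidates:
        words = term.split()
        normalized = " ".join(words)
        key = normalized.lower()
        # Skip empty strings, over-long phrases, and case-insensitive duplicates.
        if len(words) not in range(1, MAX_WORDS_PER_KEYTERM + 1) or key in seen:
            continue
        seen.add(key)
        kept.append(normalized)
    # Simplest possible policy for the overall cap: keep the first 1,000.
    return kept[:MAX_KEYTERMS]


keyterms = clean_keyterms([
    "Lisinopril",
    "lisinopril",  # duplicate, dropped
    "  Humalog ",
    "insulin lispro sliding scale before each main meal",  # 8 words, dropped
])
print(keyterms)  # prints ['Lisinopril', 'Humalog']
```

&lt;p&gt;The cleaned list can then be passed as the keyterms_prompt value in your transcription config. Truncating at 1,000 entries is the simplest policy; in practice you would likely rank terms by how often the model misses them.&lt;/p&gt;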

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # replace with your AssemblyAI API key

audio_file = "https://assembly.ai/wildfires.mp3"

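config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    language_detection=True,
    # Keyterms prompting: terms the model should favor during recognition.
    keyterms_prompt=["Gettleman"],
)
# Custom spelling is a separate find-and-replace on the final transcript:
# each key is the spelling you want, each list holds the variants it replaces.
config.set_custom_spelling(
    {
        "Gettleman": ["gettleman"],
        "SQL": ["Sequel"],
    }
)

transcript = aai.Transcriber(config=config).transcribe(audio_file)

if transcript.status == "error":
    raise RuntimeError(f"Transcription failed: {transcript.error}")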

print(transcript.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This approach is particularly effective for proper nouns with unusual spellings, company-specific terminology, product names, and technical abbreviations that have domain-specific meanings.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Prompting with Universal-3 Pro&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Universal-3 Pro is a Speech-augmented Large Language Model (SpeechLLM) — which means it responds to natural language prompts that guide how it transcribes. You can use the prompt parameter to improve entity accuracy and provide domain context.&lt;/p&gt;

&lt;p&gt;For improving accuracy on technical terminology across any domain, use a prompt that describes the pattern of entities you want corrected:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use standard spelling and the most contextually correct spelling of all words including names, brands, drug names, medical terms, and proper nouns.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For providing domain-specific context that helps the model make better decisions about ambiguous terms, pair that with a context clue:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This is a doctor-patient visit. Prioritize accurately transcribing medications and diseases wherever possible.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;But here's where it gets interesting: context alone doesn't tell the model how to transcribe. "This is a doctor-patient visit" is context. "Prioritize accurately transcribing medications and diseases" is the actionable instruction. You need both. The context sets the domain; the instruction tells the model what to prioritize within that domain.&lt;/p&gt;

&lt;p&gt;A few important prompting principles to keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Describe the pattern of entities you want corrected, not specific errors — listing exact spellings often causes the model to hallucinate them&lt;/li&gt;
&lt;li&gt;If you know the exact terms you need, use keyterms prompting rather than describing them in a free-form prompt&lt;/li&gt;
&lt;li&gt;Start with the default prompt (which is already optimized for accuracy) and add one instruction at a time&lt;/li&gt;
&lt;li&gt;Use authoritative language — "Required:", "Mandatory:", and "Always:" get higher compliance than softer phrasing&lt;/li&gt;
&lt;/ul&gt;
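
&lt;p&gt;The context-plus-instruction pattern can be kept explicit in code, which also makes it easier to add one instruction at a time. The helper below is a hypothetical sketch (build_prompt is not an SDK function); the resulting string is what you would pass as the prompt parameter.&lt;/p&gt;

```python
def build_prompt(context: str, instructions):
    """Join one context sentence with one or more explicit instructions."""
    parts = [context.strip()] + [step.strip() for step in instructions]
    # End every part with a period so the sentences do not run together.
    sentences = [p if p.endswith(".") else p + "." for p in parts if p]
    return " ".join(sentences)


prompt = build_prompt(
    "This is a doctor-patient visit",  # context: sets the domain
    ["Prioritize accurately transcribing medications and diseases wherever possible"],  # instruction
)
print(prompt)
```

&lt;p&gt;Keeping the instructions in a list means each new one can be added, tested against the same audio, and removed again if it does not help.&lt;/p&gt;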

&lt;h3&gt;
  
  
  &lt;strong&gt;Combining features for maximum accuracy&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;These features aren't mutually exclusive. For the highest possible accuracy on medical or technical audio, combine Medical Mode, keyterms prompting, and &lt;a href="https://www.assemblyai.com/blog/what-is-speaker-diarization-and-how-does-it-work" rel="noopener noreferrer"&gt;speaker diarization&lt;/a&gt; in a single configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    language_detection=True,
    domain="medical-v1",
    speaker_labels=True,
    keyterms_prompt=["Lisinopril", "Metformin", "Humalog"],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This configuration gives you Medical Mode for broad medical entity recognition, keyterms prompting for specific drugs or terms unique to your use case, and speaker diarization to correctly attribute who said what — critical in clinical conversations where the difference between a patient reporting a symptom and a doctor noting a finding completely changes the medical meaning.&lt;/p&gt;

&lt;p&gt;For streaming applications, the same combination works. You can even update keyterms dynamically mid-stream as the conversation progresses — for example, switching from scheduling-related terms to clinical terms when a voice agent moves from appointment booking to a medical intake stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What about non-English languages?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Technical accuracy challenges get amplified when you add language diversity into the mix. Many speech-to-text providers see significant accuracy drops on non-English audio, especially for domain-specific terminology that may not appear frequently in multilingual training data.&lt;/p&gt;

&lt;p&gt;Universal-3 Pro supports English, Spanish, Portuguese, French, German, and Italian natively with code-switching — meaning it can handle audio where speakers switch between languages mid-conversation without requiring separate model configurations. For access to all 99 supported languages, use "speech_models": ["universal-3-pro", "universal-2"], which falls back to Universal-2 for languages Universal-3 Pro doesn't yet cover.&lt;/p&gt;

&lt;p&gt;Medical Mode specifically supports English, Spanish, German, and French for medical terminology enhancement. If you use Medical Mode with an unsupported language, the API ignores the domain parameter gracefully — your transcript is still returned using standard transcription, and you won't be charged for Medical Mode.&lt;/p&gt;

&lt;p&gt;For improving transcript accuracy on non-English technical content, the same strategies apply: use keyterms prompting for domain-specific terms in the target language, and use prompting to provide language-specific context. You can even prepend "Transcribe [language]" to your prompt to guide the model toward a specific language when you know it in advance.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to improve accuracy on poor-quality audio&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Even the best model can only work with the audio it receives. Poor recording conditions — compressed phone audio, background noise, far-field microphones, overlapping speakers — degrade accuracy on all vocabulary, but technical terms suffer disproportionately because they're already at the edge of the model's confidence.&lt;/p&gt;

&lt;p&gt;A few practical strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Invest in input quality:&lt;/strong&gt; High-quality microphones and noise-canceling technology make a measurable difference. For medical dictation workflows, this is one of the highest-ROI investments you can make.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use keyterms prompting aggressively:&lt;/strong&gt; When audio quality is poor, giving the model explicit guidance about which terms to expect helps it resolve ambiguous acoustic signals in favor of the correct domain terms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjust silence thresholds for medical audio:&lt;/strong&gt; Clinical conversations have different speech patterns than typical voice interactions. Doctors pause to think, review charts, or formulate diagnoses. Increasing silence thresholds (e.g., min_turn_silence: 800, max_turn_silence: 3600) prevents the model from fragmenting these natural pauses into separate turns, which can break context and reduce accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Combine multiple accuracy features:&lt;/strong&gt; Medical Mode + keyterms + prompting together provide more resilience against poor audio than any single feature alone, because each feature addresses a different source of error.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Real-world applications&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The techniques we've covered aren't theoretical. Here's how they play out across industries that depend on accurate transcription of specialized vocabulary.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Medical scribes and clinical documentation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Ambient clinical documentation is the fastest-growing application for &lt;a href="https://www.assemblyai.com/blog/medical-speech-to-text" rel="noopener noreferrer"&gt;medical speech-to-text&lt;/a&gt;. AI scribes listen during patient encounters and generate structured clinical notes — SOAP notes, discharge summaries, referral letters. The accuracy requirements are the highest in any industry because errors directly affect patient care.&lt;/p&gt;

&lt;p&gt;Medication names and dosages are the critical path. Getting "Ramipril 5 mg daily" right is what makes the note usable. Getting it wrong creates a documentation error that follows the patient through their entire care journey. Building an &lt;a href="https://www.assemblyai.com/blog/build-an-ai-medical-scribe-speech-to-text" rel="noopener noreferrer"&gt;AI medical scribe&lt;/a&gt; that clinicians trust requires Medical Mode combined with keyterms prompting for a practice's common formulary.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Legal transcription&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Legal proceedings have their own specialized vocabulary — case citations like "Duran v. Peabody Coal Company," Latin terms like "amicus curiae" and "voir dire," and procedural language like "motion for summary judgment" that has precise legal meaning. A deposition transcript that mangles case citations is useless for legal research.&lt;/p&gt;

&lt;p&gt;Keyterms prompting is the primary tool here. Legal teams can provide the specific case names, legal terms, and proper nouns they expect to appear, and the model adjusts its recognition accordingly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Technical meetings and engineering discussions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Product names, API endpoints, version numbers, acronyms — engineering conversations are dense with terminology that general models struggle with. "We need to migrate the CloudGuard SSO integration to v3.2" is the kind of sentence where every technical term matters and none of them appear in general conversational training data.&lt;/p&gt;

&lt;p&gt;Custom spelling lets you enforce exact formatting for your product vocabulary — ensuring "CloudGuard" stays as "CloudGuard" instead of becoming "cloud guard" or "Cloudguard." Keyterms prompting handles the broader technical vocabulary.&lt;/p&gt;
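
&lt;p&gt;To see what a find-and-replace style correction does to a transcript, the sketch below imitates the effect locally in plain Python. It is an illustration of the idea only, under the assumption of whole-word, case-insensitive matching; it is not how the API implements custom spelling, and apply_custom_spelling is a made-up name.&lt;/p&gt;

```python
import re


def apply_custom_spelling(text: str, spelling_map):
    """Replace each listed variant with its canonical spelling, ignoring case."""
    for canonical, variants in spelling_map.items():
        for variant in variants:
            # \b keeps the match on whole words; re.escape treats the variant literally.
            pattern = r"\b" + re.escape(variant) + r"\b"
            # A function replacement inserts the canonical spelling verbatim.
            text = re.sub(pattern, lambda _match: canonical, text, flags=re.IGNORECASE)
    return text


# Same shape as the mapping used with custom spelling: wanted spelling, then variants.
spelling_map = {"CloudGuard": ["cloud guard", "Cloudguard"]}
raw = "We need to migrate the cloud guard SSO integration to v3.2"
print(apply_custom_spelling(raw, spelling_map))
```

&lt;p&gt;Running a pass like this over a handful of past transcripts is a quick way to check a spelling map for unintended matches before you send it with a request.&lt;/p&gt;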

&lt;h3&gt;
  
  
  &lt;strong&gt;Contact centers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Contact centers process thousands of calls daily, and the critical information is often the most domain-specific: account numbers, product names, company-specific terminology, policy references. When a customer says their policy number or a specific product name, that entity needs to be captured exactly right for downstream analytics, compliance monitoring, and automated workflows to function. Effective &lt;a href="https://www.assemblyai.com/blog/conversation-intelligence" rel="noopener noreferrer"&gt;conversation intelligence&lt;/a&gt; depends on getting these entities right.&lt;/p&gt;

&lt;p&gt;The combination of keyterms prompting (for company-specific terms) and dynamic mid-stream updates (adjusting terms as the call progresses through different stages) gives contact center applications the flexibility to maintain high accuracy across diverse call types.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Looking forward&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The gap between AI transcription and human transcription for domain-specific terminology is closing fast — but it's closing because of purpose-built features, not because general models are magically getting better at rare vocabulary. Medical Mode, keyterms prompting, and SpeechLLM prompting represent a fundamentally different approach than trying to train a single model on everything.&lt;/p&gt;

&lt;p&gt;What's changing is that these specialized capabilities are becoming easier to access. A few years ago, getting clinical-grade transcription accuracy meant building custom models, maintaining specialized vocabularies, and running expensive infrastructure. Now it's a single parameter: domain="medical-v1". The complexity is moving from the developer's plate into the platform.&lt;/p&gt;

&lt;p&gt;For teams building products that depend on accurate transcription of specialized vocabulary — whether that's medication names, legal citations, or engineering jargon — the most important decision isn't which model has the best overall WER. It's whether your speech-to-text provider gives you the tools to optimize for the specific terms that matter in your domain.&lt;/p&gt;

&lt;p&gt;The accuracy is there. The tools exist. The question is whether you're using them.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Which speech-to-text API has the highest accuracy for technical terminology?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's Universal-3 Pro delivers the lowest Word Error Rate on English audio at 5.9% and achieves the best entity recognition accuracy across categories including names, locations, medical terms, emails, URLs, and phone numbers. For medical terminology specifically, Universal-3 Pro with Medical Mode achieves a 3.2% Missed Entity Rate — compared to 8.7% for Deepgram Nova-3 Medical and 24.4% for AWS Transcribe Medical. Keyterms prompting lets you boost accuracy for up to 1,000 domain-specific terms per request.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I improve transcript accuracy for poor-quality audio?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Start with the highest quality input you can get — invest in good microphones and noise-canceling technology. Then layer AssemblyAI's accuracy features: use keyterms prompting to give the model guidance on which domain terms to expect, enable Medical Mode if you're working with clinical audio, and use prompting to provide context about the audio domain. For streaming medical audio, increase silence thresholds to prevent premature turn boundaries that break context. Combining multiple features provides more resilience against poor audio than any single approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does AssemblyAI compare to Deepgram for medical transcription accuracy?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;On medical entity recognition, AssemblyAI Universal-3 Pro with Medical Mode achieves a 3.2% Missed Entity Rate versus Deepgram Nova-3 Medical at 8.7% — meaning AssemblyAI misses significantly fewer medication names, dosages, and clinical terms. On Word Error Rate for medical audio, AssemblyAI delivers 5.3% versus Deepgram's 5.9%. Medical Mode is available for both pre-recorded and streaming transcription, supports four languages, and combines with keyterms prompting and speaker diarization for clinical documentation workflows. AssemblyAI offers a Business Associate Agreement (BAA) for customers who need to process Protected Health Information (PHI).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How accurate is AI transcription for non-English languages?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Universal-3 Pro supports six languages natively (English, Spanish, Portuguese, French, German, Italian) with code-switching, meaning it handles multilingual audio where speakers switch languages mid-conversation. For broader language coverage, using "speech_models": ["universal-3-pro", "universal-2"] provides access to 99 languages. Medical Mode supports English, Spanish, German, and French for medical terminology. For non-English technical content, keyterms prompting works across all supported languages to boost recognition of domain-specific terms.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I inject custom vocabulary or domain-specific terms in transcripts?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI provides three approaches. &lt;strong&gt;Keyterms prompting&lt;/strong&gt; lets you pass up to 1,000 domain terms that the model prioritizes during transcription; this is the most effective method for boosting recognition of specific words. &lt;strong&gt;Custom spelling&lt;/strong&gt; uses a find-and-replace approach to enforce exact formatting of terms in the final transcript (e.g., ensuring "Sequel" renders as "SQL"). &lt;strong&gt;Prompting&lt;/strong&gt; with Universal-3 Pro provides natural language instructions that set domain context and guide transcription style. For maximum accuracy, combine keyterms with Medical Mode or prompting rather than relying on any single feature.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Does AssemblyAI support HIPAA requirements for medical transcription?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI enables covered entities and their business associates subject to HIPAA to use AssemblyAI services to process protected health information (PHI). AssemblyAI offers a Business Associate Agreement (BAA) and is SOC 2 Type 2, ISO 27001:2022, and PCI DSS v4.0 certified. Medical Mode does not change existing data handling or retention policies. For BAA setup or enterprise pricing, contact the AssemblyAI sales team.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>speechrecognition</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How is speaker embedding used in voice recognition for transcripts?</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 09 Jun 2026 16:48:16 +0000</pubDate>
      <link>https://dev.to/martschweiger/how-is-speaker-embedding-used-in-voice-recognition-for-transcripts-4p60</link>
      <guid>https://dev.to/martschweiger/how-is-speaker-embedding-used-in-voice-recognition-for-transcripts-4p60</guid>
      <description>&lt;p&gt;Record a meeting with four people and hit transcribe. What you get back is a wall of text—every word captured, but no way to tell who said what. It's like reading a screenplay where someone erased all the character names. You can figure it out if you squint and cross-reference, but that defeats the entire purpose of automated transcription.&lt;/p&gt;

&lt;p&gt;Speaker embedding is the technology that solves this. It's the mechanism behind the "who spoke when?" capability you see in modern &lt;a href="https://www.assemblyai.com/products/speech-to-text" rel="noopener noreferrer"&gt;speech-to-text&lt;/a&gt; systems. And understanding how it works isn't just academic—it directly impacts the quality of transcripts you ship in production.&lt;/p&gt;

&lt;p&gt;This article breaks down exactly how speaker embeddings power voice recognition in transcripts, walks through the full diarization pipeline, compares the main architectural approaches, and shows you how to implement it with working code.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What are speaker embeddings?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A speaker embedding is a high-dimensional numerical representation of someone's unique vocal characteristics. Think of it as a mathematical fingerprint for a voice—a compact vector that captures everything distinctive about how a person sounds.&lt;/p&gt;

&lt;p&gt;What goes into that fingerprint? Pitch, timbre, cadence, speaking rhythm, resonance patterns, the shape of vowel formants, even the way someone transitions between consonants. Your fundamental frequency typically sits between 85 and 180 Hz if you're male and between 165 and 255 Hz if you're female, but the embedding captures far more than just pitch. It encodes how energy distributes across different frequencies, your prosodic patterns (where you place stress in sentences, how your intonation rises and falls), and the spectral characteristics that result from your unique vocal tract shape.&lt;/p&gt;

&lt;p&gt;The concept has roots in earlier speaker recognition research. I-vectors found early success by mapping variable-length audio segments to fixed-length vectors in a total variability space. They worked, but they had limitations—particularly with short audio segments and noisy conditions.&lt;/p&gt;

&lt;p&gt;Modern approaches use neural network-based audio embeddings called d-vectors. Instead of statistical models, a deep neural network learns to produce embeddings that cluster similar voices together and push different voices apart in the embedding space. The result is dramatically better performance, especially on the short utterances and messy real-world audio that i-vectors struggled with.&lt;/p&gt;

&lt;p&gt;Here's the conceptual pipeline at a high level:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Audio segments go in&lt;/li&gt;
&lt;li&gt;An AI model processes each segment&lt;/li&gt;
&lt;li&gt;Embedding vectors come out&lt;/li&gt;
&lt;li&gt;Similar embeddings get clustered together—each cluster represents one speaker&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's the 30-second version. The actual implementation involves four distinct stages, and each one matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The four-step diarization pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/blog/what-is-speaker-diarization-and-how-does-it-work" rel="noopener noreferrer"&gt;Speaker diarization&lt;/a&gt;—the process of determining "who spoke when" in an audio recording—relies on speaker embeddings as its core technology. The full pipeline involves four steps that work together to transform raw audio into speaker-labeled transcripts.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Audio segmentation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The first step breaks the audio into individual utterances. These are typically between 0.5 and 10 seconds of speech, segmented based on silence gaps, punctuation markers, and acoustic changes like shifts in tone or pitch.&lt;/p&gt;

&lt;p&gt;Why these bounds? A single word rarely gives even a human enough context to identify a speaker, let alone an AI model, while a long stretch of audio is likely to contain more than one voice. The system needs enough audio to extract meaningful vocal characteristics, but not so much that multiple speakers end up in the same segment.&lt;/p&gt;

&lt;p&gt;There's an important accuracy threshold here. Research shows that diarization accuracy drops measurably when utterances are under one second. The optimal range sits between 1 and 10 seconds per utterance, with 0.5 seconds as the minimum for basic detection. In streaming diarization, if a turn contains less than approximately one second of audio, it may be labeled as "UNKNOWN" because there isn't enough signal to generate a reliable embedding.&lt;/p&gt;
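
&lt;p&gt;The duration thresholds above can be turned into a simple pre-check. This is a minimal sketch, not part of any SDK: the bucket names and the classify_utterance helper are hypothetical, and only the 0.5, 1, and 10 second boundaries come from the text.&lt;/p&gt;

```python
from bisect import bisect_right

# Boundaries in seconds, taken from the text: 0.5 s minimum, 1 to 10 s optimal.
BOUNDARIES = [0.5, 1.0, 10.0]
# Hypothetical bucket names, one per interval between the boundaries.
LABELS = ["too_short", "usable_but_weak", "optimal", "too_long"]


def classify_utterance(start_s, end_s):
    """Return a quality bucket for one utterance based on its duration.

    bisect_right places a duration that equals a boundary in the higher
    bucket, so exactly 0.5 s counts as usable and exactly 10.0 s as too long.
    """
    duration = end_s - start_s
    return LABELS[bisect_right(BOUNDARIES, duration)]


# Hypothetical (start, end) times in seconds for four utterances.
segments = [(0.0, 0.3), (0.3, 1.1), (1.1, 4.6), (4.6, 17.0)]
buckets = [classify_utterance(start, end) for start, end in segments]
print(buckets)
```

&lt;p&gt;A check like this could flag turns that are likely to come back with an unstable or "UNKNOWN" label before you rely on them downstream.&lt;/p&gt;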

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Speaker embedding generation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Each utterance now passes through a deep learning model that's been trained specifically to produce embeddings capturing unique vocal characteristics. The model examines spectral features, frequency patterns, vocal tract resonance, and temporal speaking patterns—then compresses all of that into a numerical vector.&lt;/p&gt;

&lt;p&gt;The key insight is that this model has been trained on massive datasets of labeled speech, so it's learned which acoustic features actually distinguish one speaker from another and which features are just noise. Two recordings of the same person saying completely different words should produce similar embeddings. Two different people saying the exact same words should produce different embeddings.&lt;/p&gt;
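
&lt;p&gt;To make "similar embeddings" concrete, here is a small sketch that compares vectors with cosine similarity, one common choice for this kind of comparison. The four-dimensional vectors and speaker names are made up for illustration; real embedding models emit much larger vectors.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: two clips from one speaker, one clip from another.
alice_clip_1 = [0.9, 0.1, 0.3, 0.0]
alice_clip_2 = [0.8, 0.2, 0.4, 0.1]
bob_clip = [0.1, 0.9, 0.0, 0.5]

same_speaker = cosine_similarity(alice_clip_1, alice_clip_2)
different_speaker = cosine_similarity(alice_clip_1, bob_clip)
print(round(same_speaker, 3), round(different_speaker, 3))
```

&lt;p&gt;The same-speaker pair scores close to 1, while the cross-speaker pair scores much lower, which is the property the clustering step relies on.&lt;/p&gt;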

&lt;p&gt;This is where the quality of the embedding model matters enormously. A better model means tighter clusters for same-speaker segments and wider separation between different speakers—which directly translates to more accurate transcripts.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Speaker count estimation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here's where it gets interesting. Modern diarization models automatically predict the number of speakers in a recording. Legacy systems required you to specify this upfront ("there are four speakers in this meeting"), but that's rarely practical in production—you often don't know how many people will speak.&lt;/p&gt;

&lt;p&gt;The strategy is counterintuitive but effective: overestimate first, then merge. The system initially estimates the highest number of speakers that could reasonably be present. Why? Because it's much easier to combine the utterances of one speaker that's been incorrectly split into two than it is to disentangle two speakers that have been incorrectly merged into one. Splitting is reversible; merging often isn't.&lt;/p&gt;

&lt;p&gt;After the initial overestimate, the system goes back and combines or separates speakers as needed to arrive at an accurate count. AssemblyAI's diarization achieves a 2.9% speaker count error rate—meaning it correctly identifies the number of speakers in 97.1% of audio files.&lt;/p&gt;
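
&lt;p&gt;The overestimate-then-merge idea can be sketched with a greedy pass over cluster centroids. This is an illustrative simplification, not AssemblyAI's actual algorithm: the merge_similar_clusters helper, the 0.95 threshold, and the two-dimensional centroids are all assumptions.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def merge_similar_clusters(centroids, threshold=0.95):
    """Greedily fold each centroid into the first kept cluster it resembles."""
    merged = []  # each entry is [running_sum_vector, member_count]
    for centroid in centroids:
        for entry in merged:
            mean = [value / entry[1] for value in entry[0]]
            if cosine_similarity(mean, centroid) >= threshold:
                entry[0] = [s + c for s, c in zip(entry[0], centroid)]
                entry[1] += 1
                break
        else:
            # No existing cluster is close enough, so keep this one separate.
            merged.append([list(centroid), 1])
    return [[value / count for value in total] for total, count in merged]


# Hypothetical over-split result: four clusters for what are really two voices.
initial = [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0], [0.05, 0.99]]
speakers = merge_similar_clusters(initial)
print(len(initial), "clusters merged into", len(speakers))
```

&lt;p&gt;Starting high and merging is forgiving in this sketch because a merge only needs centroids, whereas splitting a wrongly merged cluster would require going back to the individual utterances.&lt;/p&gt;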

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Clustering and assignment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, the embeddings get clustered into groups based on similarity. If the model predicts four speakers, it forces the embeddings into four groups. Each cluster represents a unique speaker.&lt;/p&gt;

&lt;p&gt;Picture it as dots on a chart. Each dot is an utterance's embedding. Utterances from the same speaker naturally cluster together because their embeddings are similar in the high-dimensional space. The clustering algorithm identifies these natural groupings and assigns speaker labels—Speaker A, Speaker B, and so on—to each cluster.&lt;/p&gt;

&lt;p&gt;There are multiple ways to determine embedding similarity, and this is a core component of accurate speaker label prediction. Two common approaches are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;K-Means Clustering&lt;/strong&gt;: Uses K-Means++ to initialize the cluster centroids; the speaker count is estimated from the conditional Mean Squared Cosine Distances between each embedding and its cluster centroid&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spectral Clustering&lt;/strong&gt;: Constructs an affinity matrix, performs refinement operations, then uses eigen-decomposition and K-Means on the resulting embeddings to produce speaker labels&lt;/li&gt;
&lt;/ul&gt;
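
&lt;p&gt;As a rough illustration of the K-Means variant, the sketch below clusters unit-normalized embeddings by cosine similarity. It is a simplified stand-in under stated assumptions: the speaker count k is given rather than estimated, seeding uses a deterministic farthest-point rule instead of the randomized K-Means++ procedure, and the embeddings are hypothetical two-dimensional vectors.&lt;/p&gt;

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    length = math.hypot(*vector)
    return [x / length for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_kmeans(embeddings, k, iterations=10):
    """Cluster embeddings by cosine similarity and return one label per item."""
    points = [normalize(e) for e in embeddings]
    # Deterministic farthest-point seeding (simplified stand-in for K-Means++).
    centroids = [points[0]]
    for _ in range(k - 1):
        farthest = min(points, key=lambda p: max(dot(p, c) for c in centroids))
        centroids.append(farthest)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its most similar centroid.
        labels = [max(range(k), key=lambda j: dot(p, centroids[j])) for p in points]
        # Update step: re-average and re-normalize every non-empty cluster.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return labels


# Hypothetical utterance embeddings from two speakers.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
labels = cosine_kmeans(embeddings, k=2)
print(labels)
```

&lt;p&gt;Each integer label maps to a speaker name such as Speaker A or Speaker B in the final transcript.&lt;/p&gt;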

&lt;p&gt;After this step, you have a complete transcription with accurate speaker labels. The labels remain consistent—Speaker A stays Speaker A throughout the entire recording.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pipeline-based vs. end-to-end approaches&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The four-step pipeline described above represents the traditional approach. But there's a fundamentally different architecture gaining ground. Understanding both helps you make informed decisions about what to build on.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Pipeline-based (clustering) systems&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The pipeline approach treats diarization as a multi-stage process where each component handles a specific task in sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Voice Activity Detection (VAD)&lt;/strong&gt; —Identifies which parts of the audio contain speech versus silence or background noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Segmentation&lt;/strong&gt; —Divides speech regions into uniform chunks for processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding extraction&lt;/strong&gt; —Generates numerical representations that capture unique voice characteristics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clustering&lt;/strong&gt; —Groups similar embeddings together, with each cluster representing a unique speaker&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The advantages are clear: transparent processing, stage-specific optimization, and easier debugging. When something goes wrong, you can isolate exactly which stage failed and fix it independently.&lt;/p&gt;

&lt;p&gt;The downside? Error propagation. Mistakes in early stages cascade through the entire pipeline. If the VAD misses a speech segment, no amount of perfect clustering downstream can recover it.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;End-to-end neural systems&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;End-to-end systems use a single neural network to map raw audio directly to speaker-labeled segments without explicit intermediate stages. Often built on transformer architectures, these models learn the entire diarization process as a unified problem.&lt;/p&gt;

&lt;p&gt;The result is better handling of scenarios that pipeline systems historically struggle with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overlapping speech where two people talk simultaneously&lt;/li&gt;
&lt;li&gt;Subtle voice changes between speakers with similar vocal characteristics&lt;/li&gt;
&lt;li&gt;Brief utterances that don't contain enough audio for reliable embedding extraction in isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trade-off is less interpretability. When an end-to-end model makes an error, it's harder to diagnose why. You can't open the hood and point to a specific stage that broke.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Real-world performance gains&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The quality of the speaker embedding model at the center of either approach has a massive impact on overall accuracy. AssemblyAI's improved in-house speaker embedding model demonstrates this clearly—it achieved a 30% improvement in diarization accuracy for noisy and far-field audio scenarios, with error rates dropping from 29.1% to 20.4% in challenging conditions.&lt;/p&gt;

&lt;p&gt;The improvements extend to edge cases that previously undermined transcript quality:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Audio condition&lt;/th&gt;
&lt;th&gt;Segment length&lt;/th&gt;
&lt;th&gt;Previous model (error rate)&lt;/th&gt;
&lt;th&gt;New model (error rate)&lt;/th&gt;
&lt;th&gt;Relative improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Clean audio&lt;/td&gt;
&lt;td&gt;Very short (250ms)&lt;/td&gt;
&lt;td&gt;18.8%&lt;/td&gt;
&lt;td&gt;16.4%&lt;/td&gt;
&lt;td&gt;13% better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clean audio&lt;/td&gt;
&lt;td&gt;Short (500ms)&lt;/td&gt;
&lt;td&gt;10.4%&lt;/td&gt;
&lt;td&gt;6.4%&lt;/td&gt;
&lt;td&gt;38% better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Noisy audio&lt;/td&gt;
&lt;td&gt;Very short (250ms)&lt;/td&gt;
&lt;td&gt;46.8%&lt;/td&gt;
&lt;td&gt;26.4%&lt;/td&gt;
&lt;td&gt;44% better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Noisy audio&lt;/td&gt;
&lt;td&gt;Short (500ms)&lt;/td&gt;
&lt;td&gt;18.4%&lt;/td&gt;
&lt;td&gt;14.4%&lt;/td&gt;
&lt;td&gt;22% better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reverberant audio&lt;/td&gt;
&lt;td&gt;Mid-length (1.5s)&lt;/td&gt;
&lt;td&gt;15.2%&lt;/td&gt;
&lt;td&gt;4.4%&lt;/td&gt;
&lt;td&gt;71% better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Noise + reverb&lt;/td&gt;
&lt;td&gt;Short (500ms)&lt;/td&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;td&gt;22.8%&lt;/td&gt;
&lt;td&gt;43% better&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
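
&lt;p&gt;The improvement column is a relative reduction in error rate, which you can reproduce from the other two columns. The short sketch below recomputes it; the relative_improvement helper is a name chosen for illustration, and the figures are copied from the table.&lt;/p&gt;

```python
def relative_improvement(previous_error, new_error):
    """Percentage reduction in error rate relative to the previous model."""
    return (previous_error - new_error) / previous_error * 100


# (condition, previous error %, new error %) copied from the table above.
rows = [
    ("Clean, 250ms", 18.8, 16.4),
    ("Clean, 500ms", 10.4, 6.4),
    ("Noisy, 250ms", 46.8, 26.4),
    ("Noisy, 500ms", 18.4, 14.4),
    ("Reverberant, 1.5s", 15.2, 4.4),
    ("Noise + reverb, 500ms", 40.0, 22.8),
]

for condition, previous, new in rows:
    print(f"{condition}: {relative_improvement(previous, new):.0f}% better")
```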

&lt;p&gt;The new model also brings an 85.4% reduction in speaker count errors, which is particularly significant. Phantom speaker detections, where the model incorrectly identifies noise or acoustic artifacts as additional speakers, were one of the most frustrating failure modes for developers. Getting speaker count wrong doesn't just produce messy transcripts; it breaks downstream features that depend on knowing exactly how many participants were in a conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to use speaker diarization with AssemblyAI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Theory is useful, but you're probably here to build something. Here's how to implement speaker diarization and get speaker-labeled transcripts using AssemblyAI's API.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Basic diarization with Python&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The simplest implementation requires just a few lines. Set speaker_labels=True in your transcription config, and the API handles the entire embedding and clustering pipeline for you:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import assemblyai as aai

# Replace with your own AssemblyAI API key.
aai.settings.api_key = "YOUR_API_KEY"

audio_file = "https://assembly.ai/wildfires.mp3"

# speaker_labels=True turns on diarization for this request.
config = aai.TranscriptionConfig(
  speech_models=["universal-3-pro", "universal-2"],
  language_detection=True,
  speaker_labels=True,
)

transcript = aai.Transcriber().transcribe(audio_file, config)

# Each utterance is one uninterrupted turn from a single speaker.
for utterance in transcript.utterances:
  print(f"Speaker {utterance.speaker}: {utterance.text}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The response includes a list of utterances, where each utterance corresponds to an uninterrupted segment of speech from a single speaker. Each utterance object contains the speaker label, the transcribed text, and confidence scores.&lt;/p&gt;
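
&lt;p&gt;Once you have utterances, per-speaker analytics are a short step away. The sketch below totals talk time per speaker. It runs on a stand-in Utterance class rather than the SDK, and it assumes each utterance also carries start and end timestamps in milliseconds; check the response schema before relying on those field names.&lt;/p&gt;

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Utterance:
    """Stand-in for an SDK utterance; start and end are assumed milliseconds."""
    speaker: str
    text: str
    start: int
    end: int


def talk_time_seconds(utterances):
    """Sum speaking time in seconds for each speaker label."""
    totals = defaultdict(float)
    for utterance in utterances:
        totals[utterance.speaker] += (utterance.end - utterance.start) / 1000
    return dict(totals)


# Hypothetical utterances standing in for transcript.utterances.
sample = [
    Utterance("A", "first turn", 250, 6350),
    Utterance("B", "second turn", 6400, 9400),
    Utterance("A", "third turn", 9500, 12000),
]
totals = talk_time_seconds(sample)
print(totals)
```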

&lt;h3&gt;
  
  
  &lt;strong&gt;Setting a speaker range&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When you know approximately how many speakers to expect, you can help the model by specifying a range. This is useful for scenarios like &lt;a href="https://www.assemblyai.com/solutions/contact-centers" rel="noopener noreferrer"&gt;call center recordings&lt;/a&gt; (usually two speakers) or panel discussions (three to five speakers):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config = aai.TranscriptionConfig(
  speech_models=["universal-3-pro", "universal-2"],
  language_detection=True,
  speaker_labels=True,
  speaker_options=aai.SpeakerOptions(
    min_speakers_expected=3,
    max_speakers_expected=5
  ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A word of caution: only set max_speakers_expected higher than the default when you actually need it. Setting it unnecessarily high can hurt model accuracy because the clustering algorithm has a larger search space to explore.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;JavaScript SDK&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The same functionality is available in JavaScript:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { AssemblyAI } from "assemblyai";

// Replace with your own AssemblyAI API key.
const client = new AssemblyAI({
  apiKey: "YOUR_API_KEY",
});

const audioFile = "https://assembly.ai/wildfires.mp3";

const params = {
  audio: audioFile,
  speech_models: ["universal-3-pro", "universal-2"],
  language_detection: true,
  speaker_labels: true,
};

const run = async () =&amp;gt; {
  const transcript = await client.transcripts.transcribe(params);

  for (const utterance of transcript.utterances ?? []) {
    console.log(`Speaker ${utterance.speaker}: ${utterance.text}`);
  }
};

run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both SDKs handle the full lifecycle—uploading audio, waiting for processing, and returning structured results with speaker labels. The speaker embedding generation, clustering, and label assignment all happen server-side.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Optimizing speaker embedding accuracy&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Getting speaker diarization working is straightforward. Getting it working &lt;em&gt;well&lt;/em&gt; across diverse audio conditions takes some attention to detail. Here are the factors that have the biggest impact on embedding quality and diarization accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Provide the expected speaker count when you know it&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you know how many speakers are in the recording, tell the model. Use speakers_expected for an exact count, or speaker_options with min_speakers_expected and max_speakers_expected for a range. This is critical because the speaker count estimation step directly influences clustering quality—and giving the model a head start eliminates an entire category of potential errors.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Audio quality matters more than you think&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Speaker embeddings are derived from acoustic features. If those features are corrupted by noise, compression artifacts, or low sample rates, the embeddings themselves will be less discriminative. For best results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Record at 16kHz or higher sample rate&lt;/li&gt;
&lt;li&gt;Minimize background noise where possible&lt;/li&gt;
&lt;li&gt;Use directional microphones that reduce cross-talk between speakers&lt;/li&gt;
&lt;li&gt;Avoid heavy audio compression that strips high-frequency information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That said, AssemblyAI's improved speaker embedding model is specifically designed to handle real-world audio conditions. The 30% improvement in noisy environments means you don't need studio-quality recordings to get reliable results—but cleaner audio still produces tighter embeddings and more accurate speaker separation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Consider multichannel audio for perfect separation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If your recording setup captures each speaker on a separate audio channel—like a call center system with separate agent and customer channels—you can get perfect speaker separation without diarization at all. Multichannel transcription gives you guaranteed accuracy because the channel itself defines the speaker.&lt;/p&gt;

&lt;p&gt;Note that Speaker Diarization and multichannel transcription are mutually exclusive in the API. You can't enable both simultaneously—choose the approach that fits your audio source.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Streaming diarization for real-time use cases&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Speaker diarization isn't limited to pre-recorded audio. &lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;Streaming Diarization&lt;/a&gt; is available on all streaming models, including &lt;a href="https://www.assemblyai.com/universal-3-pro" rel="noopener noreferrer"&gt;Universal-3 Pro&lt;/a&gt; Streaming. Enable it by adding speaker_labels: true to your connection parameters, and each turn event includes a speaker_label field identifying the dominant speaker.&lt;/p&gt;

&lt;p&gt;One thing to know about streaming: speaker accuracy improves over the course of a session as the model accumulates embedding context. Early turns may be less stable, but the model builds richer speaker profiles as more audio flows in. For long-form conversations like call center calls or &lt;a href="https://www.assemblyai.com/blog/ambient-ai-scribe" rel="noopener noreferrer"&gt;clinical scribes&lt;/a&gt;, the model settles into accurate, stable labels well before the conversation ends.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key accuracy benchmarks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When evaluating speaker diarization quality, the industry-standard metric is Diarization Error Rate (DER)—the percentage of time incorrectly attributed to speakers, combining false alarms, missed speech, and speaker confusion errors. Lower is better.&lt;/p&gt;
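
&lt;p&gt;DER is simple arithmetic once the three error durations are measured. The sketch below shows the calculation, using total reference speech time as the denominator; the function name and the durations are hypothetical values chosen for illustration.&lt;/p&gt;

```python
def diarization_error_rate(false_alarm_s, missed_s, confusion_s, total_speech_s):
    """DER as a percentage of total reference speech time."""
    return (false_alarm_s + missed_s + confusion_s) / total_speech_s * 100


# Hypothetical durations in seconds for a ten-minute recording.
der = diarization_error_rate(
    false_alarm_s=6.0,   # non-speech labeled as speech
    missed_s=9.0,        # speech that was not detected
    confusion_s=15.0,    # speech attributed to the wrong speaker
    total_speech_s=600.0,
)
print(f"DER: {der:.1f}%")
```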

&lt;p&gt;Speaker count error rate is a complementary metric: the share of files where the predicted number of speakers is wrong. AssemblyAI achieves a 2.9% speaker count error rate on its evaluation benchmarks, which cover 205+ hours of audio including meeting recordings, &lt;a href="https://www.assemblyai.com/blog/choosing-a-stt-api-for-voice-agents" rel="noopener noreferrer"&gt;call center conversations&lt;/a&gt;, and challenging acoustic environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What's next for speaker embeddings&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Speaker embedding technology is evolving fast, and the trajectory points toward capabilities that go well beyond single-recording diarization.&lt;/p&gt;

&lt;p&gt;Speaker fingerprinting—the ability to create persistent voice signatures that identify the same person across separate recordings and sessions—is the natural extension of embedding technology. Where diarization tells you "Speaker A and Speaker B are different people in this recording," fingerprinting tells you "Speaker A in today's meeting is the same person as Speaker B from last week's call." The underlying technology is the same: extract stable vocal features, produce embeddings, compare similarity. But the applications open up dramatically when you can track speakers across time.&lt;/p&gt;

&lt;p&gt;Think about what that enables: sales platforms tracking how a specific rep's &lt;a href="https://www.assemblyai.com/blog/conversation-intelligence" rel="noopener noreferrer"&gt;conversation patterns&lt;/a&gt; evolve over months, compliance systems that verify speaker identity across recorded interactions, meeting analytics that automatically attribute contributions to named participants without manual labeling.&lt;/p&gt;

&lt;p&gt;The embedding models powering these capabilities continue to improve, with recent advances pushing reliable speaker identification down to audio segments as short as 250ms. As embeddings get more robust to noise, emotion, and the natural variability of human voices, the gap between "who spoke in this recording" and "who is this person" will continue to narrow.&lt;/p&gt;

&lt;p&gt;If you're building &lt;a href="https://www.assemblyai.com/blog/ai-voice-agents" rel="noopener noreferrer"&gt;Voice AI applications&lt;/a&gt; that need accurate speaker-labeled transcripts, &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;try our API for free&lt;/a&gt;. Speaker diarization is included at no additional cost, works across 95 languages, and the latest embedding model improvements are available to all customers automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is a speaker embedding in speech recognition?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A speaker embedding is a high-dimensional numerical vector that captures the unique vocal characteristics of a speaker, including pitch, timbre, cadence, and formant frequencies. Modern systems generate these embeddings using deep neural networks, producing what are known as d-vectors. Speaker embeddings are the core technology powering speaker diarization, enabling systems to distinguish between different voices in an audio recording.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does speaker diarization use embeddings to identify who spoke?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Speaker diarization follows a four-step pipeline: audio segmentation, embedding extraction, speaker count estimation, and clustering. The system extracts an embedding vector from each speech segment, then groups similar embeddings together so that each cluster represents one unique speaker. AssemblyAI's diarization achieves a 2.9% speaker count error rate, meaning it correctly identifies the number of speakers in over 97% of audio files.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What's the difference between pipeline-based and end-to-end speaker diarization?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Pipeline-based diarization processes audio through separate stages: voice activity detection, segmentation, embedding extraction, and clustering. This approach is transparent and easier to debug since each stage can be optimized independently. End-to-end diarization uses a single neural network to map audio directly to speaker labels, which handles overlapping speech better but is less interpretable when errors occur. Both approaches rely on speaker embeddings as the core representation for distinguishing voices.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How accurate is speaker diarization on noisy or far-field audio?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Embedding quality naturally degrades with noise, reverberation, and distance from the microphone, but modern models are improving rapidly. AssemblyAI's improved embedding model achieved 30% better diarization accuracy on noisy and far-field audio, with error rates dropping from 29.1% to 20.4% in challenging conditions. For best results, record at 16kHz or higher sample rate and use directional microphones to reduce cross-talk between speakers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can I use speaker diarization in real-time streaming transcription?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes, AssemblyAI supports streaming diarization on all streaming models, including Universal-3 Pro Streaming. Enable it by setting speaker_labels: true in your connection parameters, and each turn event will include a speaker label identifying who is speaking. Accuracy improves over the course of a session as the model accumulates more embedding context from each speaker.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I implement speaker diarization with AssemblyAI's API?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Set speaker_labels=True in your TranscriptionConfig to enable diarization. You can optionally provide speaker_options with min_speakers_expected and max_speakers_expected to improve accuracy when you know the approximate number of participants. The feature is available in both the Python and JavaScript SDKs, and speaker diarization is included at no additional cost with your API usage.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>speechrecognition</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How does context influence automatic speaker labeling?</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 09 Jun 2026 16:47:32 +0000</pubDate>
      <link>https://dev.to/martschweiger/how-does-context-influence-automatic-speaker-labeling-456b</link>
      <guid>https://dev.to/martschweiger/how-does-context-influence-automatic-speaker-labeling-456b</guid>
      <description>&lt;p&gt;&lt;a href="https://www.assemblyai.com/blog/what-is-speaker-diarization-and-how-does-it-work" rel="noopener noreferrer"&gt;Speaker diarization&lt;/a&gt; gives you labels. Speaker A said this, Speaker B said that. It's a solid start—but it's only half the problem solved.&lt;/p&gt;

&lt;p&gt;In most real-world scenarios, generic labels aren't enough. You need to know that Speaker A is "Dr. Sarah Chen" and Speaker B is "the patient." Or that Speaker A is the sales rep and Speaker B is the prospect who just asked about pricing. Without that mapping, downstream analysis hits a wall—you can't track "what did the agent say" if you don't know which speaker &lt;em&gt;is&lt;/em&gt; the agent.&lt;/p&gt;

&lt;p&gt;Context changes everything. Both the audio content itself and the metadata you provide before transcription dramatically influence how accurately speakers get labeled. The right context turns anonymous speaker clusters into named, role-assigned participants you can actually analyze.&lt;/p&gt;

&lt;p&gt;This article breaks down the three types of context that improve speaker labeling, shows you exactly how to configure each one in AssemblyAI's API, and walks through the real-world use cases where context-driven labeling matters most.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How basic speaker diarization works&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we get into context, here's a quick recap of how diarization works under the hood. The process follows a consistent pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Segmentation&lt;/strong&gt; — The audio gets divided into time-based segments based on acoustic changes like pauses, tone shifts, and pitch variations. This creates boundaries where one speaker stops and another begins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding extraction&lt;/strong&gt; — Each segment passes through an AI model that produces embeddings—numerical representations of a speaker's unique vocal characteristics, including pitch, formant frequencies, speaking rhythm, and voice timbre.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speaker count estimation&lt;/strong&gt; — The system predicts how many distinct speakers are present in the audio. Modern AI models do this automatically, unlike legacy systems that required you to specify the count upfront.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clustering&lt;/strong&gt; — The embeddings are grouped together based on similarity. Each cluster represents a distinct speaker, and all utterances in that cluster receive the same label.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result? A transcript where every utterance is tagged with a consistent speaker label—Speaker A, Speaker B, Speaker C—throughout the entire recording.&lt;/p&gt;

&lt;p&gt;That's useful. But those generic labels limit what you can do downstream. You can't run sentiment analysis on "what the customer said" if you don't know which speaker is the customer. You can't extract action items per participant if participants are just letters of the alphabet.&lt;/p&gt;

&lt;p&gt;This is where context comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How context improves speaker labeling&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Context influences speaker labeling through three distinct channels. Each one gives the system additional information to work with, and they compound—using all three together produces the best results.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Audio context: what the conversation reveals&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The content of the conversation itself carries identity signals. When someone says "Hi, I'm Dr. Chen" at the start of a recording, that's a direct cue. When another participant responds with "Thanks, Doctor, I've been having this pain in my lower back," that confirms the relationship and roles.&lt;/p&gt;

&lt;p&gt;AssemblyAI's &lt;a href="https://www.assemblyai.com/docs/speech-understanding/speaker-identification" rel="noopener noreferrer"&gt;Speaker Identification&lt;/a&gt; feature analyzes these conversational cues to infer who's speaking. It doesn't require voice enrollment or pre-recorded samples. Instead, it uses the conversation content—names mentioned, roles described, conversational dynamics—to map generic speaker labels to the identifiers you provide.&lt;/p&gt;

&lt;p&gt;This works even when introductions aren't explicit. Conversational patterns like "So, as your financial advisor, I'd recommend..." or "The defendant's counsel would like to object" give the model enough signal to assign the right roles.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Metadata context: information you provide before transcription&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is where you have the most control. Metadata you supply in the API request shapes how the model interprets what it hears. Three key types of metadata make a difference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expected speaker count&lt;/strong&gt; — Telling the model how many speakers to expect (via speakers_expected or speaker_options) constrains the clustering step. Instead of guessing, the model knows it should find exactly 2, or between 2 and 5, distinct voices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speaker names and roles&lt;/strong&gt; — Through Speaker Identification, you can provide names, roles, and descriptions for each participant. The model uses these alongside conversational cues to replace generic labels with meaningful identifiers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio channel mapping&lt;/strong&gt; — For multichannel recordings (like &lt;a href="https://www.assemblyai.com/solutions/contact-centers" rel="noopener noreferrer"&gt;contact center&lt;/a&gt; calls with separate agent and customer channels), the channel assignment itself is metadata that provides perfect speaker separation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each piece of metadata reduces ambiguity. The more the model knows going in, the less it has to infer—and the fewer mistakes it makes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Structural context: how the audio is formatted&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The physical structure of the audio recording also influences labeling accuracy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multichannel recordings&lt;/strong&gt; give you perfect speaker separation without any diarization at all. If the agent is on channel 1 and the customer is on channel 2, there's zero ambiguity about who said what.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent turn-taking patterns&lt;/strong&gt; help the model. Clean back-and-forth conversations where speakers don't talk over each other produce more accurate embeddings and cleaner cluster boundaries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio quality and microphone proximity&lt;/strong&gt; affect embedding accuracy directly. A speaker sitting close to the microphone produces clearer vocal features than someone across the room. Background noise, echoes, and cross-talk all degrade the model's ability to distinguish between voices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sufficient speech per speaker&lt;/strong&gt; matters too. Each speaker should ideally contribute at least 30 seconds of uninterrupted speech. The model struggles to create separate clusters for speakers who only contribute short phrases like "Yeah," "Right," or "Sounds good."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the format of your audio is itself a form of context. Stereo call recordings with clean separation give the system far more context than a single-channel recording from a conference room with eight people and an air conditioner running.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Speaker Identification: from generic labels to names&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Speaker Identification is the key feature that transforms context into named speakers. It replaces generic "Speaker A" and "Speaker B" labels with real names or roles—no voice enrollment needed. The system uses conversation content to infer who's speaking and applies the identifiers you provide.&lt;/p&gt;

&lt;p&gt;You have two main approaches: identify by name (when you know who's in the conversation) or identify by role (when you know the structure but not the specific people).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Role-based identification&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Role-based identification is the most common approach for contact centers, interviews, and any scenario where you know the structure of the conversation. Here's how to set it up with the Python SDK, based on &lt;a href="https://www.assemblyai.com/docs/contact-center-best-practices" rel="noopener noreferrer"&gt;contact center best practices&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import assemblyai as aai

aai.settings.api_key = "&amp;lt;YOUR_API_KEY&amp;gt;"

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    speaker_labels=True,
    speech_understanding={
        "request": {
            "speaker_identification": {
                "speaker_type": "role",
                "speakers": [
                    {"role": "Agent", "name": "Sarah Johnson",
                     "description": "Customer service representative"},
                    {"role": "Customer"}
                ]
            }
        }
    }
)

transcript = aai.Transcriber().transcribe("your_audio.mp3", config)

for utterance in transcript.utterances:
    print(f"{utterance.speaker}: {utterance.text}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Notice the description field on the Agent speaker. That extra context helps the model distinguish between participants more accurately, especially in ambiguous stretches of audio. You can add any custom properties—company, title, department—that help describe what each speaker typically discusses.&lt;/p&gt;

&lt;p&gt;Common role combinations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;["Agent", "Customer"] — Customer service calls&lt;/li&gt;
&lt;li&gt;["Interviewer", "Interviewee"] — Interview recordings&lt;/li&gt;
&lt;li&gt;["Host", "Guest"] — Podcast or show recordings&lt;/li&gt;
&lt;li&gt;["Support", "Customer"] — Technical support calls&lt;/li&gt;
&lt;li&gt;["AI Assistant", "User"] — AI chatbot interactions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Name-based identification&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When you know exactly who's in the recording—say, from a meeting calendar invite or a CRM record—you can pass their names directly. The model matches names to speakers using conversational cues from the audio itself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    language_detection=True,
    speaker_labels=True,
    speech_understanding={
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "known_values": ["Michel Martin", "Peter DeCarlo"]
            }
        }
    }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For even more accuracy, you can provide structured metadata for each speaker:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;speech_understanding={
    "request": {
        "speaker_identification": {
            "speaker_type": "name",
            "speakers": [
                {
                    "name": "Michel Martin",
                    "description": "Hosts the program and interviews the guests",
                    "company": "NPR",
                    "title": "Host Morning Edition"
                },
                {
                    "name": "Peter DeCarlo",
                    "description": "Answers questions from the interview",
                    "company": "Johns Hopkins University",
                    "title": "Professor of Environmental Health and Engineering"
                }
            ]
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The more context you provide about each speaker, the more accurately the system can match voices to identities—especially in long recordings where speakers may discuss overlapping topics.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting speaker count context&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One of the simplest and most effective forms of context is telling the model how many speakers to expect. This constrains the clustering algorithm so it doesn't have to guess, which reduces two common errors: splitting one speaker into multiple labels, or merging two similar-sounding speakers into one.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Exact count (when you're certain)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you know exactly how many speakers are in the recording—say, a 1-on-1 interview or a panel with five confirmed participants—use speakers_expected:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    language_detection=True,
    speaker_labels=True,
    speakers_expected=5,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Only use this when you're confident about the exact number. If the actual count doesn't match, the model may produce random splits of single-speaker segments or merge multiple speakers into one.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Range (safer for variable scenarios)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When you know the approximate range but not the exact count—a conference call where 2 to 5 people might speak—use speaker_options with min and max values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    language_detection=True,
    speaker_labels=True,
    speaker_options=aai.SpeakerOptions(
        min_speakers_expected=2,
        max_speakers_expected=5
    ),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is generally the safer approach. It gives the model flexibility to find the right number within your constraints rather than forcing an exact count. AssemblyAI's documentation recommends setting max_speakers_expected slightly higher than your best estimate (e.g., min_speakers_expected + 2) to allow flexibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A word of caution:&lt;/strong&gt; setting max_speakers_expected too high may reduce accuracy, causing sentences from the same speaker to be split across multiple speaker labels. If you're unsure, it's better to use a reasonable upper bound than an inflated one.&lt;/p&gt;

&lt;p&gt;The default upper limit on speaker count depends on audio duration: no max for 0 to 2 minutes, a maximum of 10 speakers for 2 to 10 minutes, and a maximum of 30 speakers for recordings over 10 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Multichannel: the ultimate context&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When your audio has separate channels per speaker—which is standard in most telephony and contact center setups—you get something better than diarization. You get &lt;em&gt;perfect speaker separation&lt;/em&gt; baked into the recording format itself.&lt;/p&gt;

&lt;p&gt;Most contact center recordings are stereo with the agent on one channel and the customer on the other. Platforms like Genesys, Twilio, Five9, NICE, and Talkdesk typically produce these dual-channel recordings. When you enable &lt;a href="https://www.assemblyai.com/products/speech-to-text" rel="noopener noreferrer"&gt;multichannel transcription&lt;/a&gt;, AssemblyAI transcribes each channel independently:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    multichannel=True,
    speaker_labels=False,  # Channels already separate speakers
    summarization=True,
    sentiment_analysis=True,
)

transcript = aai.Transcriber().transcribe("your_audio.mp3", config)

# Channel 1 = Agent, Channel 2 = Customer (typical layout)
for utterance in transcript.utterances:
    role = "Agent" if utterance.channel == "1" else "Customer"
    print(f"{role}: {utterance.text}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The benefits are significant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Perfect speaker separation&lt;/strong&gt; — No diarization errors, no speaker confusion, no overlap issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher transcription accuracy&lt;/strong&gt; — The model processes clean single-speaker audio per channel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No ambiguity&lt;/strong&gt; — Channel assignment is deterministic, not probabilistic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also combine multichannel with speaker diarization for cases where individual channels contain multiple speakers. When both are enabled, speakers are labeled with a combined format: "1A" for the first speaker on channel 1, "1B" for the second speaker on channel 1, "2A" for the first speaker on channel 2, and so on.&lt;/p&gt;
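
&lt;p&gt;If you need the channel and the per-channel speaker separately downstream, a small parser does the job. This is a sketch that assumes labels always look like the examples above (a channel number followed by a speaker letter); split_label is a hypothetical helper, not part of the SDK.&lt;/p&gt;

```python
import re

# Assumed label shape: one or more digits (channel), then letters (speaker).
LABEL = re.compile(r"(\d+)([A-Z]+)")


def split_label(label):
    """Split a combined label such as '1A' into (channel, speaker)."""
    match = LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"unexpected speaker label: {label!r}")
    return int(match.group(1)), match.group(2)


print(split_label("1A"))  # (1, 'A')
print(split_label("2B"))  # (2, 'B')
```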

&lt;p&gt;&lt;strong&gt;Important note:&lt;/strong&gt; when using multichannel with speaker_labels, the speaker_options parameters apply &lt;em&gt;per channel&lt;/em&gt;, not globally. Setting min_speakers_expected: 5 and max_speakers_expected: 7 on a 5-channel file means the model will find 5 to 7 speakers on each channel, resulting in 25 to 35 total speakers. Plan your configuration accordingly.&lt;/p&gt;
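
&lt;p&gt;The arithmetic is worth checking before you send a request. The helper below is a hypothetical convenience function, not an SDK call; it just multiplies the per-channel bounds by the channel count so you can see the total you're implicitly asking for.&lt;/p&gt;

```python
def total_speaker_bounds(channels, min_speakers_expected, max_speakers_expected):
    """Return the (min, max) total speakers implied by per-channel bounds."""
    if min_speakers_expected > max_speakers_expected:
        raise ValueError("min_speakers_expected must not exceed max_speakers_expected")
    return channels * min_speakers_expected, channels * max_speakers_expected


# The 5-channel example from the note above.
print(total_speaker_bounds(5, 5, 7))  # (25, 35)
```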

&lt;p&gt;Multichannel transcription does increase processing time by approximately 25%, but for applications where speaker attribution accuracy is critical (compliance monitoring, quality assurance, automated coaching), that tradeoff is worth it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Best practices for accurate diarization&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we dive into use cases, here are the practical tips that make the biggest difference in speaker labeling accuracy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ensure sufficient speech per speaker.&lt;/strong&gt; Each speaker should speak for at least 30 seconds uninterrupted. Short backchannels like "uh huh" and "yeah" don't give the model enough audio to build reliable speaker embeddings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimize cross-talk.&lt;/strong&gt; Overlapping speech between speakers reduces diarization accuracy. When two speakers talk simultaneously, the model assigns the turn to a single speaker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce background noise.&lt;/strong&gt; Background noise, echoes, and microphone bleed between speakers degrade embedding quality and lead to more frequent misassignments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use speaker_options over speakers_expected when uncertain.&lt;/strong&gt; Only use speakers_expected when you're confident about the exact count. Otherwise, provide a range.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be aware of speaker similarity.&lt;/strong&gt; If two speakers sound very similar—similar pitch, same accent, comparable speech patterns—the model may have difficulty distinguishing between them. Providing metadata context (names, roles, descriptions) helps resolve these cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;streaming diarization&lt;/a&gt;, keep in mind that speaker accuracy improves over time. Early in a session, assignments may be less stable, especially if the first few turns are short. As the session progresses, the model accumulates richer speaker embeddings and assignments become more consistent. For long-form use cases like call center calls, clinical scribe sessions, and meeting transcription, the model settles into accurate, stable labels well before the conversation ends.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Real-world use cases&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Context-driven speaker labeling isn't an abstract capability. It maps directly to specific industry problems that are hard to solve without it.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Contact centers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Contact centers are the most common use case for context-driven speaker labeling. The typical setup: dual-channel recordings with the agent on one channel and the customer on the other, combined with role-based Speaker Identification.&lt;/p&gt;

&lt;p&gt;With the agent properly identified, you can run sentiment analysis on just the customer's utterances to gauge satisfaction. You can calculate talk-to-listen ratios per agent. You can detect whether the agent followed the required compliance script. You can flag calls where the customer's sentiment trended negative after a specific agent response.&lt;/p&gt;
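
&lt;p&gt;As one example, here's a rough sketch of a talk-share calculation. It assumes each utterance is a plain dict with speaker, start, and end keys, with times in milliseconds; the talk_share name and the sample data are made up for illustration. With the SDK you'd read the equivalent fields off each utterance object.&lt;/p&gt;

```python
from collections import defaultdict


def talk_share(utterances):
    """Fraction of total speaking time attributed to each speaker label."""
    totals = defaultdict(int)
    for u in utterances:
        totals[u["speaker"]] += u["end"] - u["start"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {speaker: ms / grand_total for speaker, ms in totals.items()}


# Made-up sample: times are in milliseconds.
sample = [
    {"speaker": "Agent", "start": 0, "end": 6000},
    {"speaker": "Customer", "start": 6000, "end": 8000},
    {"speaker": "Agent", "start": 8000, "end": 10000},
]
print(talk_share(sample))  # {'Agent': 0.8, 'Customer': 0.2}
```

&lt;p&gt;An agent holding 80% of the talk time, as in this sample, is the kind of signal a coaching workflow might flag.&lt;/p&gt;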

&lt;p&gt;Companies like Jiminny use speaker-separated transcripts to help sales teams identify winning &lt;a href="https://www.assemblyai.com/blog/conversation-intelligence" rel="noopener noreferrer"&gt;conversation patterns&lt;/a&gt;—which questions agents ask that lead to closed deals, which objections trip them up, and where coaching would have the highest impact.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Meeting notetakers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Meeting intelligence platforms pull participant names from calendar invites and user profiles, then pass those names into Speaker Identification. The result: meeting transcripts where every statement is attributed to the right person.&lt;/p&gt;

&lt;p&gt;This is what makes features like "search for everything John said about the budget" possible. It's also what enables automated action item extraction that correctly assigns tasks to the person who committed to them, rather than lumping everything under a generic "Speaker C."&lt;/p&gt;
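
&lt;p&gt;Once utterances carry real names, that kind of search is a simple filter. The sketch below assumes utterances are plain dicts with speaker and text keys; said_about and the sample lines are hypothetical.&lt;/p&gt;

```python
def said_about(utterances, speaker, keyword):
    """Return the text of every utterance by `speaker` that mentions `keyword`."""
    needle = keyword.lower()
    return [
        u["text"]
        for u in utterances
        if u["speaker"] == speaker and needle in u["text"].lower()
    ]


# Made-up meeting excerpt.
meeting = [
    {"speaker": "John", "text": "The budget needs another review."},
    {"speaker": "Priya", "text": "I can send the budget draft today."},
    {"speaker": "John", "text": "Let's move the launch to Friday."},
]
print(said_about(meeting, "John", "budget"))
# ['The budget needs another review.']
```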

&lt;h3&gt;
  
  
  &lt;strong&gt;Healthcare&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Clinical documentation requires separating doctor from patient, and getting it right matters for the medical record. Role-based identification with ["Doctor", "Patient"] gives you clean separation. Pair that with &lt;a href="https://www.assemblyai.com/docs/pre-recorded-audio/medical-mode" rel="noopener noreferrer"&gt;Medical Mode&lt;/a&gt;—which significantly improves accuracy on clinical terminology—and you've got a pipeline that produces accurate, speaker-attributed clinical notes.&lt;/p&gt;

&lt;p&gt;The structural context matters here too. Telehealth calls recorded through platforms with separate audio channels get better results than in-person recordings from a single &lt;a href="https://www.assemblyai.com/blog/ambient-ai-scribe" rel="noopener noreferrer"&gt;ambient microphone&lt;/a&gt; in the exam room.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Podcasts and media&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Podcast producers need to know which speaker said what for searchable transcripts, show notes, and audiogram clips. Host vs. guest identification is straightforward with ["Host", "Guest"] role labeling. For multi-guest episodes, name-based identification with the guest lineup produces transcripts where every quote is correctly attributed.&lt;/p&gt;

&lt;p&gt;This enables content repurposing at scale—pulling the best quotes from a guest, creating topic-specific highlight reels, and generating per-speaker summaries without manual review.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What's next for speaker labeling&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Context-driven speaker labeling has come a long way from the days when diarization systems required you to specify the exact speaker count upfront and returned nothing but anonymous cluster IDs.&lt;/p&gt;

&lt;p&gt;Today, the combination of metadata context (names, roles, descriptions), audio context (conversational cues analyzed by Speaker Identification), and structural context (multichannel separation, speaker count hints) gives you a powerful toolkit for turning raw audio into structured, speaker-attributed transcripts.&lt;/p&gt;

&lt;p&gt;But the trajectory points toward something even more capable. As speaker embedding models improve and voice fingerprinting matures, we're moving toward persistent speaker profiles that work &lt;em&gt;across&lt;/em&gt; recordings. Imagine a system that recognizes a returning customer across multiple support calls without requiring the agent to say their name, or a meeting platform that automatically labels participants because it's heard their voices before.&lt;/p&gt;

&lt;p&gt;AssemblyAI already supports &lt;a href="https://www.assemblyai.com/docs/pre-recorded-audio/guides/titanet-speaker-identification" rel="noopener noreferrer"&gt;cross-file speaker identification&lt;/a&gt; through audio embeddings and vector databases for teams that want to build this today. The foundation is there—what's changing is how seamlessly it all works together.&lt;/p&gt;

&lt;p&gt;For now, the practical takeaway is clear: the more context you give the model, the better your speaker labels get. Whether that's providing a speaker count, passing in names from a calendar invite, using multichannel recordings, or adding role descriptions to your API request—each layer of context compounds into more accurate, more useful transcripts.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does context (like names spoken) influence automatic speaker labeling?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Context influences speaker labeling in three ways. Audio context—like when someone says "Hi, I'm Dr. Chen"—gives the model conversational cues to match voices to identities. Metadata context, such as speaker names, roles, and descriptions you provide through the API, constrains the model's predictions. And structural context, including multichannel recordings and speaker count hints, shapes how the diarization pipeline segments and clusters audio. AssemblyAI's &lt;a href="https://www.assemblyai.com/docs/speech-understanding/speaker-identification" rel="noopener noreferrer"&gt;Speaker Identification&lt;/a&gt; feature combines all three to replace generic Speaker A/B labels with real names or roles.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can the system tag speakers with their names automatically if I provide them?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes. AssemblyAI's Speaker Identification lets you pass speaker names directly in your transcription request using speaker_type: "name" with a known_values list or a more detailed speakers array. The model uses conversation content to infer who's speaking and maps voices to the names you provide—no voice enrollment or pre-recorded samples required. You can also provide role labels like "Agent" and "Customer" if you know the structure but not the specific people.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I configure the API to know how many speakers are present in the audio?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You have two options. Use speakers_expected when you're certain about the exact number of speakers—set it to the precise count. Use speaker_options with min_speakers_expected and max_speakers_expected when you know the approximate range. The range-based approach is generally safer because an incorrect exact count can cause the model to split or merge speakers incorrectly. Setting max_speakers_expected too high may also reduce accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does speaker labeling work in transcripts?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Speaker diarization works by segmenting audio into time-based chunks, extracting voice embeddings from each segment, estimating the number of speakers, and then clustering similar embeddings together. Each cluster gets a consistent label (Speaker A, Speaker B, etc.) applied throughout the transcript. The output is a list of utterances, where each utterance includes the speaker label, the transcribed text, and timestamps. Speaker Identification can then replace those generic labels with actual names or roles.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I separate each speaker in a meeting transcript?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Set speaker_labels=True in your transcription configuration. The model will automatically detect distinct speakers and assign each utterance to the appropriate speaker. For better accuracy, provide a speaker count or range using speakers_expected or speaker_options. To get actual names instead of generic labels, add Speaker Identification with names pulled from your calendar invite or meeting platform. For recordings where each participant is on a separate audio channel, use multichannel=True for perfect speaker separation without diarization.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What's the difference between speaker diarization and Speaker Identification?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Speaker diarization segments audio by voice and assigns generic labels—Speaker A, Speaker B—without knowing who anyone is. It answers "who spoke when" but not "who is each speaker." Speaker Identification goes further by using conversation content and metadata you provide to replace those generic labels with actual names or roles. Diarization is the foundation layer; identification builds on top of it. You need speaker_labels=True enabled before Speaker Identification can work.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>speechrecognition</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to build a voice agent for IT helpdesk and technical support</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 09 Jun 2026 16:47:23 +0000</pubDate>
      <link>https://dev.to/martschweiger/how-to-build-a-voice-agent-for-it-helpdesk-and-technical-support-13p0</link>
      <guid>https://dev.to/martschweiger/how-to-build-a-voice-agent-for-it-helpdesk-and-technical-support-13p0</guid>
      <description>&lt;p&gt;Tier-1 IT support is the most universal use case in the enterprise, and it’s the one people still reach by phone. The reason is almost funny: the moment someone genuinely needs IT, the channels you’d rather they use are often the ones that are broken. You can’t open the self-service portal when you’re locked out of SSO. You can’t chat the helpdesk when the VPN is down and the chat tool is behind it. So the phone rings — “I can’t log in,” “my laptop won’t connect to the network,” “what’s the status on my ticket” — and a human reads from the same runbook for the four-hundredth time this month.&lt;/p&gt;

&lt;p&gt;That repetitive front line is exactly what a voice agent should absorb. This tutorial builds an IT helpdesk &lt;a href="https://www.assemblyai.com/solutions/voice-agents" rel="noopener noreferrer"&gt;voice agent&lt;/a&gt; on the &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;AssemblyAI Voice Agent API&lt;/a&gt; that does four things real support lines need: it &lt;strong&gt;routes by issue type&lt;/strong&gt;, &lt;strong&gt;looks up answers in your internal knowledge base&lt;/strong&gt;, &lt;strong&gt;creates and checks tickets&lt;/strong&gt; in your ITSM system, and &lt;strong&gt;escalates to a human&lt;/strong&gt; when an issue needs one. We’ll build it on the managed Voice Agent API first because that’s the fastest path to something running, then show the bring-your-own-key (BYOK) alternative for teams that need a specific LLM or a cloned brand voice. A runnable repository is linked at the end.&lt;/p&gt;

&lt;p&gt;One framing to set up front, because it drives every design decision below: &lt;strong&gt;in IT support, the speech-to-text layer is the part most likely to break the whole interaction.&lt;/strong&gt; Helpdesk speech is dense with alphanumerics — ticket numbers like INC0012345, error codes like 0x80070005, asset tags, employee IDs, VLAN numbers, license keys. A general-purpose transcription model fumbles exactly those strings, and a single wrong character means the agent looks up the wrong ticket or files a useless one. &lt;a href="https://www.assemblyai.com/universal-3-pro-streaming" rel="noopener noreferrer"&gt;Universal-3 Pro Streaming&lt;/a&gt; is tuned for this: 21% fewer alphanumeric errors and 28% better accuracy on consecutive numbers than the previous generation. That’s not a vanity metric for a helpdesk — it’s the difference between a contained call and an escalated one.&lt;/p&gt;
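
&lt;p&gt;Even with a strong transcription model, it's worth validating these strings before they reach your ITSM. Here's a minimal sketch, assuming ticket IDs follow the INC-plus-seven-digits shape of the example above and error codes are 0x followed by eight hex digits. Both patterns and both function names are assumptions; adjust them to whatever your systems actually issue.&lt;/p&gt;

```python
import re

# Assumed formats, taken from the examples in the text (INC0012345, 0x80070005).
TICKET = re.compile(r"INC\d{7}")
ERROR_CODE = re.compile(r"0x[0-9A-F]{8}")


def normalize_ticket(spoken):
    """Strip spaces and hyphens, uppercase, and return the ID only if it is valid."""
    candidate = re.sub(r"[\s-]", "", spoken).upper()
    return candidate if TICKET.fullmatch(candidate) else None


def normalize_error_code(spoken):
    """Strip whitespace, normalize case, and return the code only if it is valid."""
    candidate = re.sub(r"\s", "", spoken)
    if candidate[:2].lower() == "0x":
        candidate = "0x" + candidate[2:].upper()
    return candidate if ERROR_CODE.fullmatch(candidate) else None


print(normalize_ticket("inc 0012345"))       # INC0012345
print(normalize_ticket("INC001234"))         # None (one digit short)
print(normalize_error_code("0x 8007 0005"))  # 0x80070005
```

&lt;p&gt;When a function like this returns None, the agent can ask the caller to repeat the number instead of looking up the wrong ticket.&lt;/p&gt;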

&lt;h2&gt;
  
  
  &lt;strong&gt;What the agent needs to do&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Strip an IT helpdesk call down and there are four jobs. Map each to a concrete capability:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;th&gt;How the agent does it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Route&lt;/strong&gt; the caller by issue type&lt;/td&gt;
&lt;td&gt;The LLM classifies the request from the system prompt's routing rules, then picks the right tool or escalation queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Answer&lt;/strong&gt; known questions&lt;/td&gt;
&lt;td&gt;A search_knowledge_base tool returns grounded snippets from your runbooks and docs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Act&lt;/strong&gt; in the ITSM system&lt;/td&gt;
&lt;td&gt;create_ticket and check_ticket_status tools call ServiceNow, Jira Service Management, or Zendesk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Escalate&lt;/strong&gt; when needed&lt;/td&gt;
&lt;td&gt;A terminal escalate_to_human tool hands off to the right queue with a written summary&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;Voice Agent API&lt;/a&gt; gives you the conversation loop — speech-to-text, the LLM, text-to-speech, turn detection, and tool calling — over a single WebSocket at a flat $4.50/hour. Your job is to define the four tools and write a system prompt that knows when to use them. Everything else (transcription accuracy, interruption handling, generating speech) is handled inside that one connection.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Caller --PSTN--&amp;gt; Twilio number
                      |
                      |  &amp;lt;Connect&amp;gt;&amp;lt;Stream&amp;gt; (bidirectional, audio/pcmu)
                      v
              bridge_server.py  --ws--&amp;gt;  Voice Agent API
                      |                  STT (Universal-3 Pro) + LLM + TTS
                      |                  + turn detection + tool calling
                      |                          |
                      |        tool.call --------|
                      |   +--------------+-------+-----------+-----------+
                      |   v              v               v               v
                      | search_kb    create_ticket   check_status   escalate_to_human
                      | (interactive) (interactive)  (interactive)   (hold -&amp;gt; transfer)
                      |   |               |               |               |
                      |   +---- tool.result (JSON) -------+          terminal: no result,
                      |         returned on reply.done                hand call to a queue
                      v
                Your KB + ITSM (ServiceNow / Jira SM / Zendesk)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The bridge server is thin: it shuttles audio between Twilio and the Voice Agent API, and when the agent fires a tool call, it runs the work and returns the result. The three non-terminal tools (search_knowledge_base, create_ticket, check_ticket_status) follow a request/response round-trip — the agent asks, you answer, the conversation continues. The fourth, escalate_to_human, is terminal: it ends the agent’s session and hands the call to a person.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Before you start&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You’ll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI account&lt;/a&gt; with Voice Agent API access&lt;/li&gt;
&lt;li&gt;A Twilio account with a voice-capable phone number&lt;/li&gt;
&lt;li&gt;Read/write API access to your ITSM (ServiceNow, Jira Service Management, or Zendesk) and a searchable knowledge base&lt;/li&gt;
&lt;li&gt;Python 3.11+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install the dependencies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install fastapi uvicorn "websockets&amp;gt;=14" python-dotenv twilio httpx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Define the helpdesk tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Tools are JSON Schema function definitions you pass in the session config. The Voice Agent API emits a tool.call event when the LLM decides to use one; you run the work and send back a tool.result. Good tool descriptions are the highest-leverage thing you’ll write: the description states &lt;em&gt;when to call this, and when not to&lt;/em&gt;, in plain language, because that’s the only instruction the model gets at call time.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tools.py

SEARCH_KB = {
    "type": "function",
    "name": "search_knowledge_base",
    "description": (
        "Search the internal IT knowledge base for how-to steps and known fixes. "
        "Call this for any 'how do I' or 'why is X happening' question: VPN setup, "
        "password policy, software install steps, printer config, known outages. "
        "Do NOT call this to look up a specific ticket; use check_ticket_status for that."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The user's question, rephrased as a search query.",
            },
            "category": {
                "type": "string",
                "enum": ["access", "network", "hardware", "software", "security"],
            },
        },
        "required": ["query"],
    },
    "execution_mode": "interactive",
    "timeout_seconds": 15,
}

CREATE_TICKET = {
    "type": "function",
    "name": "create_ticket",
    "description": (
        "Open a new support ticket. Call this only after you have a clear, "
        "one-line problem description and the caller's employee ID. Read the "
        "returned ticket number back to the caller exactly as digits."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "employee_id": {"type": "string"},
            "category": {
                "type": "string",
                "enum": ["access", "network", "hardware", "software", "security"],
            },
            "priority": {"type": "string", "enum": ["low", "normal", "high", "urgent"]},
            "summary": {"type": "string", "description": "One-line problem statement."},
        },
        "required": ["employee_id", "category", "summary"],
    },
    "execution_mode": "interactive",
    "timeout_seconds": 15,
}

CHECK_STATUS = {
    "type": "function",
    "name": "check_ticket_status",
    "description": (
        "Look up the status of an existing ticket by its number "
        "(for example INC0012345). Use this whenever the caller asks "
        "'what's the status of my ticket' or gives you a ticket number."
    ),
    "parameters": {
        "type": "object",
        "properties": {"ticket_id": {"type": "string"}},
        "required": ["ticket_id"],
    },
    "execution_mode": "interactive",
    "timeout_seconds": 15,
}

ESCALATE = {
    "type": "function",
    "name": "escalate_to_human",
    "description": (
        "Transfer the caller to a human technician. Call this when the caller "
        "asks for a person, when the issue is a security incident, when a system "
        "is down for many users, or when you cannot resolve the issue after two "
        "attempts. Always write a one-line summary first."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "queue": {
                "type": "string",
                "enum": ["service_desk", "identity_access", "network", "security", "on_call"],
                "description": "Which team should take the call.",
            },
            "summary": {
                "type": "string",
                "description": (
                    "One sentence a technician can read in two seconds. Example: "
                    "'Dana Okafor (emp 4471) locked out of SSO after a password "
                    "reset, MFA not arriving, needs urgent access.'"
                ),
            },
        },
        "required": ["queue", "summary"],
    },
    "execution_mode": "hold",
}

HELPDESK_TOOLS = [SEARCH_KB, CREATE_TICKET, CHECK_STATUS, ESCALATE]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Two design points worth calling out. First, the &lt;strong&gt;category&lt;/strong&gt; &lt;strong&gt;enum is your routing key&lt;/strong&gt; — it appears in the KB search, the ticket, and the escalation queue, so the same classification the LLM makes for one tool carries through the whole call. Second, &lt;strong&gt;escalate_to_human&lt;/strong&gt; &lt;strong&gt;uses execution_mode: "hold"&lt;/strong&gt; while the other three use "interactive". Interactive tools are expected to resolve in a few seconds and the agent waits silently. A transfer is different — it takes longer and the agent should keep the line warm (“Let me get a technician for you — one moment”), which is what hold mode is for. The &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/tool-calling" rel="noopener noreferrer"&gt;tool-calling docs&lt;/a&gt; cover the execution modes in full.&lt;/p&gt;
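
&lt;p&gt;The category and queue enums overlap but are not identical, so it can help to keep a fallback mapping in the bridge for cases where you want a default queue for a given category. This is a minimal sketch: the CATEGORY_TO_QUEUE table, the default_queue helper, and the after-hours rule are hypothetical and not part of the Voice Agent API.&lt;/p&gt;

```python
# Hypothetical fallback table: ticket category -> escalation queue.
# Keys mirror the category enum; values mirror the queue enum in ESCALATE.
CATEGORY_TO_QUEUE = {
    "access": "identity_access",
    "network": "network",
    "hardware": "service_desk",
    "software": "service_desk",
    "security": "security",
}


def default_queue(category=None, after_hours=False):
    """Pick a queue for a category; unknown or missing categories go to service_desk."""
    queue = CATEGORY_TO_QUEUE.get(category, "service_desk")
    # Assumption: outside business hours, network outages page the on-call team.
    if after_hours and queue == "network":
        return "on_call"
    return queue


print(default_queue("access"))                     # identity_access
print(default_queue("network", after_hours=True))  # on_call
```

&lt;p&gt;The LLM still chooses the queue through the tool call; a table like this is only a safety net for arguments that arrive missing or malformed.&lt;/p&gt;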

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: The tool execution loop&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the piece the conversational tutorials gloss over and the piece you have to get exactly right. The contract is specific: the agent sends a tool.call, and you return a tool.result keyed by the same call_id, with the result encoded as a &lt;strong&gt;JSON string&lt;/strong&gt; (not a raw object). The timing matters too — you return results when the agent is idle, not the instant the call arrives, so you never interrupt audio that’s still playing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tool_loop.py
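import json

from integrations import kb_search, itsm_create_ticket, itsm_get_ticket

async def run_tool(call: dict) -&amp;gt; dict:
    """Dispatch one tool.call to its handler. Returns a plain dict."""
    name = call["name"]
    # Defensive: accept arguments as a JSON string or an already-parsed object.
    raw_args = call.get("arguments") or {}
    args = json.loads(raw_args) if isinstance(raw_args, str) else raw_args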

    if name == "search_knowledge_base":
        return await kb_search(args["query"], args.get("category"))
    if name == "create_ticket":
        return await itsm_create_ticket(args)
    if name == "check_ticket_status":
        return await itsm_get_ticket(args["ticket_id"])
    return {"error": f"unknown tool {name}"}

async def flush_pending(va_ws, pending, last_type):
    """Return results only at a reply.done boundary, so we never
    cut off audio the agent is still speaking."""
    if last_type != "reply.done" or not pending:
        return
    for call in pending:
        result = await run_tool(call)
        await va_ws.send(json.dumps({
            "type": "tool.result",
            "call_id": call["call_id"],
            "result": json.dumps(result),   # result must be a JSON STRING
        }))
    pending.clear()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The flow is: stash each tool.call in a pending list, and on the next reply.done, run the work and send every result. If the caller barges in (input.speech.started) before you’ve answered, drop the pending calls — the caller has moved on, and answering a stale question is worse than not answering it.&lt;/p&gt;
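
&lt;p&gt;The stash, flush, and drop rules can be sketched as a small state holder that you can unit-test without a live session. The PendingCalls class is a hypothetical helper for illustration; only the event type names come from the flow described above.&lt;/p&gt;

```python
class PendingCalls:
    """Tracks tool.call events until it is safe to answer them (hypothetical helper)."""

    def __init__(self):
        self.calls = []

    def handle(self, event):
        """Feed one Voice Agent event; return the calls that should be answered now."""
        kind = event.get("type")
        if kind == "tool.call":
            self.calls.append(event)          # stash, do not answer yet
        elif kind == "input.speech.started":
            self.calls.clear()                # caller barged in: drop stale calls
        elif kind == "reply.done":
            ready, self.calls = self.calls, []
            return ready                      # agent is idle: answer everything
        return []


queue = PendingCalls()
queue.handle({"type": "tool.call", "call_id": "a", "name": "check_ticket_status"})
print(len(queue.handle({"type": "reply.done"})))   # 1: one call is ready to answer
queue.handle({"type": "tool.call", "call_id": "b", "name": "search_knowledge_base"})
queue.handle({"type": "input.speech.started"})
print(queue.handle({"type": "reply.done"}))        # [] because the caller barged in
```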

&lt;p&gt;escalate_to_human is the exception to all of this. &lt;strong&gt;You never send a tool.result for it.&lt;/strong&gt; Transferring the call ends the Twilio media stream, which tears down the Voice Agent session, so there’s no live session left to receive a result. It’s a &lt;em&gt;terminal&lt;/em&gt; tool. The deep mechanics of doing the transfer warmly (keeping a live transcript running for the technician across a conference bridge) are their own subject; the short version is that you detect the call, package the summary, and hand off at the telephony layer. The pattern below does the hand-off; the &lt;a href="https://www.assemblyai.com/solutions/voice-agents" rel="noopener noreferrer"&gt;voice agents solution page&lt;/a&gt; covers the warm-handoff variant.&lt;/p&gt;
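
&lt;p&gt;The “package the summary” step might look like the sketch below. The build_handoff helper, its field names, and the default queue are assumptions for illustration; the actual transfer call to your telephony provider is left out.&lt;/p&gt;

```python
import json


def build_handoff(call_sid, arguments):
    """Package what the telephony layer needs to transfer the call (hypothetical helper)."""
    # Arguments may be a JSON string or an already-parsed dict; accept both.
    args = json.loads(arguments) if isinstance(arguments, str) else dict(arguments or {})
    return {
        "call_sid": call_sid,
        "queue": args.get("queue", "service_desk"),   # assumed default queue
        "summary": args.get("summary", "No summary provided."),
        "send_tool_result": False,                    # terminal tool: never answered
    }


handoff = build_handoff("CA123", '{"queue": "security", "summary": "Possible phishing click."}')
print(handoff["queue"], "-", handoff["summary"])
```

&lt;p&gt;Keeping this step as a pure function makes it easy to test the payload separately from the transfer itself.&lt;/p&gt;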

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Bridge Twilio to the Voice Agent API&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now wire it together. Twilio streams G.711 μ-law at 8 kHz, which the Voice Agent API accepts natively when you set the encoding to audio/pcmu — no transcoding, lowest latency. A few endpoint-specific details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The endpoint is wss://agents.assemblyai.com/v1/ws.&lt;/li&gt;
&lt;li&gt;Auth is Authorization: Bearer YOUR_KEY — note the &lt;strong&gt;Bearer&lt;/strong&gt; prefix, which is specific to the Voice Agent API.&lt;/li&gt;
&lt;li&gt;The first message is a session.update event with everything nested under a session object. There is no separate session.start.&lt;/li&gt;
&lt;li&gt;Wait for session.ready before sending any input.audio frames. Voice, greeting, and output format are fixed once the session is ready, so set them in this first message; see the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/session-configuration" rel="noopener noreferrer"&gt;session configuration docs&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# bridge_server.py
import asyncio, json, os
import websockets
from fastapi import FastAPI, Request, WebSocket
from fastapi.responses import Response

from prompts import SYSTEM_PROMPT, GREETING
from tools import HELPDESK_TOOLS
from tool_loop import flush_pending, run_tool
from transfer import start_transfer

VOICE_AGENT_WS = "wss://agents.assemblyai.com/v1/ws"
ASSEMBLYAI_KEY = os.environ["ASSEMBLYAI_API_KEY"]

app = FastAPI()

@app.post("/twilio/voice")
async def twilio_voice(request: Request):
    host = request.url.hostname
    # Connect the call to the /media-stream WebSocket below as a bidirectional stream.
    twiml = f"""&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;Response&amp;gt;
  &amp;lt;Connect&amp;gt;
    &amp;lt;Stream url="wss://{host}/media-stream" /&amp;gt;
  &amp;lt;/Connect&amp;gt;
&amp;lt;/Response&amp;gt;"""
    return Response(content=twiml, media_type="application/xml")

@app.websocket("/media-stream")
async def media_stream(twilio_ws: WebSocket):
    await twilio_ws.accept()
    stream_sid = {"value": None}
    call_sid = {"value": None}

    session_config = {
        "type": "session.update",
        "session": {
            "system_prompt": SYSTEM_PROMPT,
            "greeting": GREETING,
            "tools": HELPDESK_TOOLS,
            "input": {
                "format": {"encoding": "audio/pcmu"},
                # Boost the alphanumeric jargon a helpdesk hears constantly.
                "keyterms": ["Okta", "VPN", "MFA", "VLAN", "Kerberos",
                             "SSO", "ServiceNow", "Active Directory"],
                "turn_detection": {
                    "vad_threshold": 0.5,
                    "min_silence": 800,
                    "max_silence": 2500,
                    "interrupt_response": True,
                },
            },
            "output": {"voice": "ivy", "format": {"encoding": "audio/pcmu"}},
        },
    }

    async with websockets.connect(
        VOICE_AGENT_WS,
        additional_headers={"Authorization": f"Bearer {ASSEMBLYAI_KEY}"},
    ) as va_ws:
        await va_ws.send(json.dumps(session_config))
        ready = asyncio.Event()
        state = {"last_type": None, "pending": [], "transferring": False}

        async def twilio_to_va():
            async for raw in twilio_ws.iter_text():
                event = json.loads(raw)
                kind = event.get("event")
                if kind == "start":
                    stream_sid["value"] = event["start"]["streamSid"]
                    call_sid["value"] = event["start"]["callSid"]
                elif kind == "media" and ready.is_set():
                    await va_ws.send(json.dumps({
                        "type": "input.audio",
                        "audio": event["media"]["payload"],
                    }))
                elif kind == "stop":
                    return

        async def va_to_twilio():
            async for raw in va_ws:
                event = json.loads(raw)
                t = event.get("type")
                state["last_type"] = t

                if t == "session.ready":
                    ready.set()
                elif t == "reply.audio" and stream_sid["value"]:
                    await twilio_ws.send_text(json.dumps({
                        "event": "media",
                        "streamSid": stream_sid["value"],
                        "media": {"payload": event["data"]},
                    }))
                elif t == "input.speech.started":
                    state["pending"].clear()   # caller barged in; drop stale calls
                elif t == "tool.call":
                    if event["name"] == "escalate_to_human":
                        # Terminal tool: start the transfer once, never queue a result.
                        if not state["transferring"]:
                            state["transferring"] = True
                            # Keep a reference so the task is not garbage-collected.
                            state["transfer_task"] = asyncio.create_task(
                                start_transfer(call_sid["value"], event.get("arguments", {}))
                            )
                    else:
                        state["pending"].append(event)
                elif t == "reply.done":
                    await flush_pending(va_ws, state["pending"], state["last_type"])

        await asyncio.gather(twilio_to_va(), va_to_twilio())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That’s the whole bridge. Notice the asymmetry in how tools are handled in the tool.call branch: escalate_to_human fires off a telephony transfer and is never added to pending (it gets no result), while the three data tools are stashed and answered on the next reply.done. The keyterms list is doing quiet but important work — it nudges the recognizer toward the product names a helpdesk hears all day, so “Okta” doesn’t come back as “octa” and “VLAN” doesn’t become “v-land.”&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: The system prompt does the routing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There’s no separate “router” component. Routing &lt;em&gt;is&lt;/em&gt; the system prompt plus the tool descriptions — the LLM reads the caller’s problem, classifies it, and either resolves it with a tool or escalates to the right queue. Write the prompt as an operations runbook, not a personality.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# prompts.py
GREETING = "IT support, this line is recorded for quality. What can I help you with?"

SYSTEM_PROMPT = """You are the IT helpdesk voice agent for Northwind Corp.
You are calm, fast, and precise. One or two sentences per turn. Never lecture.

WHAT YOU HANDLE DIRECTLY (use search_knowledge_base, then guide the caller):
- Access: password policy, SSO setup, MFA enrollment, account unlock STEPS
- Network: VPN setup and troubleshooting, Wi-Fi, known outages
- Hardware: printer setup, monitor/dock issues, loaner requests
- Software: install/license steps for approved apps

TICKETS:
- Open a ticket with create_ticket when the issue needs follow-up or you cannot
  resolve it live. Always get the employee ID first. Read the returned ticket
  number back digit by digit and confirm it.
- For "what's the status of my ticket," use check_ticket_status with the number.

ROUTING / ESCALATION (call escalate_to_human with the right queue):
- "identity_access": account lockouts you cannot clear, suspicious login activity
- "network": a system or site down for MANY users
- "security": anything that sounds like a security incident, phishing, or breach
- "on_call": production outage outside business hours
- "service_desk": caller asks for a person, or two resolution attempts failed

HARD RULES:
- NEVER ask for or accept a password, PIN, or MFA code over the phone. To reset
  credentials, trigger the self-service reset (it sends a secure link) or escalate
  to identity_access for verified identity checks.
- NEVER invent a ticket number, an error-code meaning, or a fix. If the knowledge
  base doesn't have it, say so and open a ticket or escalate.
- Confirm ticket numbers and error codes by reading them back before acting.
- For a security incident, escalate immediately — do not troubleshoot.
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three things in that prompt earn their place. The &lt;strong&gt;routing block maps issue types to escalation queues&lt;/strong&gt;, so “the whole sales floor lost network ten minutes ago” goes to network while “I think I clicked a phishing link” goes straight to security with no troubleshooting. The &lt;strong&gt;credential rule is a security boundary, not a nicety&lt;/strong&gt;: a voice agent should never collect passwords or MFA codes; it triggers a self-service reset link or routes to a verified human check. And the &lt;strong&gt;anti-fabrication rule&lt;/strong&gt; (“never invent a ticket number, an error-code meaning, or a fix”) is what keeps the agent trustworthy: it answers from the knowledge base or it says it doesn’t know and opens a ticket. An IT agent that confidently makes up a fix is worse than no agent.&lt;/p&gt;
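
&lt;p&gt;Because routing lives in two places (the prompt and the queue enum on escalate_to_human), the two can drift apart as you edit them. One way to catch that is a small consistency check, sketched here with a hypothetical missing_queues helper and stand-in values rather than the real SYSTEM_PROMPT:&lt;/p&gt;

```python
def missing_queues(prompt, queues):
    """Return queue names that the prompt never mentions in double quotes."""
    return [q for q in queues if f'"{q}"' not in prompt]


# Stand-ins for SYSTEM_PROMPT and the queue enum from tools.py.
prompt = 'Escalate to "security" for incidents and "network" for outages.'
queues = ["service_desk", "network", "security"]
print(missing_queues(prompt, queues))   # ['service_desk']
```

&lt;p&gt;Run against the real prompt and enum, an empty list means every queue the tool accepts has a routing rule the model can read.&lt;/p&gt;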

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Ground every answer in the knowledge base&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The search_knowledge_base handler is where “looks up answers in your docs” becomes real. The pattern that matters: &lt;strong&gt;return source snippets, and instruct the model to answer only from them.&lt;/strong&gt; This is retrieval-augmented generation applied to a phone call — the LLM’s job is to read your runbook back conversationally, not to recall IT trivia from its training data.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# integrations.py
import httpx, os

KB_URL = os.environ["KB_SEARCH_URL"]
ITSM_URL = os.environ["ITSM_API_URL"]
ITSM_TOKEN = os.environ["ITSM_TOKEN"]

async def kb_search(query: str, category: str | None = None) -&amp;gt; dict:
    params = {"q": query, "top": 3}
    if category:
        params["category"] = category   # omit the filter when no category was given
    async with httpx.AsyncClient(timeout=10) as http:
        resp = await http.get(KB_URL, params=params)
        resp.raise_for_status()
        hits = resp.json().get("results", [])
    # Return only grounded snippets. The agent is instructed to answer from these.
    return {
        "found": bool(hits),
        "snippets": [{"title": h["title"], "text": h["summary"], "url": h["url"]}
                     for h in hits],
    }

async def itsm_create_ticket(args: dict) -&amp;gt; dict:
    async with httpx.AsyncClient(timeout=10) as http:
        resp = await http.post(
            f"{ITSM_URL}/tickets",
            headers={"Authorization": f"Bearer {ITSM_TOKEN}"},
            json={
                "requester_id": args["employee_id"],
                "category": args["category"],
                "priority": args.get("priority", "normal"),
                "short_description": args["summary"],
                "channel": "voice_agent",
            },
        )
        resp.raise_for_status()
        ticket = resp.json()
    return {"ticket_id": ticket["number"], "state": ticket["state"]}

async def itsm_get_ticket(ticket_id: str) -&amp;gt; dict:
    async with httpx.AsyncClient(timeout=10) as http:
        resp = await http.get(
            f"{ITSM_URL}/tickets/{ticket_id}",
            headers={"Authorization": f"Bearer {ITSM_TOKEN}"},
        )
        if resp.status_code == 404:
            return {"found": False}
        resp.raise_for_status()
        t = resp.json()
    return {"found": True, "ticket_id": t["number"], "state": t["state"],
            "assigned_to": t.get("assigned_to"), "updated": t.get("updated_at")}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The create_ticket handler returns the system-generated ticket number, and the prompt tells the agent to read it back digit by digit and confirm. That read-back loop is where Universal-3 Pro Streaming’s alphanumeric accuracy pays off twice: once when the agent hears the caller’s existing ticket number correctly, and again when the caller confirms the new one. Tag tickets with "channel": "voice_agent" so you can measure containment later — you’ll want to know how many of these the agent closed without a human.&lt;/p&gt;
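
&lt;p&gt;If you want to support the read-back in code as well as in the prompt, a handler could return a pre-spaced version of the ticket number and compare what the caller confirms against what the ITSM issued. A minimal sketch; normalize and spell_out are hypothetical helpers, not part of any API:&lt;/p&gt;

```python
def normalize(ticket_id):
    """Strip spaces and punctuation and upper-case, so 'inc 0012345' matches 'INC0012345'."""
    return "".join(ch.upper() for ch in ticket_id if ch.isalnum())


def spell_out(ticket_id):
    """Space the characters so a TTS voice reads them one at a time."""
    return " ".join(normalize(ticket_id))


print(spell_out("INC0012345"))                    # I N C 0 0 1 2 3 4 5
print(normalize("inc 0012345") == "INC0012345")   # True
print(normalize("INC0012845") == "INC0012345")    # False: one digit was misheard
```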

&lt;h2&gt;
  
  
  &lt;strong&gt;When to go BYOK instead&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The managed Voice Agent API is the fastest path: one connection, a flat rate, and STT, LLM, and TTS handled for you. But some teams need to choose their own components — a fine-tuned LLM that knows their internal systems, a cloned brand voice for TTS, or an existing orchestration stack they’ve already standardized on. That’s the bring-your-own-key path, and it’s where the competitive noise around “BYOK voice stacks” lives.&lt;/p&gt;

&lt;p&gt;Here’s the honest framing. A composed voice stack is a chain — STT, then LLM, then TTS — and &lt;strong&gt;its accuracy ceiling is set by the first link.&lt;/strong&gt; If transcription mishears INC0012345 as INC0012845, no downstream LLM or voice can recover; the agent confidently looks up the wrong ticket. So in a BYOK stack the speech-to-text layer isn’t a commodity, it’s the foundation. Universal-3 Pro Streaming (u3-rt-pro) is built for exactly the alphanumeric-dense speech IT support generates, at $0.45/hour, and it drops into any orchestrator.&lt;/p&gt;

&lt;p&gt;Tools like LiveKit and Pipecat are orchestration frameworks — they manage the media transport and the STT→LLM→TTS loop. They’re integration partners, not alternatives: you point them at Universal-3 Pro Streaming for the transcription leg. The connection is the standalone Universal Streaming API, and the auth differs from the Voice Agent API in one easy-to-miss way:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# byok_stt.py — the transcription leg of a BYOK stack
import asyncio, json, os
import websockets
from urllib.parse import urlencode

STREAMING_WS = "wss://streaming.assemblyai.com/v3/ws"
ASSEMBLYAI_KEY = os.environ["ASSEMBLYAI_API_KEY"]

async def transcribe_leg(pcm16_audio_source, on_final_transcript):
    params = urlencode({
        "speech_model": "u3-rt-pro",   # Universal-3 Pro Streaming; no default, so it is required
        "sample_rate": 8000,           # match the telephony source; don't upsample
        "format_turns": "true",
    })
    async with websockets.connect(
        f"{STREAMING_WS}?{params}",
        additional_headers={"Authorization": ASSEMBLYAI_KEY},  # RAW key, no "Bearer"
    ) as aai_ws:

        async def send_audio():
            async for pcm16_chunk in pcm16_audio_source:
                await aai_ws.send(pcm16_chunk)

        async def read_turns():
            async for raw in aai_ws:
                msg = json.loads(raw)
                if msg.get("type") == "Turn" and msg.get("end_of_turn"):
                    # Hand the final transcript to YOUR LLM, then YOUR TTS.
                    await on_final_transcript(msg["transcript"])

        await asyncio.gather(send_audio(), read_turns())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note the auth difference, because it’s the most common reason a copy-pasted snippet returns a 401: the standalone Streaming API takes the &lt;strong&gt;raw API key&lt;/strong&gt; in the Authorization header, while the Voice Agent API takes Bearer YOUR_KEY. From there, on_final_transcript is your handoff point — send the text to your own LLM, or to the OpenAI-compatible &lt;a href="https://www.assemblyai.com/docs/llm-gateway/overview" rel="noopener noreferrer"&gt;LLM Gateway&lt;/a&gt; if you want a single endpoint fronting 25+ models, then to your TTS of choice. (If you’re in Python, pip install assemblyai wraps this WebSocket in a StreamingClient with the same parameters.)&lt;/p&gt;
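
&lt;p&gt;If one codebase talks to both APIs, it may be worth putting the header difference in a single place so it cannot be mixed up. A small sketch; the auth_headers helper and its "voice_agent" and "streaming" labels are hypothetical names:&lt;/p&gt;

```python
def auth_headers(api, key):
    """Build the Authorization header for the two WebSocket APIs described above."""
    if api == "voice_agent":
        return {"Authorization": f"Bearer {key}"}   # Voice Agent API: Bearer prefix
    if api == "streaming":
        return {"Authorization": key}               # Streaming API: raw key
    raise ValueError(f"unknown api: {api}")


print(auth_headers("voice_agent", "YOUR_KEY"))   # {'Authorization': 'Bearer YOUR_KEY'}
print(auth_headers("streaming", "YOUR_KEY"))     # {'Authorization': 'YOUR_KEY'}
```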

&lt;p&gt;The decision in one line: &lt;strong&gt;use the managed Voice Agent API to ship fast on a flat rate; go BYOK with Universal-3 Pro Streaming when you need a specific LLM or voice, or you already run LiveKit or Pipecat.&lt;/strong&gt; Either way the transcription is the same accuracy, which is the part that decides whether the agent gets the ticket number right.&lt;/p&gt;
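
&lt;p&gt;To put rough numbers on the two rates quoted above, here is a back-of-the-envelope calculation. The call volume and average duration are made-up inputs, and the BYOK figure covers only the transcription leg; your LLM and TTS costs come on top of it.&lt;/p&gt;

```python
MANAGED_RATE = 4.50   # USD per hour, Voice Agent API (rate quoted in this article)
STT_RATE = 0.45       # USD per hour, Universal-3 Pro Streaming (rate quoted in this article)


def monthly_cost(calls, avg_minutes, hourly_rate):
    """Cost in USD for a month of calls billed at a flat hourly rate."""
    hours = calls * avg_minutes / 60
    return round(hours * hourly_rate, 2)


# Hypothetical volume: 1,000 calls a month averaging 6 minutes each (100 hours).
print(monthly_cost(1000, 6, MANAGED_RATE))   # 450.0
print(monthly_cost(1000, 6, STT_RATE))       # 45.0
```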

&lt;h2&gt;
  
  
  &lt;strong&gt;Measuring whether it’s working&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Four numbers tell you if the helpdesk agent is earning its keep:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Containment rate:&lt;/strong&gt; the share of calls the agent resolved without escalating. Tag voice-agent tickets (we set channel: "voice_agent" above) and compare closed-without-transfer against total. This is the headline ROI number.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation precision:&lt;/strong&gt; of the calls the agent escalated, how many genuinely needed a human. Too many unnecessary escalations means your routing rules are too eager; too few escalations (callers having to ask twice for a person) means they’re too conservative.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alphanumeric accuracy:&lt;/strong&gt; sample calls with ticket numbers and error codes, and check how often the agent captured them correctly. This is the metric most directly tied to your STT model — it’s where Universal-3 Pro Streaming should show its margin.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First-contact resolution:&lt;/strong&gt; of contained calls, how many didn’t generate a repeat call within 48 hours. A high containment rate with low first-contact resolution means the agent is closing calls it didn’t actually fix.&lt;/li&gt;
&lt;/ul&gt;
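
&lt;p&gt;Three of these can be computed directly from call records; alphanumeric accuracy still needs sampled calls checked by hand. The sketch below assumes each record carries three hypothetical boolean fields (escalated, needed_human, repeat_within_48h) that you would derive from your ITSM and telephony logs.&lt;/p&gt;

```python
def helpdesk_metrics(calls):
    """Compute containment, escalation precision, and first-contact resolution.

    Each call is a dict with hypothetical boolean fields:
    escalated, needed_human (only meaningful when escalated), repeat_within_48h.
    """
    escalated = [c for c in calls if c["escalated"]]
    contained = [c for c in calls if not c["escalated"]]

    def share(part, whole):
        # None signals "no data" instead of dividing by zero.
        return round(len(part) / len(whole), 2) if whole else None

    return {
        "containment_rate": share(contained, calls),
        "escalation_precision": share(
            [c for c in escalated if c["needed_human"]], escalated),
        "first_contact_resolution": share(
            [c for c in contained if not c["repeat_within_48h"]], contained),
    }


sample = [
    {"escalated": False, "needed_human": False, "repeat_within_48h": False},
    {"escalated": False, "needed_human": False, "repeat_within_48h": True},
    {"escalated": True, "needed_human": True, "repeat_within_48h": False},
    {"escalated": True, "needed_human": False, "repeat_within_48h": False},
]
print(helpdesk_metrics(sample))
```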

&lt;h2&gt;
  
  
  &lt;strong&gt;The complete repository&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Fork the runnable repo at &lt;a href="https://github.com/kelsey-aai/voice-agent-it-helpdesk" rel="noopener noreferrer"&gt;github.com/kelsey-aai/voice-agent-it-helpdesk&lt;/a&gt;. It includes the Twilio bridge, the four-tool definitions, the execution loop, mock KB and ITSM integrations you can swap for ServiceNow / Jira Service Management / Zendesk, the routing system prompt, and the BYOK transcription leg. Around 500 lines of Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Build your IT helpdesk agent&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The fastest way to feel the difference accuracy makes on helpdesk speech is to call your own number and read it a ticket number. &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;Create an AssemblyAI account&lt;/a&gt; to get an API key, then start from the &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;Voice Agent API docs&lt;/a&gt; — the four tools above are the whole agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I build a voice agent for IT helpdesk and technical support?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Build it on the AssemblyAI Voice Agent API, a single WebSocket that handles speech-to-text, the LLM, text-to-speech, turn detection, and tool calling. Define four function tools — search_knowledge_base for grounded how-to answers, create_ticket and check_ticket_status for your ITSM system, and escalate_to_human for transfers — and write a system prompt that classifies each call by issue type and picks the right tool or escalation queue. Bridge it to telephony with Twilio using audio/pcmu encoding, and run Universal-3 Pro Streaming so the agent transcribes ticket numbers, error codes, and asset tags accurately. The managed Voice Agent API runs at a flat $4.50/hour; a BYOK alternative uses Universal-3 Pro Streaming as the transcription layer under your own LLM and TTS.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can a voice agent create and look up tickets in ServiceNow, Jira, or Zendesk?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes. Define create_ticket and check_ticket_status as function tools in the Voice Agent API session config. When the agent calls one, your bridge receives a tool.call event, makes the corresponding REST request to your ITSM platform (ServiceNow, Jira Service Management, or Zendesk), and returns a tool.result keyed by the same call_id with the result encoded as a JSON string. The agent then reads the ticket number back to the caller. Tag tickets with a voice_agent channel so you can measure containment afterward. The ITSM platform is yours to choose — the Voice Agent API doesn’t integrate with a specific one, it just calls whatever tool you define.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does the voice agent route calls by issue type?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Routing happens in the system prompt and the tool descriptions, not in a separate component. The LLM reads the caller’s problem, classifies it into a category (access, network, hardware, software, security), and either resolves it with a knowledge-base lookup and a ticket, or escalates to the matching human queue. You encode the routing rules as plain instructions — for example, “a system down for many users goes to the network queue; anything that sounds like a security incident goes to security immediately with no troubleshooting.” Because the same category drives the KB search, the ticket, and the escalation queue, one classification carries through the whole call.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I stop the voice agent from making up answers about IT issues?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Two mechanisms. First, ground every answer: the search_knowledge_base tool returns source snippets from your runbooks, and the system prompt instructs the agent to answer only from those snippets and to say it doesn’t know (then open a ticket or escalate) when the knowledge base has nothing. This is retrieval-augmented generation on a phone call. Second, add explicit anti-fabrication rules to the prompt — “never invent a ticket number, an error-code meaning, or a fix” — and require the agent to read ticket numbers and error codes back to the caller for confirmation before acting. &lt;/p&gt;

&lt;p&gt;Accurate transcription matters here too: if the agent mishears the input, even a grounded answer is about the wrong thing.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Should I use the managed Voice Agent API or a BYOK stack for IT support?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use the managed Voice Agent API when you want to ship fast on a single connection and a flat $4.50/hour rate, with speech-to-text, LLM, and text-to-speech handled for you. Choose a bring-your-own-key (BYOK) stack when you need a specific fine-tuned LLM, a cloned brand voice, or you already run an orchestration framework like LiveKit or Pipecat. In the BYOK case, use Universal-3 Pro Streaming (u3-rt-pro, $0.45/hour) as the transcription layer — it’s the foundation of the stack, because if speech-to-text mishears a ticket number, no downstream LLM or voice can recover. LiveKit and Pipecat are orchestration partners, not alternatives; you point them at Universal-3 Pro Streaming for the speech-to-text leg.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why does speech-to-text accuracy matter so much for IT support voice agents?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Because IT support speech is unusually dense with alphanumerics — ticket numbers like INC0012345, error codes like 0x80070005, asset tags, employee IDs, VLAN numbers, and license keys. A single wrong character means the agent looks up the wrong ticket, files a useless one, or troubleshoots the wrong error. General-purpose transcription models tend to fumble exactly these strings. Universal-3 Pro Streaming is tuned for them, with 21% fewer alphanumeric errors and 28% better accuracy on consecutive numbers than the previous generation, which on a helpdesk is the difference between a call the agent contains and one it has to escalate. Adding domain key terms (product names like Okta, VLAN, Kerberos) to the session config sharpens recognition further.&lt;/p&gt;
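
&lt;p&gt;Even with a strong speech-to-text model, it can help to check the shape of an identifier before acting on it. The sketch below assumes ticket numbers look like INC followed by seven digits and error codes look like 0x followed by eight hex digits. Both formats are inferred from the examples above, so adjust them to your own systems.&lt;/p&gt;

```python
import re

# Assumed formats, inferred from the examples INC0012345 and 0x80070005.
TICKET_RE = re.compile(r"INC\d{7}")
ERROR_CODE_RE = re.compile(r"0x[0-9a-f]{8}")


def _strip_separators(transcribed):
    """Remove spaces and hyphens that often appear in spelled-out IDs."""
    return re.sub(r"[\s-]+", "", transcribed)


def parse_ticket_number(transcribed):
    """Return a well-formed ticket number, or None if the agent should re-ask."""
    candidate = _strip_separators(transcribed).upper()
    return candidate if TICKET_RE.fullmatch(candidate) else None


def parse_error_code(transcribed):
    """Return a normalized error code such as 0x80070005, or None."""
    candidate = _strip_separators(transcribed).lower()
    if ERROR_CODE_RE.fullmatch(candidate) is None:
        return None
    # Keep the 0x prefix lowercase and the hex digits uppercase.
    return "0x" + candidate[2:].upper()
```

&lt;p&gt;A None result is the cue for the agent to read the value back and ask the caller to repeat it, rather than look up a ticket that cannot exist.&lt;/p&gt;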

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>itops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Transcription accuracy vs. transcription quality: why the gap matters</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 02 Jun 2026 17:15:59 +0000</pubDate>
      <link>https://dev.to/martschweiger/transcription-accuracy-vs-transcription-quality-why-the-gap-matters-5eok</link>
      <guid>https://dev.to/martschweiger/transcription-accuracy-vs-transcription-quality-why-the-gap-matters-5eok</guid>
      <description>&lt;p&gt;Your speech-to-text model has a great word error rate. Your benchmarks look solid. So why are users still complaining that the transcription "feels wrong"?&lt;/p&gt;

&lt;p&gt;Because WER doesn't measure what customers care about.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The gap between accuracy and perception&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/blog/word-error-rate" rel="noopener noreferrer"&gt;Word error rate&lt;/a&gt; is the industry standard for measuring &lt;a href="https://www.assemblyai.com/products/speech-to-text" rel="noopener noreferrer"&gt;speech-to-text&lt;/a&gt; accuracy, and for good reason—it's quantifiable, comparable, and well-understood. But here's the problem: WER measures whether the right words appear in the transcript. It says nothing about whether the transcript &lt;em&gt;looks&lt;/em&gt; right to the person reading it.&lt;/p&gt;

&lt;p&gt;Think about it. A transcript can have a near-perfect WER and still feel broken if speakers are mislabeled, if stray audio tags clutter the output, or if &lt;a href="https://www.assemblyai.com/blog/boosting-transcript-readability-with-automatic-punctuation-and-casing-and-itn" rel="noopener noreferrer"&gt;punctuation&lt;/a&gt; is inconsistent. Conversely, a transcript with a slightly higher WER but clean formatting, accurate speaker labels, and natural paragraph breaks will feel more reliable to users.&lt;/p&gt;

&lt;p&gt;This is the perceived quality gap—and it rarely shows up in published benchmarks.&lt;/p&gt;

&lt;p&gt;The data backs this up. According to &lt;a href="https://www.assemblyai.com/voice-agent-report" rel="noopener noreferrer"&gt;AssemblyAI's Voice Agent Report&lt;/a&gt;, 55% of end users cite "having to repeat themselves" as their top frustration with voice agents, and 45% cite "frequently misheard words"—even though 82.5% of builders feel confident in their ability to build. On the builder side, 52.5% name transcription accuracy as their single biggest challenge. The gap between builder confidence and user frustration is the perceived quality gap made measurable: teams think they've solved accuracy, but users disagree.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Audio tags: when accuracy backfires&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's a concrete example of how doing the "right thing" can make transcription quality worse.&lt;/p&gt;

&lt;p&gt;Many speech-to-text systems insert audio event tags into transcripts—things like [MUSIC], [NOISE], or [LAUGHTER]. From a technical standpoint, this is accurate. The model detected non-speech audio and labeled it. WER doesn't penalize you for it. If anything, it's a feature.&lt;/p&gt;

&lt;p&gt;But when we looked at how users responded to transcripts with audio tags, the picture was different. Users reported that tagged transcripts felt &lt;em&gt;less&lt;/em&gt; accurate than untagged ones, even when the underlying words were identical. A [MUSIC] tag dropped into the middle of a meeting transcript made people doubt everything around it. "If the system is picking up background noise, how do I know the words are right?"&lt;/p&gt;

&lt;p&gt;It's not a rational response, but it's a real one, and user perception drives product outcomes like NPS scores and renewal rates.&lt;/p&gt;

&lt;p&gt;So we removed audio tags from transcripts by default. The WER didn't change. The perceived quality went up.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Speaker diarization: the trust multiplier&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Speaker mislabeling is a far more damaging perception problem than audio tags.&lt;/p&gt;

&lt;p&gt;Consider a contact center running thousands of calls through a speech-to-text pipeline every day. Each call gets transcribed with &lt;a href="https://www.assemblyai.com/blog/what-is-speaker-diarization-and-how-does-it-work" rel="noopener noreferrer"&gt;speaker diarization&lt;/a&gt;—Speaker 1 is the agent, Speaker 2 is the customer. Downstream systems use those labels to analyze &lt;a href="https://www.assemblyai.com/solutions/conversation-intelligence" rel="noopener noreferrer"&gt;agent performance&lt;/a&gt;, flag compliance issues, and generate &lt;a href="https://www.assemblyai.com/blog/conversation-intelligence-in-contact-centers" rel="noopener noreferrer"&gt;call summaries&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now imagine Speaker 1 and Speaker 2 get swapped on 3% of calls.&lt;/p&gt;

&lt;p&gt;From a WER perspective, nothing went wrong. Every word is correct. But from the customer's perspective, their analytics are corrupted. Agent performance scores are unreliable. Compliance flags fire on the wrong speaker. The entire pipeline's credibility is undermined by a problem that WER can't even see.&lt;/p&gt;
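
&lt;p&gt;To make that concrete, here is a minimal sketch with a textbook word-level WER function and a toy two-turn call. The call text and the turn-level speaker accuracy measure are illustrative assumptions, not a production metric.&lt;/p&gt;

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Assumes the reference contains at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the ref words so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


# Toy call as (speaker, text) turns. The hypothesis has identical words
# but the two speaker labels are swapped.
reference = [
    ("agent", "thanks for calling how can i help"),
    ("customer", "my card was declined"),
]
hypothesis = [
    ("customer", "thanks for calling how can i help"),
    ("agent", "my card was declined"),
]

ref_text = " ".join(text for _, text in reference)
hyp_text = " ".join(text for _, text in hypothesis)
wer = word_error_rate(ref_text, hyp_text)

# Fraction of turns whose speaker label matches the reference.
matches = sum(r[0] == h[0] for r, h in zip(reference, hypothesis))
speaker_accuracy = matches / len(reference)

print(wer, speaker_accuracy)
```

&lt;p&gt;The WER function never looks at the speaker field, so a swapped call scores exactly like a perfect one. Catching the swap takes a separate measure.&lt;/p&gt;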

&lt;p&gt;We've worked with enterprise customers pushing &lt;a href="https://www.assemblyai.com/blog/multichannel-speaker-diarization" rel="noopener noreferrer"&gt;multichannel speaker diarization&lt;/a&gt; to its limits in production—hundreds of concurrent sessions, variable audio quality, speakers talking over each other. At that scale, diarization accuracy isn't a nice-to-have. It's a trust requirement. One mislabeled speaker in a compliance-critical transcript doesn't just create an error. It destroys confidence in the system.&lt;/p&gt;

&lt;p&gt;This is why speech-to-text accuracy can't be reduced to a single number. The accuracy that matters is contextual—right words and right speaker, delivered in a structure users can trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The hard problem of streaming corrections&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;Streaming speech-to-text&lt;/a&gt; introduces a unique challenge for perceived quality: the output is live.&lt;/p&gt;

&lt;p&gt;When you're transcribing pre-recorded audio, you have the luxury of processing the entire file before returning results. You can recluster speakers at the end, clean up edge cases, and deliver a polished final transcript. With streaming, you're committing to output in real time—sub-300ms latency for &lt;a href="https://www.assemblyai.com/universal-3-pro" rel="noopener noreferrer"&gt;Universal-3 Pro Streaming&lt;/a&gt;—which means you sometimes have to make decisions with incomplete information.&lt;/p&gt;

&lt;p&gt;Speaker assignment is a perfect example. Early in a conversation, the model hasn't heard enough audio to confidently distinguish speakers. It makes its best guess and moves on. Later, with more context, it might realize that the initial speaker assignments need correction.&lt;/p&gt;

&lt;p&gt;The brute-force solution is end-of-stream reclustering: wait until the conversation ends, then reprocess all speaker labels with full context. That works for some use cases. But for applications where transcripts are consumed in real time (&lt;a href="https://www.assemblyai.com/blog/what-is-real-time-agent-assist" rel="noopener noreferrer"&gt;live agent assist&lt;/a&gt;, real-time coaching, &lt;a href="https://www.assemblyai.com/blog/call-center-analytics" rel="noopener noreferrer"&gt;compliance monitoring&lt;/a&gt;), waiting until the end isn't an option. Users have already seen the initial labels. A late correction feels like an error, even when it's an improvement.&lt;/p&gt;

&lt;p&gt;So we're developing a different approach: speaker revision messages that arrive shortly after the initial output. Each one is a delayed correction that updates speaker labels while the conversation is still active, rather than waiting for the end. Recent &lt;a href="https://www.assemblyai.com/blog/streaming-speaker-diarization" rel="noopener noreferrer"&gt;streaming diarization&lt;/a&gt; improvements have already delivered measurable gains: a 56% reduction in phantom speaker detections, word-level speaker labels (rather than utterance-level), and fewer false alarm speakers across production workloads. It's a significant engineering investment that doesn't improve WER at all. What it improves is the user experience. The transcript stays accurate &lt;em&gt;as it's being read&lt;/em&gt;, not just after it's finished.&lt;/p&gt;
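
&lt;p&gt;On the client side, handling a revision could look something like the sketch below. The message types, field names, and word IDs are hypothetical, since the real revision format isn't described here. The point is only that a revision changes labels on words already shown, without touching the words themselves.&lt;/p&gt;

```python
# Hypothetical message shapes, invented for this sketch.
def apply_message(transcript, message):
    """Update a live transcript (word_id mapped to a word record) in place."""
    if message["type"] == "word":
        transcript[message["word_id"]] = {
            "text": message["text"],
            "speaker": message["speaker"],
        }
    elif message["type"] == "speaker_revision":
        for word_id in message["word_ids"]:
            # Only the label changes; the text the reader already saw stays put.
            if word_id in transcript:
                transcript[word_id]["speaker"] = message["speaker"]
    return transcript


transcript = {}
stream = [
    {"type": "word", "word_id": 0, "text": "hello", "speaker": "A"},
    {"type": "word", "word_id": 1, "text": "hi", "speaker": "A"},
    # With more context, the second word turns out to belong to another speaker.
    {"type": "speaker_revision", "word_ids": [1], "speaker": "B"},
]
for message in stream:
    apply_message(transcript, message)
```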

&lt;p&gt;That's the kind of investment you make when you understand that transcription quality is a perception problem, not just a measurement problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What "quality" means in production&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's a framework for thinking about speech-to-text accuracy that goes beyond WER:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Word-level accuracy&lt;/strong&gt; is the foundation. You need the right words. Universal-3 Pro achieves a 94.1% word accuracy rate across 26 real-world datasets, compared to 93.5% for ElevenLabs Scribe, 92.5% for Microsoft, 92.4% for OpenAI and Amazon, and 92.1% for Deepgram Nova-3. That's table stakes for serious applications, and the gap between providers is wider than it looks: the 2.0-point spread between 94.1% and 92.1% works out to roughly 20,000 additional word errors per million words transcribed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Entity accuracy&lt;/strong&gt; is where differentiation starts. Getting names, numbers, email addresses, and domain-specific terms right matters disproportionately. A transcript that nails common words but mangles a customer's name or a dollar amount is worse than one with a slightly higher overall error rate that gets the important things right. This is exactly what Missed Entity Rate (MER) measures—and the gaps are significant. &lt;a href="https://www.assemblyai.com/universal-3-pro" rel="noopener noreferrer"&gt;Universal-3 Pro&lt;/a&gt; achieves a 13.1% MER on names (vs. 15.3–19.4% for competitors), 12.0% on medical terms (vs. 15.3–18.4%), and 34.3% on emails and URLs (vs. 62–72% for every other provider tested). For applications where getting a customer's name or email right on the first try determines whether the interaction succeeds, entity accuracy is the metric that matters.&lt;/p&gt;
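
&lt;p&gt;The exact MER methodology isn't spelled out here, so the sketch below uses one plausible reading: the share of reference entities that do not appear verbatim (ignoring case) in the transcript. The function name, the matching rule, and the sample data are all assumptions.&lt;/p&gt;

```python
def missed_entity_rate(reference_entities, hypothesis_text):
    """Share of reference entities missing from the hypothesis transcript.

    Assumed definition: an entity counts as captured only if it appears
    verbatim, ignoring case, somewhere in the hypothesis.
    """
    if not reference_entities:
        return 0.0
    hyp = hypothesis_text.lower()
    missed = [e for e in reference_entities if e.lower() not in hyp]
    return len(missed) / len(reference_entities)


# Illustrative data: the email address is misspelled in the transcript.
entities = ["Priya Raman", "priya@example.com", "$1,250"]
hypothesis = "Priya Raman asked us to send the $1,250 invoice to pria@example.com"

print(missed_entity_rate(entities, hypothesis))
```

&lt;p&gt;A single dropped letter barely moves WER for this sentence, but it counts as a full miss for the one entity the follow-up email depends on.&lt;/p&gt;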

&lt;p&gt;&lt;strong&gt;Structural accuracy&lt;/strong&gt; is the perception layer. Are speakers correctly labeled? Is punctuation natural? Are sentence boundaries in the right places? Does the transcript read like something a human would produce? This is what determines whether users trust the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temporal accuracy&lt;/strong&gt; matters for streaming. Are corrections timely enough that users don't notice them? Does the transcript stay coherent as it's being generated? Real-time applications add a fourth dimension to quality that batch processing doesn't have to worry about.&lt;/p&gt;

&lt;p&gt;Most transcription quality best practices focus on the first two layers. But production applications—especially those where humans read the transcripts—live or die on layers three and four. For a complete walkthrough of how to &lt;a href="https://www.assemblyai.com/blog/how-to-evaluate-speech-recognition-models" rel="noopener noreferrer"&gt;evaluate speech recognition models&lt;/a&gt; across all four layers, including ground truth correction and metric selection, see our evaluation guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why this is hard to benchmark&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The challenge with perceived quality is that it resists standardization. You can publish a WER number. You can't publish a "perceived quality" number.&lt;/p&gt;

&lt;p&gt;That doesn't mean you can't measure it. User satisfaction surveys, support ticket categorization, A/B testing of formatting choices, monitoring of downstream pipeline accuracy—these are all proxies for perceived quality. They're harder to run than a benchmark, but they tell you something a benchmark can't.&lt;/p&gt;

&lt;p&gt;The industry is starting to build better tools for this. &lt;a href="https://www.assemblyai.com/blog/new-word-error-rate-wer-benchmark" rel="noopener noreferrer"&gt;Semantic WER&lt;/a&gt;—an emerging metric that uses an LLM as a judge to evaluate whether meaning is preserved, rather than checking word-for-word accuracy—is one promising direction. Instead of penalizing a model for transcribing "cannot" instead of "can't," Semantic WER asks whether the intent was preserved. Combined with &lt;a href="https://www.assemblyai.com/blog/word-error-rate-is-broken" rel="noopener noreferrer"&gt;Missed Entity Rate&lt;/a&gt; and domain-specific keyword accuracy, these newer metrics get closer to measuring what users actually experience. We've written extensively about &lt;a href="https://www.assemblyai.com/blog/word-error-rate-is-broken" rel="noopener noreferrer"&gt;why WER alone is insufficient&lt;/a&gt; and what to use instead.&lt;/p&gt;

&lt;p&gt;Improving transcription quality in production comes down to closing the loop between raw model output and user experience. Measure what your users see, not just what your model produces. If they're complaining about readability, formatting, or speaker accuracy, your WER score doesn't matter. You have a quality problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Perceived quality is the next competitive frontier&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's where the industry is heading: as speech-to-text WER converges across providers—and it will, because the underlying research is increasingly shared—perceived quality becomes the primary differentiator. Two providers with near-identical WER will deliver radically different user experiences depending on how they handle formatting, speaker attribution, and real-time corrections.&lt;/p&gt;

&lt;p&gt;This means the evaluation criteria for &lt;a href="https://www.assemblyai.com/blog/how-accurate-speech-to-text" rel="noopener noreferrer"&gt;speech-to-text&lt;/a&gt; are shifting. Teams that only benchmark on WER are optimizing for a metric that's becoming commoditized. The teams building durable products are asking different questions: Do users trust what they see? Does the output hold up under real-time consumption? Can the system correct itself without breaking the reader's confidence?&lt;/p&gt;

&lt;p&gt;If you're building an application where humans read transcripts—a &lt;a href="https://www.assemblyai.com/solutions/contact-centers" rel="noopener noreferrer"&gt;contact center&lt;/a&gt; agent reviewing a live summary, or a developer building a &lt;a href="https://www.assemblyai.com/solutions/voice-agents" rel="noopener noreferrer"&gt;voice agent&lt;/a&gt; pipeline—the question isn't just "how accurate is the speech-to-text?" It's "how accurate does it &lt;em&gt;feel&lt;/em&gt;?"&lt;/p&gt;

&lt;p&gt;That gap between raw accuracy and perceived quality is where the next generation of &lt;a href="https://www.assemblyai.com/products" rel="noopener noreferrer"&gt;Voice AI&lt;/a&gt; products will be won or lost.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is perceived transcription quality and how does it differ from WER?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Perceived transcription quality measures how accurate a transcript &lt;em&gt;feels&lt;/em&gt; to the person reading it—factoring in speaker labels, formatting, punctuation, and entity accuracy—rather than just word-for-word correctness. WER only counts substitutions, insertions, and deletions against a reference transcript. A transcript with perfect WER can still feel broken if speakers are mislabeled or punctuation is inconsistent, while a slightly higher-WER transcript with clean formatting and correct names often feels more reliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why do users complain about transcription quality even when WER is low?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Because WER treats all words equally and ignores structural elements users care about. A misheard filler word and a mangled customer name both count as one error in WER, but users notice the name error far more. Speaker mislabeling, stray audio tags, and inconsistent punctuation also degrade perceived quality without affecting WER at all. AssemblyAI's Voice Agent Report found that 55% of end users cite "having to repeat themselves" as their top frustration—a perception problem, not a WER problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do you measure transcription quality beyond word error rate?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Newer metrics like Semantic WER (which uses an LLM judge to evaluate meaning preservation), Missed Entity Rate (which tracks accuracy on names, numbers, emails, and domain-specific terms), and domain-specific keyword accuracy get closer to what users experience. AssemblyAI's Universal-3 Pro achieves a 13.1% MER on names (vs. 15.3–19.4% for competitors) and 34.3% on emails/URLs, roughly half the 62–72% miss rate of the other providers tested.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is speaker diarization and why does it affect perceived transcription quality?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Speaker diarization identifies "who spoke when" in multi-speaker audio, assigning labels like Speaker A and Speaker B throughout a transcript. When diarization is wrong, even on just 3% of calls, it corrupts downstream analytics, compliance flags, and call summaries. Users lose trust in the entire system because the errors are visible and disruptive, even though WER stays the same. AssemblyAI's diarization achieves a 2.9% speaker count error rate, with recent streaming improvements reducing phantom speaker detections by 56%.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does streaming transcription handle speaker label accuracy in real time?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Streaming diarization must assign speaker labels immediately as audio arrives, without the full-conversation context that lets batch processing recluster speakers at the end. Early in a conversation, limited audio context means speaker assignments may be less stable. AssemblyAI is developing speaker revision messages (delayed corrections that update labels while the conversation is still active), and recent streaming improvements have already delivered word-level speaker labeling and fewer false alarm speakers across production workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is Semantic WER and how does it improve speech-to-text evaluation?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Semantic WER uses a reasoning model (like Claude) to evaluate whether a transcript preserves the meaning of what was said, rather than checking exact word matches. "Cannot" vs. "can't" registers as an error in traditional WER but scores identically in Semantic WER because the meaning is preserved. This matters especially for voice agent pipelines where transcripts feed directly into LLMs—the downstream model doesn't care about exact wording, only intent. Combined with Missed Entity Rate, Semantic WER provides a more complete picture of real-world transcription quality.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>speechtotext</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Best API for building a speech-to-speech voice agent in 2026</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 19 May 2026 19:28:14 +0000</pubDate>
      <link>https://dev.to/martschweiger/best-api-for-building-a-speech-to-speech-voice-agent-in-2026-56np</link>
      <guid>https://dev.to/martschweiger/best-api-for-building-a-speech-to-speech-voice-agent-in-2026-56np</guid>
      <description>&lt;p&gt;A speech-to-speech voice agent API replaces the three separate components most teams used to wire together—&lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;streaming speech-to-text&lt;/a&gt;, a language model, and text-to-speech—with a single API that takes audio in and returns audio out. In 2026, that category has gone from "interesting demo" to "default way to ship a production voice agent," and the gap between providers is now measurable in latency, accuracy, and what they let you do with tool calls.&lt;/p&gt;

&lt;p&gt;This guide compares the speech-to-speech voice agent APIs developers actually pick from in 2026, what each one is best at, and how to choose between a true speech-to-speech API and a chained STT-LLM-TTS pipeline. We'll cover &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;AssemblyAI's Voice Agent API&lt;/a&gt;, OpenAI Realtime, Google Gemini Live, Deepgram, ElevenLabs Conversational AI, Retell, Bland, and Hume, plus where Vapi and Pipecat fit if you'd rather orchestrate the components yourself (covered in our &lt;a href="https://www.assemblyai.com/blog/orchestration-tools-ai-voice-agents" rel="noopener noreferrer"&gt;orchestration tools comparison&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is a speech-to-speech voice agent API?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A speech-to-speech voice agent API is a single API endpoint—usually a WebSocket—that accepts a user's audio stream and returns the agent's audio response, with everything in between (transcription, reasoning, tool calls, voice synthesis) hidden behind one connection. You send mic audio in. You get the agent's voice back. You don't manage three providers, three sets of API keys, or three sets of latency budgets.&lt;/p&gt;

&lt;p&gt;That's the practical definition. Under the hood, there are two architectural patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chained (cascading) speech-to-speech APIs&lt;/strong&gt;: Internally pipe streaming STT → LLM → streaming TTS, but expose a single API. The advantage is you can swap each layer for best-in-class models. AssemblyAI's Voice Agent API is the leading example.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native speech-to-speech models&lt;/strong&gt;: A single model trained end-to-end on audio that takes audio tokens in and emits audio tokens out, in some cases with no intermediate text. OpenAI Realtime, Google Gemini Live, and Hume's EVI fall here. The pitch is lower latency and richer audio understanding (laughter, tone). The trade-off is less transparency, narrower language support, and weaker text reasoning than a frontier text LLM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both expose the same developer surface—one connection, audio in/audio out—so the choice is about which trade-offs match your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Best speech-to-speech voice agent APIs in 2026&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;API&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Speech accuracy&lt;/th&gt;
&lt;th&gt;P50 latency&lt;/th&gt;
&lt;th&gt;Tool calling&lt;/th&gt;
&lt;th&gt;Languages&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AssemblyAI Voice Agent API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chained, single WebSocket&lt;/td&gt;
&lt;td&gt;Industry-leading on phone audio, alphanumerics (16.7% missed entity rate)&lt;/td&gt;
&lt;td&gt;307ms STT + sub-second end-to-end&lt;/td&gt;
&lt;td&gt;Yes, model-routed, with intermediate speech (no silence during tool calls)&lt;/td&gt;
&lt;td&gt;6 streaming (EN/ES/FR/DE/IT/PT), expanding&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;$4.50/hr&lt;/strong&gt; flat&lt;/td&gt;
&lt;td&gt;Production voice agents where speech accuracy decides whether it ships&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI Realtime API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native speech-to-speech (GPT-4o audio)&lt;/td&gt;
&lt;td&gt;Strong on clean studio audio, weaker on telephony (23.3% missed entity rate)&lt;/td&gt;
&lt;td&gt;~500–800ms end-to-end&lt;/td&gt;
&lt;td&gt;Yes, OpenAI tool format (goes silent during tool calls)&lt;/td&gt;
&lt;td&gt;~50 (varies by feature)&lt;/td&gt;
&lt;td&gt;~$18/hr per-token billing across 30+ event types&lt;/td&gt;
&lt;td&gt;Demos, browser-first apps, conversational toys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deepgram Voice Agent API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chained, cascading&lt;/td&gt;
&lt;td&gt;Good general accuracy, weaker on entities (25.5% missed entity rate)&lt;/td&gt;
&lt;td&gt;~1–1.5 seconds end-to-end&lt;/td&gt;
&lt;td&gt;Yes, custom functions supported (goes silent during tool calls)&lt;/td&gt;
&lt;td&gt;EN, ES, NL, FR, DE, IT, JA&lt;/td&gt;
&lt;td&gt;~$4.50/hr, concurrency commitments required&lt;/td&gt;
&lt;td&gt;Teams already invested in Deepgram's ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Gemini Live API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native speech-to-speech (Gemini 2 audio)&lt;/td&gt;
&lt;td&gt;Strong on Google's voice eval set&lt;/td&gt;
&lt;td&gt;~600–900ms end-to-end&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;30+&lt;/td&gt;
&lt;td&gt;Usage-based, varies by tier&lt;/td&gt;
&lt;td&gt;Apps already on GCP / Gemini, multimodal (vision + voice) demos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElevenLabs Conversational AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chained, ElevenLabs-orchestrated&lt;/td&gt;
&lt;td&gt;Depends on STT chosen (configurable)&lt;/td&gt;
&lt;td&gt;Sub-second end-to-end&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;30+&lt;/td&gt;
&lt;td&gt;Per-minute, ~$0.09–0.30/min&lt;/td&gt;
&lt;td&gt;Teams that want premium TTS as the headline and don't want to tune STT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retell&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chained, orchestrated&lt;/td&gt;
&lt;td&gt;Configurable STT&lt;/td&gt;
&lt;td&gt;Sub-500ms voice-to-voice on phone&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;20+&lt;/td&gt;
&lt;td&gt;Per-minute, ~$0.07–0.17/min&lt;/td&gt;
&lt;td&gt;Phone-first agents prioritizing turn-taking naturalness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bland&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chained, self-hostable&lt;/td&gt;
&lt;td&gt;Configurable&lt;/td&gt;
&lt;td&gt;Sub-second&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;10+&lt;/td&gt;
&lt;td&gt;Per-minute or self-hosted&lt;/td&gt;
&lt;td&gt;Enterprises with strict data residency / on-prem requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hume EVI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native speech-to-speech, emotion-aware&lt;/td&gt;
&lt;td&gt;Decent&lt;/td&gt;
&lt;td&gt;Sub-second&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;English-focused&lt;/td&gt;
&lt;td&gt;Per-minute&lt;/td&gt;
&lt;td&gt;Emotion-sensitive use cases (mental health, coaching)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vapi&lt;/td&gt;
&lt;td&gt;Orchestration (not S2S, but feels like it)&lt;/td&gt;
&lt;td&gt;Depends on chosen STT&lt;/td&gt;
&lt;td&gt;Sub-second when tuned&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Wide&lt;/td&gt;
&lt;td&gt;Per-minute + pass-through provider costs&lt;/td&gt;
&lt;td&gt;Teams that want to swap STT/LLM/TTS per-deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pipecat / LiveKit Agents&lt;/td&gt;
&lt;td&gt;Open-source orchestration&lt;/td&gt;
&lt;td&gt;Depends on STT&lt;/td&gt;
&lt;td&gt;Sub-second when tuned&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Wide&lt;/td&gt;
&lt;td&gt;Compute + provider costs&lt;/td&gt;
&lt;td&gt;Teams who want full ownership of the pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things stand out in 2026. End-to-end latency under one second is now table stakes, not a differentiator: nearly every provider on this list will get you there with a reasonable network. What separates them is &lt;strong&gt;speech accuracy on real-world audio&lt;/strong&gt; (phone calls, accents, alphanumerics), &lt;strong&gt;how tool calling behaves under load&lt;/strong&gt;, and &lt;strong&gt;whether the pricing model survives contact with a real customer base&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In our &lt;a href="https://www.assemblyai.com/voice-agent-report" rel="noopener noreferrer"&gt;Voice Agent Report&lt;/a&gt;, 76% of respondents rated speech-to-text accuracy as the single most important non-negotiable when building voice agents, above latency, cost, and integration capabilities. That finding maps directly to what we see in the comparison data: the accuracy gap between providers on real-world entities (phone numbers, emails, confirmation codes) is where production agents succeed or fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to choose the best speech-to-speech voice agent API&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The voice agent you ship depends on four decisions. Get any of them wrong and the agent feels off, even if the demo was great.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Speech-to-text accuracy on your actual audio&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Most providers benchmark on studio audio. Your users are on phones, in cars, in drive-thrus, and rattling off order numbers and email addresses. The two accuracy metrics that actually matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Alphanumeric accuracy&lt;/strong&gt;: How well the model captures phone numbers, confirmation codes, emails, and order IDs. This is where the gap between providers shows up most clearly. In head-to-head testing, AssemblyAI's &lt;a href="https://www.assemblyai.com/universal-3-pro-streaming" rel="noopener noreferrer"&gt;Universal-3 Pro Streaming&lt;/a&gt; delivers a 16.7% missed entity rate on alphanumerics, compared to 23.3% for OpenAI and 25.5% for Deepgram. That's the difference between capturing "RX-7704132" correctly on the first try and hearing "dash seven seven zero four one three two." It also delivers 21% fewer alphanumeric errors and 28% better accuracy on consecutive numbers than the previous generation. This is the single most under-measured metric in voice agent demos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity accuracy on proper nouns&lt;/strong&gt;: Company names, people's names, drug names, product titles. If your agent writes "Corel" instead of "Coral" into the CRM, the lead is unreachable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Native speech-to-speech models like OpenAI Realtime and Gemini Live were trained more on clean conversational audio than on telephony, which shows up the moment you put them on a Twilio call.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Turn-taking and interruption handling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Poor turn detection is the most common reason voice agents feel unnatural. The agent either talks over the user or sits in awkward silence. The best implementations handle turn detection at the model level, not as an afterthought bolted on with a fixed silence timer.&lt;/p&gt;

&lt;p&gt;AssemblyAI's Universal-3 Pro Streaming includes acoustic turn detection built directly into the model, with semantic endpointing that combines acoustic pauses with intent signals—using a semantic + neural network + VAD approach rather than basic silence-based VAD. Retell ships its own proprietary turn-taking model. OpenAI Realtime's server VAD is competent but configurable timeouts still trip up agents on calls with hesitant speakers. Deepgram relies on traditional VAD only, without the semantic or neural layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Tool calling reliability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Real voice agents don't just talk—they book the appointment, look up the order, charge the card. That means the underlying LLM has to call tools mid-conversation, fast enough that the silence doesn't become obvious.&lt;/p&gt;

&lt;p&gt;The bar to clear: tool calls under 500ms round-trip, structured outputs that don't hallucinate parameters, and the ability to call multiple tools in a single turn. But there's a UX dimension most teams overlook: &lt;strong&gt;what happens while the tool call is executing?&lt;/strong&gt; AssemblyAI's Voice Agent API generates intermediate speech during tool execution—the agent says something like "Let me look that up for you" rather than going silent. Both OpenAI Realtime and Deepgram go silent during tool calls, which creates an awkward dead-air gap that makes users wonder if the connection dropped.&lt;/p&gt;

&lt;p&gt;AssemblyAI's Voice Agent API exposes a clean function-calling surface that routes through the underlying model with structured-output guarantees. OpenAI Realtime supports tool calling natively. Some orchestration platforms add their own retry and validation logic on top.&lt;/p&gt;

&lt;p&gt;If your agent's job is "capture data and put it somewhere"—booking a meeting, qualifying a lead, taking an order, scheduling a callback—tool calling reliability is what decides whether the agent actually does its job.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Pricing model and unit economics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is the trap most teams fall into during pilots. Per-minute pricing looks cheap until you're running 500 simultaneous calls during a support spike. Per-token audio pricing (OpenAI Realtime) is unpredictable because audio output tokens are priced at 10–20x text tokens, so a chatty TTS voice burns through your budget.&lt;/p&gt;

&lt;p&gt;A few patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flat hourly pricing&lt;/strong&gt; : AssemblyAI's Voice Agent API at &lt;strong&gt;$4.50/hour&lt;/strong&gt; covers STT, LLM inference, TTS, and tool calling. One bill, one line of math to model what a 5-minute call costs. No separate meters for audio in, audio out, text in, text out. No concurrency commitments. Easy to forecast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-minute, all-in&lt;/strong&gt; : Retell, Bland, ElevenLabs Conversational AI. Predictable, but adds up at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flat hourly with concurrency commitments&lt;/strong&gt; : Deepgram's voice agent API is also ~$4.50/hour, but requires concurrency-metered billing—meaning you're committing to a certain number of simultaneous sessions. That changes the economics at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-token audio&lt;/strong&gt; : OpenAI Realtime. ~$18/hour with 30+ billing event types. Best for low-volume; hard to forecast at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pass-through + platform fee&lt;/strong&gt; : Vapi, LiveKit. You pay each underlying provider plus a platform fee—flexible but more accounting overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Forecast what 100 hours of conversation actually costs across the providers you're considering. The gap between pricing models is real, especially once you stop being charged for demo calls and start being charged for production.&lt;/p&gt;
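
&lt;p&gt;As a rough illustration of that forecast, here is a minimal Python sketch. It assumes the approximate hourly figures quoted in this article ($4.50/hour flat, ~$18/hour per-token, and a per-minute range of $0.07 to $0.30); the dictionary keys and function names are made up for the example, and real invoices will depend on each provider's current pricing.&lt;/p&gt;

```python
# Approximate hourly rates quoted in this article. These are assumptions for
# illustration, not live pricing: check each provider's pricing page.
HOURLY_RATE_USD = {
    "flat_hourly": 4.50,            # flat-rate example
    "per_minute_low": 0.07 * 60,    # $0.07/min tier expressed per hour
    "per_minute_high": 0.30 * 60,   # $0.30/min tier expressed per hour
    "per_token_audio": 18.00,       # ~$18/hour estimate
}


def forecast_cost(hours, hourly_rate):
    """Cost in USD of `hours` of conversation at `hourly_rate`."""
    return round(hours * hourly_rate, 2)


def cost_per_call(call_minutes, hourly_rate):
    """Cost in USD of one call lasting `call_minutes`."""
    return round(call_minutes / 60 * hourly_rate, 4)


for name, rate in HOURLY_RATE_USD.items():
    total = forecast_cost(100, rate)
    single = cost_per_call(5, rate)
    print(f"{name}: 100 h = ${total:,.2f}, 5-minute call = ${single:.3f}")
```

&lt;p&gt;Swap in your own expected hours and the rates from the quotes you actually receive; the point is only that the comparison fits in a few lines when every provider is normalized to dollars per hour.&lt;/p&gt;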

&lt;h2&gt;
  
  
  &lt;strong&gt;AssemblyAI Voice Agent API: one WebSocket, flat-rate, built on Universal-3 Pro Streaming&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AssemblyAI's Voice Agent API is a single WebSocket that takes user audio in and streams agent audio out, with STT, LLM, TTS, turn detection, and tool calling handled inside the connection. It replaces three separate providers with one bill, one set of logs, and one set of latency variables to tune.&lt;/p&gt;

&lt;p&gt;What makes it work as a speech-to-speech voice agent API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speech accuracy that survives phone audio.&lt;/strong&gt; The STT layer is Universal-3 Pro Streaming, the same model trusted by enterprise voice agent teams for production deployments—307ms P50 latency, native 8kHz mulaw support, immutable transcripts, and a 16.7% alphanumeric missed error rate that's measurably better than OpenAI (23.3%) and Deepgram (25.5%). When the STT is this accurate, the whole conversation is better because the agent is actually responding to what was said.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling that doesn't go silent.&lt;/strong&gt; Define your tools, the model calls them, results stream back into the conversation. Unlike OpenAI Realtime and Deepgram, the agent generates intermediate speech during tool execution—natural transition phrases like "Let me check on that"—so your users never hear dead air. Useful for the lead-qualification, appointment-setting, and structured-data-capture use cases where voice agents have the strongest product-market fit today.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mid-session updates without reconnecting.&lt;/strong&gt; Update the system prompt, voice, tools, and VAD settings mid-conversation with a JSON message—no reconnection, no redeployment. OpenAI Realtime only supports updating prompt and tools. Deepgram supports prompt and voice only. AssemblyAI is the only provider that lets you update all four mid-session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session resumption.&lt;/strong&gt; If the WebSocket drops, reconnect within 30 seconds and pick up where the conversation left off. Context is preserved. Neither OpenAI Realtime nor Deepgram offers session resumption—a dropped connection means starting over.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flat-rate pricing.&lt;/strong&gt; $4.50/hour of session time, no per-token audio surprises, no per-provider invoices, no concurrency commitments. This includes STT, LLM, TTS, turn detection, and tool calling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One API to learn.&lt;/strong&gt; The Voice Agent API is one WebSocket. You don't wire together a streaming STT WebSocket, an LLM HTTP endpoint, a TTS streaming connection, and your own turn-detection logic. The plumbing is in the API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built for production.&lt;/strong&gt; Unlimited concurrency, session resumption, structured logs per session, and the same SOC 2 / BAA-eligible infrastructure that already runs AssemblyAI's speech-to-text platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where it fits in the landscape: AssemblyAI's Voice Agent API is the choice when &lt;strong&gt;speech accuracy decides whether the agent ships&lt;/strong&gt;. If your agent is taking phone calls, capturing structured data, or operating in a regulated industry where you need a BAA, this is the speech-to-speech voice agent API to build on.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;When to use a chained pipeline instead&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A speech-to-speech voice agent API is the right answer for most teams in 2026. But there are three cases where chaining the layers yourself still wins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You need a specific LLM&lt;/strong&gt; : A frontier text LLM like Claude or Gemini that isn't exposed inside any S2S API yet. Most S2S APIs let you choose, but if you need a model that isn't on the list, chain it yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need a specific TTS voice&lt;/strong&gt; : A cloned voice, a specific accent, or a non-standard language model. Most S2S APIs let you bring your own TTS, but if you need fine control, a chained pipeline is more flexible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You have regulated data residency&lt;/strong&gt; : Some industries require every layer to run in your VPC. A chained, self-hosted pipeline (with Bland for the orchestration, or fully self-built) is the only path.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're chaining, the layer that decides whether the agent works is still the streaming STT. The &lt;a href="https://www.assemblyai.com/blog/real-time-speech-to-text-best-for-voice-agents" rel="noopener noreferrer"&gt;best streaming speech-to-text model for voice agents&lt;/a&gt; discussion comes down to the same accuracy and latency criteria covered above.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Common use cases for speech-to-speech voice agents&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The pattern in 2026 is consistent: &lt;a href="https://www.assemblyai.com/blog/ai-voice-agents" rel="noopener noreferrer"&gt;speech-to-speech voice agents&lt;/a&gt; work best on high-volume, structured calls where the agent's job is to &lt;strong&gt;capture or look up data&lt;/strong&gt; rather than reason open-endedly. The teams shipping production agents converge on these use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lead qualification and outbound sales discovery&lt;/strong&gt; : Ask BANT questions, book qualified meetings, sync to the CRM. Turn-taking quality is the differentiator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Appointment scheduling and confirmations&lt;/strong&gt; : Medical offices, salons, service businesses. Alphanumeric accuracy on dates, times, and confirmation codes is non-negotiable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Food ordering and reservations&lt;/strong&gt; : High-accuracy data capture on menu items, special requests, payment info.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer support tier-1 deflection&lt;/strong&gt; : Order status, account questions, basic troubleshooting. Best paired with explicit escalation paths. See our guide to &lt;a href="https://www.assemblyai.com/blog/voice-ai-for-customer-service" rel="noopener noreferrer"&gt;Voice AI for customer service&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insurance verification and benefits lookup&lt;/strong&gt; : Getting plan numbers, group IDs, and member info right the first time, which is the same accuracy bar that drives &lt;a href="https://www.assemblyai.com/solutions/medical" rel="noopener noreferrer"&gt;voice agents in healthcare&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outbound reminders and surveys&lt;/strong&gt; : Post-visit follow-ups, payment reminders, satisfaction surveys.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread across all of these: the agent is capturing or retrieving specific data, the conversation has a predictable structure, and the cost of a transcription error is concrete. That's where a speech-to-speech voice agent API earns its keep over a human agent or an IVR.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to evaluate a speech-to-speech voice agent API before you commit&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Demos are unreliable. Vendor benchmarks are unreliable. Here's the evaluation loop teams actually use before signing a contract:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Record 50 real or representative calls&lt;/strong&gt; for your use case, including accents, background noise, alphanumeric content, and interruptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run them through each API's playground or trial.&lt;/strong&gt; Measure the error rate on the alphanumeric tokens specifically: phone numbers, confirmation codes, emails, dollar amounts. General WER is misleading. The alphanumeric missed error rates cited earlier (AssemblyAI at 16.7%, OpenAI at 23.3%, Deepgram at 25.5%) are a starting point; run your own audio to see how those numbers hold on your data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time the turn-taking.&lt;/strong&gt; Mark every "caller-stops-speaking" moment and measure how long until the agent starts responding. Sub-800ms is the threshold for natural-feeling conversation. Pay attention to how each provider handles turn detection—semantic + neural approaches outperform basic VAD on hesitant or accented speakers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test tool calling under load.&lt;/strong&gt; Define three real tools and have the agent call them mid-conversation. Measure round-trip time and error rate. Also note whether the agent speaks naturally during tool execution or goes silent—this makes a bigger UX difference than most teams expect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read every transcript.&lt;/strong&gt; You'll catch prompt failures, silently wrong transcriptions, and hallucinated tool parameters that you'd never notice by listening.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most teams skip step 2 and ship with a model that fumbles confirmation codes silently. Don't.&lt;/p&gt;
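
&lt;p&gt;Step 2 can be scored with a few lines of Python. The sketch below is one plausible way to compute a missed-entity rate, not the methodology behind any vendor's published numbers: it assumes you have hand-labeled the entities each call should contain, and it counts an entity as missed when its normalized form does not appear in the normalized transcript. The function names and sample data are invented for the example.&lt;/p&gt;

```python
import re


def normalize(text):
    """Uppercase and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", text.upper())


def missed_entity_rate(samples):
    """Share of reference entities that do not appear in the transcripts.

    samples: list of (hypothesis_transcript, [reference_entities]).
    """
    total = 0
    missed = 0
    for hypothesis, entities in samples:
        haystack = normalize(hypothesis)
        for entity in entities:
            total += 1
            if normalize(entity) not in haystack:
                missed += 1
    return missed / total if total else 0.0


# Hypothetical transcripts paired with the entities a human labeled.
samples = [
    ("my order number is RX-7704132 thanks", ["RX-7704132"]),
    ("it's dash seven seven zero four one three two", ["RX-7704132"]),
    ("email me at jo42@example.com", ["jo42@example.com"]),
]
print(f"missed entity rate: {missed_entity_rate(samples):.1%}")
```

&lt;p&gt;Substring matching on normalized text is deliberately crude: it ignores spelled-out digits and can match across word boundaries, so spot-check the hits and misses by hand before trusting the number.&lt;/p&gt;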

&lt;h2&gt;
  
  
  &lt;strong&gt;Final words&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The right speech-to-speech voice agent API in 2026 depends less on the marketing material and more on what your agent has to actually hear. If your users are on phones, capturing structured data, or operating in regulated environments, the bar is speech accuracy first, latency second, and pricing predictability third—in that order. The chained-architecture S2S APIs (with AssemblyAI's Voice Agent API as the leading example for accuracy-critical use cases) tend to outperform native speech-to-speech models on real-world telephony, even when the native models look better in studio-audio demos.&lt;/p&gt;

&lt;p&gt;For most teams shipping a production voice agent this year, the AssemblyAI Voice Agent API is the right starting point. One WebSocket, $4.50/hour, Universal-3 Pro Streaming for the parts that matter, and flat-rate pricing you can forecast. Teams that need finer control over the stack can drop our &lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;Streaming Speech-to-Text product&lt;/a&gt; into their existing &lt;a href="https://www.assemblyai.com/solutions/voice-agents" rel="noopener noreferrer"&gt;voice agent orchestrator&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is a speech-to-speech voice agent API?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A speech-to-speech voice agent API is a single API—usually a WebSocket—that accepts a user's audio stream and returns the agent's audio response. It hides the streaming speech-to-text, language model, tool calling, and text-to-speech behind one connection, so developers don't have to manage three separate providers, three API keys, or three latency budgets to ship a voice agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is the best speech-to-speech voice agent API in 2026?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The best speech-to-speech voice agent API in 2026 is AssemblyAI's Voice Agent API for production deployments where speech accuracy matters—it's a single WebSocket built on Universal-3 Pro Streaming with 307ms P50 latency, native phone-audio support, tool calling, and flat $4.50/hour pricing. In our Voice Agent Report, 76% of builders rated transcription accuracy as the most important non-negotiable, and AssemblyAI delivers the lowest alphanumeric missed error rate (16.7%) compared to OpenAI (23.3%) and Deepgram (25.5%). OpenAI Realtime is competitive for browser-first demos. Retell is competitive for phone-first agents prioritizing turn-taking naturalness. The right choice depends on whether your users are on phones, what data the agent has to capture, and how predictable you need pricing to be.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does a speech-to-speech voice agent API differ from chaining STT, LLM, and TTS yourself?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A speech-to-speech voice agent API gives you one API endpoint that takes audio in and returns audio out, with STT, LLM, TTS, turn detection, and tool calling handled inside the API. Chaining the layers yourself gives you full control over each component—choice of LLM, choice of TTS voice, on-prem deployment—but you own the plumbing: the WebSocket bridge, turn detection logic, retry handling, and three separate provider relationships. Most teams in 2026 default to a speech-to-speech voice agent API and only chain when they need a specific LLM, voice, or data residency setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Which speech-to-speech voice agent API is cheapest?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's Voice Agent API at &lt;strong&gt;$4.50/hour&lt;/strong&gt; flat-rate is the most predictable and has one of the lowest unit costs in the category: one bill, no concurrency commitments, and you can model what a 5-minute call costs in one line of math. Per-minute APIs like Retell and ElevenLabs Conversational AI typically land between $0.07 and $0.30 per minute depending on tier, which works out to ~$4.20–$18/hour. Deepgram's voice agent API is also ~$4.50/hour but requires concurrency-metered billing, which changes the economics at scale. OpenAI Realtime runs ~$18/hour with per-token billing across 30+ event types, which is workable at low volume but significantly more expensive and less predictable at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can I use a speech-to-speech voice agent API with Twilio?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Most speech-to-speech voice agent APIs can be bridged to Twilio Voice with a WebSocket server that forwards Twilio's 8kHz mulaw audio into the speech-to-speech API and streams the agent's audio response back as mulaw frames for Twilio to play. The cleanest setup uses an API that accepts mulaw natively at 8kHz—AssemblyAI's Voice Agent API and Universal-3 Pro Streaming both support this without resampling, which saves latency. Some providers like Retell ship a Twilio adapter directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Do speech-to-speech voice agent APIs support multiple languages?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes, but coverage varies widely. AssemblyAI's Voice Agent API launched with 6 streaming languages (English, Spanish, French, German, Italian, Portuguese) with native code-switching, and language coverage is expanding. OpenAI Realtime supports around 50 languages but has hallucination and language-switching issues mid-call. Google Gemini Live covers 30+. If you need a specific language combination, test with real audio in those languages before you commit, because per-language accuracy on real-world phone audio can differ significantly from studio benchmarks.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I evaluate which speech-to-speech voice agent API is best for my use case?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Record 50 representative calls for your use case, run them through each API's playground or trial, and measure four things: word error rate on the entities that matter (phone numbers, confirmation codes, names, emails), end-to-end turn-taking latency, tool call round-trip time, and unit cost at your expected volume. General WER and marketing benchmarks are misleading—the only evaluation that predicts production behavior is the one that uses your audio, your tools, and your scale.&lt;/p&gt;
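
&lt;p&gt;For the turn-taking measurement, a small script is enough to turn hand-marked timestamps into the numbers worth comparing. This is a minimal sketch under stated assumptions: the timestamps are in integer milliseconds from the start of each recording, the nearest-rank percentile is one of several common definitions, and the sample values are made up.&lt;/p&gt;

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct from 0 to 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def turn_latencies_ms(turns):
    """turns: list of (caller_stopped_ms, agent_started_ms) pairs."""
    return [agent_started - caller_stopped for caller_stopped, agent_started in turns]


# Hypothetical marks from one recorded call.
turns = [(1200, 1850), (7400, 8100), (15000, 15620), (22300, 23550)]
latencies = turn_latencies_ms(turns)

print("P50 turn latency (ms):", percentile(latencies, 50))
print("P95 turn latency (ms):", percentile(latencies, 95))
```

&lt;p&gt;Report the tail as well as the median: one slow turn in twenty is what callers remember, and it is the figure to hold against the sub-800ms target.&lt;/p&gt;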

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>comparison</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to build a voice agent with Twilio and AssemblyAI</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 19 May 2026 19:27:30 +0000</pubDate>
      <link>https://dev.to/martschweiger/how-to-build-a-voice-agent-with-twilio-and-assemblyai-439m</link>
      <guid>https://dev.to/martschweiger/how-to-build-a-voice-agent-with-twilio-and-assemblyai-439m</guid>
      <description>&lt;p&gt;Building a voice agent on Twilio with AssemblyAI takes one WebSocket server that bridges Twilio Voice Media Streams into Universal-3 Pro Streaming, your LLM of choice, and a text-to-speech model — all under an 800ms turn budget. This tutorial walks through every piece: the TwiML to open the audio stream, the FastAPI WebSocket bridge that handles 8kHz mulaw audio in both directions, the LLM loop with tool calling, and the deployment considerations that decide whether your agent feels human or obviously robotic on a real phone call.&lt;/p&gt;

&lt;p&gt;By the end of this guide, you'll have a working inbound phone-based voice agent that answers a Twilio number, transcribes the caller in real time, calls tools (order lookup, callback scheduling, human transfer), and speaks back — all with code you can fork and ship today. The full repository is at the end of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Twilio + AssemblyAI works for phone-based voice agents&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Twilio is the most common telephony layer for &lt;a href="https://www.assemblyai.com/solutions/voice-agents" rel="noopener noreferrer"&gt;voice agents&lt;/a&gt; because it handles the PSTN connection, gives you a phone number in minutes, and exposes the call audio as a Media Stream you can bridge into your own backend over a WebSocket. The audio comes in at 8kHz mulaw — the standard telephony format, not the 16kHz PCM most audio tools assume.&lt;/p&gt;

&lt;p&gt;AssemblyAI's &lt;a href="https://www.assemblyai.com/universal-3-pro-streaming" rel="noopener noreferrer"&gt;Universal-3 Pro Streaming&lt;/a&gt; model is built specifically for this. It accepts pcm_mulaw at sample_rate=8000 natively, so you don't pay the round-trip latency cost of resampling phone audio into 16kHz PCM and back. Combined with 307ms P50 latency, immutable transcripts, and 21% fewer alphanumeric errors than the previous generation of streaming speech-to-text models, it's the speech-to-text layer that decides whether your agent captures a confirmation code on the first try or makes the caller repeat it.&lt;/p&gt;

&lt;p&gt;The architecture is straightforward:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Caller's phone
       │
   Twilio Voice (PSTN)
       │  TwiML → open WebSocket
       ▼
  Your FastAPI server (this tutorial)
   ┌────┴────┐
   ▼         ▲
 AssemblyAI    ElevenLabs TTS
 Universal-3   (ulaw_8000 output)
 Pro Streaming
   │             ▲
   │ transcript  │ audio
   ▼             │
   GPT-4o + tool calling
     │
     └─► action + spoken reply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Audio flows in two directions continuously. Twilio sends inbound audio (caller → your server → AssemblyAI). Your server generates an LLM response, runs it through ElevenLabs, and streams the synthesized audio back to Twilio as mulaw frames. On the Twilio side, all of it stays inside one WebSocket per call.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Before you start&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI account&lt;/a&gt; with API key access to Universal-3 Pro Streaming&lt;/li&gt;
&lt;li&gt;A Twilio account with a Voice-enabled phone number&lt;/li&gt;
&lt;li&gt;An OpenAI API key (or another LLM provider)&lt;/li&gt;
&lt;li&gt;An ElevenLabs API key (or another streaming TTS provider with mulaw output)&lt;/li&gt;
&lt;li&gt;Python 3.11+&lt;/li&gt;
&lt;li&gt;ngrok for exposing your local server to Twilio during development&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install the dependencies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install fastapi uvicorn websockets python-dotenv openai elevenlabs twilio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Configure the Twilio TwiML for an inbound call&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When someone calls your Twilio number, Twilio fetches a TwiML document from your server and uses it to decide what to do with the call. To stream the call audio to your WebSocket, you return TwiML with a &amp;lt;Connect&amp;gt;&amp;lt;Stream&amp;gt; block:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# server.py
from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()

@app.post("/twilio/voice")
async def twilio_voice(request: Request):
    host = request.url.hostname
    twiml = f"""&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;Response&amp;gt;
  &amp;lt;Connect&amp;gt;
    &amp;lt;Stream url="wss://{host}/media-stream" /&amp;gt;
  &amp;lt;/Connect&amp;gt;
&amp;lt;/Response&amp;gt;"""
    return Response(content=twiml, media_type="application/xml")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the Twilio console, set the phone number's voice webhook to POST &lt;a href="https://your-host/twilio/voice" rel="noopener noreferrer"&gt;https://your-host/twilio/voice&lt;/a&gt;. When a call comes in, Twilio will hit this endpoint, parse the TwiML, and open a WebSocket to /media-stream that carries the call audio.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Bridge Twilio Media Streams to Universal-3 Pro Streaming&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the core of the agent. The WebSocket handler receives Twilio's audio frames, forwards them to AssemblyAI, listens for transcripts, and routes them into the LLM loop.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# server.py (continued)
import asyncio
import base64
import json
import os
import websockets
from fastapi import WebSocket

ASSEMBLY_WS = "wss://streaming.assemblyai.com/v3/ws"

@app.websocket("/media-stream")
async def media_stream(twilio_ws: WebSocket):
    await twilio_ws.accept()
    stream_sid = None

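    # Open AssemblyAI streaming session — note: pcm_mulaw, 8kHz
    aai_url = (
        f"{ASSEMBLY_WS}"
        f"?speech_model=u3-rt-pro"
        f"&amp;amp;encoding=pcm_mulaw"
        f"&amp;amp;sample_rate=8000"
        f"&amp;amp;format_turns=true"
    )
    # websockets 14+ renamed this keyword to additional_headers;
    # extra_headers applies to older releases and the legacy client.
    aai_ws = await websockets.connect(
        aai_url,
        extra_headers={"Authorization": os.environ["ASSEMBLYAI_API_KEY"]},
    )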

    async def pump_twilio_to_aai():
        nonlocal stream_sid
        async for raw in twilio_ws.iter_text():
            event = json.loads(raw)
            if event["event"] == "start":
                stream_sid = event["start"]["streamSid"]
            elif event["event"] == "media":
                audio_b64 = event["media"]["payload"]
                # Twilio sends base64-encoded mulaw. AssemblyAI accepts raw bytes.
                await aai_ws.send(base64.b64decode(audio_b64))
            elif event["event"] == "stop":
                await aai_ws.close()
                return

    async def pump_aai_to_llm():
        async for message in aai_ws:
            data = json.loads(message)
            if data.get("type") == "Turn" and data.get("end_of_turn"):
                transcript = data.get("transcript", "").strip()
                if transcript:
                    await handle_user_turn(transcript, twilio_ws, stream_sid)

    await asyncio.gather(pump_twilio_to_aai(), pump_aai_to_llm())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The critical settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;speech_model=u3-rt-pro selects Universal-3 Pro Streaming&lt;/li&gt;
&lt;li&gt;encoding=pcm_mulaw and sample_rate=8000 tell AssemblyAI to expect raw mulaw without resampling&lt;/li&gt;
&lt;li&gt;format_turns=true gives you properly cased and punctuated transcripts ready for the LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When end_of_turn is true, the caller has finished speaking and you have a complete utterance to send to the LLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Run the LLM loop with tool calling&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;handle_user_turn is where the conversation logic lives. It takes the transcript, sends it to the LLM with the available tools, and either calls a tool or responds with text that becomes the agent's spoken reply.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# server.py (continued)
from openai import AsyncOpenAI

openai = AsyncOpenAI()

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the status of a customer order by order ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string", "description": "e.g. AB3792"}
                },
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "transfer_to_human",
            "description": "Transfer the caller to a human agent.",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {"type": "string"}
                },
                "required": ["reason"],
            },
        },
    },
]

conversation = [
    {
        "role": "system",
        "content": (
            "You are a friendly phone-based voice agent for a shoe retailer. "
            "Keep replies short — one or two sentences. "
            "Use get_order_status to look up orders. "
            "Use transfer_to_human if the caller asks for a person or is upset."
        ),
    }
]

async def handle_user_turn(transcript, twilio_ws, stream_sid):
    conversation.append({"role": "user", "content": transcript})
    response = await openai.chat.completions.create(
        model="gpt-4o",
        messages=conversation,
        tools=TOOLS,
        tool_choice="auto",
    )
    msg = response.choices[0].message

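    if msg.tool_calls:
        # Record the assistant's tool-call message so each tool result has a parent.
        conversation.append(msg.model_dump())
        for call in msg.tool_calls:
            # call.function.arguments is a JSON string; dispatch_tool must parse it.
            result = await dispatch_tool(call.function.name, call.function.arguments)
            conversation.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
        # Second pass: let the model turn the tool results into a spoken reply.
        followup = await openai.chat.completions.create(
            model="gpt-4o", messages=conversation
        )
        reply = followup.choices[0].message.content
    else:
        reply = msg.content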

    conversation.append({"role": "assistant", "content": reply})
    await speak(reply, twilio_ws, stream_sid)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The tool dispatcher is where your business logic lives. For a real deployment, replace the stubs with calls to your CRM, order management system, or scheduling backend.&lt;/p&gt;
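
&lt;p&gt;The dispatcher itself is not shown above, so here is a minimal sketch of what dispatch_tool could look like. Everything in it is an assumption made for illustration: the in-memory FAKE_ORDERS table, the two stub handlers, and the choice to return errors as JSON so the model can recover in its follow-up reply. The only contract taken from the code above is that it receives the tool name plus a JSON string of arguments and returns a string.&lt;/p&gt;

```python
import asyncio
import json

# Hypothetical in-memory data standing in for a real order system.
FAKE_ORDERS = {"AB3792": "shipped, arriving Thursday"}


async def get_order_status(order_id):
    # Normalize what the caller said ("ab-3792", "AB 3792") before the lookup.
    key = order_id.upper().replace("-", "").replace(" ", "")
    status = FAKE_ORDERS.get(key)
    if status is None:
        return {"found": False}
    return {"found": True, "status": status}


async def transfer_to_human(reason):
    # A real implementation would redirect the live call; this only acknowledges.
    return {"transferring": True, "reason": reason}


HANDLERS = {
    "get_order_status": get_order_status,
    "transfer_to_human": transfer_to_human,
}


async def dispatch_tool(name, arguments):
    """Run the named tool and return a JSON string for the role="tool" message."""
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments or "{}")
        result = await handler(**kwargs)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or parameters that do not match the handler signature.
        return json.dumps({"error": str(exc)})
    return json.dumps(result)


if __name__ == "__main__":
    print(asyncio.run(dispatch_tool("get_order_status", '{"order_id": "ab-3792"}')))
```

&lt;p&gt;Returning an error payload instead of raising keeps the call alive when the model sends bad parameters. Note that catching TypeError this broadly can also hide bugs inside a handler, so a production version would validate arguments against the tool schema first.&lt;/p&gt;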

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Stream the TTS audio back to Twilio as mulaw&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Twilio expects audio frames as base64-encoded mulaw at 8kHz. ElevenLabs supports a ulaw_8000 output format that produces exactly this — which means no resampling, no conversion, just stream the bytes back.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# server.py (continued)
from elevenlabs.client import AsyncElevenLabs

eleven = AsyncElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

async def speak(text, twilio_ws, stream_sid):
    audio_stream = eleven.text_to_speech.stream(
        voice_id=os.environ.get("ELEVENLABS_VOICE_ID", "EXAVITQu4vr4xnSDxMaL"),
        text=text,
        model_id="eleven_turbo_v2_5",
        output_format="ulaw_8000",
    )
    async for chunk in audio_stream:
        payload = base64.b64encode(chunk).decode()
        await twilio_ws.send_text(json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": payload},
        }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each chunk gets streamed to Twilio as a media event. Twilio plays the audio to the caller as it arrives, which means the caller hears the first word of the agent's reply while the rest is still being synthesized.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Run it and connect Twilio&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Start your server and expose it through ngrok:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uvicorn server:app --port 8000
ngrok http 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Copy the https://*.ngrok-free.dev URL ngrok prints. In the Twilio console:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Buy or pick a Voice-enabled phone number&lt;/li&gt;
&lt;li&gt;Open the number's configuration&lt;/li&gt;
&lt;li&gt;Under "A call comes in," set the webhook to &lt;a href="https://your-ngrok-url/twilio/voice" rel="noopener noreferrer"&gt;https://your-ngrok-url/twilio/voice&lt;/a&gt; with method POST&lt;/li&gt;
&lt;li&gt;Save&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Call the number from your phone. You should hear the agent pick up and respond in natural conversation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Latency budget: where your milliseconds go&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A natural-feeling phone agent answers in under 800ms from when the caller stops speaking to when the caller hears the first audio of the reply. Here's where that budget gets spent on a Twilio + AssemblyAI stack:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Typical latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AssemblyAI end-of-turn finalization&lt;/td&gt;
&lt;td&gt;~150–250ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM first-token generation (GPT-4o)&lt;/td&gt;
&lt;td&gt;~200–400ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS first-byte (ElevenLabs streaming)&lt;/td&gt;
&lt;td&gt;~200–400ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Twilio round-trip&lt;/td&gt;
&lt;td&gt;~50–100ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total perceived latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~600–1100ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three things blow the budget the moment you stop being careful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resampling audio.&lt;/strong&gt; Anything that converts 8kHz mulaw to 16kHz PCM (and back) costs 50–150ms each way. AssemblyAI's Universal-3 Pro Streaming and ElevenLabs's ulaw_8000 output both keep audio in mulaw end-to-end.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-streaming LLMs.&lt;/strong&gt; Waiting for the full response before TTS starts is a guaranteed dead zone. Stream tokens from the LLM and chunk them to TTS sentence-by-sentence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold-start tools.&lt;/strong&gt; A tool call that hits a slow database eats your entire turn. Cache hot data and time out slow lookups aggressively.&lt;/li&gt;
&lt;/ul&gt;
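
&lt;p&gt;To make the sentence-by-sentence handoff concrete, here is a minimal sketch of a chunker that buffers streamed LLM tokens and releases each sentence as soon as it is complete. The names chunk_sentences and next_sentence_end are illustrative and not part of any SDK, and the rule for what ends a sentence (punctuation followed by whitespace) is an assumption you may want to tune for your prompts.&lt;/p&gt;

```python
SENTENCE_ENDINGS = (".", "!", "?")


def next_sentence_end(text):
    """Return the index just past the first sentence-ending mark that is
    followed by whitespace, or None if no complete sentence is buffered."""
    for i, ch in enumerate(text[:-1]):
        if ch in SENTENCE_ENDINGS and text[i + 1].isspace():
            return i + 1
    return None


def chunk_sentences(tokens):
    """Yield complete sentences from an iterable of streamed text tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        # A single token can complete more than one sentence, so keep cutting.
        while True:
            cut = next_sentence_end(buffer)
            if cut is None:
                break
            sentence, buffer = buffer[:cut].strip(), buffer[cut:]
            if sentence:
                yield sentence
    # Flush whatever is left once the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    stream = ["Your order", " shipped", ". It arrives", " Friday", "! Anything", " else?"]
    for sentence in chunk_sentences(stream):
        print(sentence)
```

&lt;p&gt;Each yielded sentence can be handed to speak() right away, so synthesis of the first sentence starts while the LLM is still generating the rest. Requiring whitespace after the punctuation keeps amounts like $84.99 in one piece.&lt;/p&gt;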

&lt;h2&gt;
  
  
  &lt;strong&gt;What about the AssemblyAI Voice Agent API?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If your voice agent doesn't need Twilio specifically — for example a browser-based assistant, a mobile app, or an embedded device — the &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;Voice Agent API&lt;/a&gt; wraps STT, LLM, TTS, turn detection, and tool calling behind a single WebSocket at a flat $4.50/hour (&lt;a href="https://www.assemblyai.com/blog/introducing-our-voice-agent-api" rel="noopener noreferrer"&gt;announcement&lt;/a&gt;). You skip the three-provider plumbing entirely.&lt;/p&gt;

&lt;p&gt;For Twilio-bridged phone calls today, the chained architecture in this tutorial is still the most flexible path — it lets you pick exactly the LLM, TTS voice, and tool definitions you want. The Voice Agent API is the right choice for everything that isn't a PSTN inbound call, and Twilio integration through the Voice Agent API is on the roadmap.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The complete repository&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Fork the runnable repo at &lt;a href="https://github.com/kelsey-aai/twilio-voice-agent-assemblyai" rel="noopener noreferrer"&gt;github.com/kelsey-aai/twilio-voice-agent-assemblyai&lt;/a&gt;. It includes the FastAPI server, tool dispatcher, sample tools (get_order_status, transfer_to_human), a .env.example, and ngrok setup instructions. Total length: ~250 lines of Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I build a voice agent with Twilio and AssemblyAI?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To build a voice agent with Twilio and AssemblyAI, point your Twilio phone number at a TwiML endpoint that opens a Media Stream to your server's WebSocket. In the WebSocket handler, forward Twilio's 8kHz mulaw audio frames to AssemblyAI's Universal-3 Pro Streaming API using encoding=pcm_mulaw and sample_rate=8000. When AssemblyAI returns a finalized turn, pass the transcript to an LLM (GPT-4o, Claude) with your tool definitions (see our &lt;a href="https://www.assemblyai.com/blog/build-voice-agent-function-calling" rel="noopener noreferrer"&gt;function calling tutorial&lt;/a&gt; for a deeper walkthrough), then stream the LLM's reply through a TTS model that supports ulaw_8000 output (like ElevenLabs) back to Twilio as base64-encoded media events.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why use AssemblyAI for a Twilio voice agent?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's Universal-3 Pro Streaming model is built for the audio Twilio actually sends, 8kHz mulaw, so it needs no resampling and avoids the latency that resampling adds. It delivers 307ms P50 latency, immutable transcripts your downstream LLM can trust, and 21% fewer alphanumeric errors than the previous generation, which matters when the agent is capturing confirmation codes, phone numbers, or email addresses over a phone line. For an overview of the broader category, see &lt;a href="https://www.assemblyai.com/blog/ai-voice-agents" rel="noopener noreferrer"&gt;AI voice agents in 2026&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Does the Voice Agent API work with Twilio?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The AssemblyAI Voice Agent API is the simplest path for voice agents that don't need Twilio specifically — a single WebSocket replaces STT, LLM, and TTS at $4.50/hour. Native Twilio integration through the Voice Agent API is on the roadmap. Today, the chained architecture in this tutorial (Universal-3 Pro Streaming + your LLM + your TTS, bridged through a Twilio Media Streams WebSocket) is the standard path for Twilio-based phone agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What latency should I expect from a Twilio voice agent?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A well-tuned Twilio voice agent built on AssemblyAI Universal-3 Pro Streaming, GPT-4o, and ElevenLabs typically hits 600–1100ms from caller-stops-talking to caller-hears-reply. The biggest latency killers are resampling audio (use native mulaw end-to-end), non-streaming LLM responses (stream tokens), and slow tool calls (cache hot data and time out slow lookups aggressively).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How much does it cost to run a phone-based voice agent?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The cost breaks down across four components: Twilio voice (per-minute, varies by country), AssemblyAI Universal-3 Pro Streaming ($0.15/hour of session time), the LLM (varies by provider — typically a few cents per minute of conversation for GPT-4o), and TTS (per-character or per-minute). End-to-end you're looking at a few cents per minute at scale, with the exact number driven by which LLM and TTS you choose.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can a Twilio voice agent handle multiple simultaneous calls?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes. AssemblyAI's Universal-3 Pro Streaming supports unlimited concurrent streams at a flat $0.15/hour with no separate negotiation required. Twilio handles concurrency per-account based on your plan. The constraint at scale is usually your own server's WebSocket concurrency limits — FastAPI with uvicorn workers handles hundreds of concurrent calls comfortably on modest hardware.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>telephony</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Build an AI voice agent for customer support that can look up orders</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 19 May 2026 19:27:21 +0000</pubDate>
      <link>https://dev.to/martschweiger/build-an-ai-voice-agent-for-customer-support-that-can-look-up-orders-4nlj</link>
      <guid>https://dev.to/martschweiger/build-an-ai-voice-agent-for-customer-support-that-can-look-up-orders-4nlj</guid>
      <description>&lt;p&gt;Tier-1 customer support is mostly the same five conversations on repeat: where's my order, can I change my address, can I get a refund, when does this ship, can I talk to a human. They're predictable, they're high-volume, and they don't need a person — they need a voice agent that can actually look things up.&lt;/p&gt;

&lt;p&gt;This tutorial walks you through building one. By the end, you'll have a Python voice agent that answers calls, listens for an order ID or email, calls into your backend to check the status, and reads the result back to the customer in real time. When something goes off-script, it transfers to a human with the full conversation context attached.&lt;/p&gt;

&lt;p&gt;We're using &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;AssemblyAI's Voice Agent API&lt;/a&gt; — one WebSocket that handles the speech understanding, LLM reasoning, voice generation, turn detection, and tool calling in a single connection. Total time to a working prototype: about an afternoon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why most support voice agents fail
&lt;/h2&gt;

&lt;p&gt;Before we build, it's worth knowing where these things break. The pattern is almost always the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Customer says "my order ID is A-B-3-7-9-2"&lt;/li&gt;
&lt;li&gt;STT mishears it as "a b 37 92" or "ABE 379 to"&lt;/li&gt;
&lt;li&gt;The LLM calls get_order_status("ab3792") or worse, asks the customer to repeat&lt;/li&gt;
&lt;li&gt;Customer hangs up&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agent didn't fail because the LLM was wrong. It failed because the speech-to-text layer couldn't capture the entity correctly. This is why entity accuracy on alphanumerics, emails, and phone numbers matters more than overall WER for support agents — and why we're building on Universal-3 Pro Streaming, which has a 16.7% mixed-entity error rate vs. 23–25% for competing models.&lt;/p&gt;

&lt;p&gt;The second-most-common failure: dead air during tool calls. The customer asks a question, the agent calls a backend, and there's a 2–3 second silence while the lookup runs. The Voice Agent API solves this by speaking a natural transition phrase ("let me check that for you") while the tool runs — no dead air, no awkward pauses.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you'll build
&lt;/h2&gt;

&lt;p&gt;A Python voice support agent that handles three real workflows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Order status lookup&lt;/strong&gt; — customer says "where's my order?" → agent asks for the ID → looks it up → reads back status, ETA, tracking number&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer info verification&lt;/strong&gt; — customer provides email or phone number → agent looks up the account → confirms identity before proceeding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human escalation&lt;/strong&gt; — customer asks for a person, or the agent gets stuck → graceful transfer with conversation context preserved&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AssemblyAI Voice Agent API (one WebSocket: STT + LLM + TTS)&lt;/li&gt;
&lt;li&gt;Python 3.9+&lt;/li&gt;
&lt;li&gt;A backend with order data — we'll mock it; replace with your real CRM or order management system&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install "websockets&amp;gt;=14" pyaudio python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Create .env:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ASSEMBLYAI_API_KEY=your_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Voice Agent API uses a single endpoint: wss://agents.assemblyai.com/v1/ws. One key, one connection, no separate STT or TTS providers to wire in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Define the support tools
&lt;/h2&gt;

&lt;p&gt;Tools are the agent's interface to your backend. The Voice Agent API uses standard JSON Schema, so anything you can describe with a schema, the agent can call.&lt;/p&gt;

&lt;p&gt;For a support agent, you typically want four tools:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

TOOLS = [
    {
        "type": "function",
        "name": "get_order_status",
        "description": "Look up an order's current status, shipping ETA, and tracking number by order ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The customer's order ID, e.g. ORD-12345 or 78231-ABC.",
                },
            },
            "required": ["order_id"],
        },
    },
    {
        "type": "function",
        "name": "lookup_account_by_email",
        "description": "Find a customer account using their email address.",
        "parameters": {
            "type": "object",
            "properties": {
                "email": {"type": "string", "description": "The customer's email address."},
            },
            "required": ["email"],
        },
    },
    {
        "type": "function",
        "name": "list_recent_orders",
        "description": "List the customer's most recent orders. Use after the account is verified.",
        "parameters": {
            "type": "object",
            "properties": {
                "account_id": {"type": "string"},
                "limit": {"type": "integer", "description": "Max number of orders to return.", "default": 5},
            },
            "required": ["account_id"],
        },
    },
    {
        "type": "function",
        "name": "transfer_to_human",
        "description": "Transfer the call to a human agent. Use when the customer asks, when you can't help, or when the issue is sensitive.",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {"type": "string", "description": "Short reason for the transfer."},
                "summary": {"type": "string", "description": "Brief summary of the conversation so far."},
            },
            "required": ["reason", "summary"],
        },
    },
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now implement the actual functions. Replace these stubs with calls to your real backend:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ORDERS_DB = {
    "ORD-12345": {"status": "shipped", "eta": "2026-05-09", "tracking": "1Z999AA10123456784"},
    "ORD-67890": {"status": "processing", "eta": "2026-05-12", "tracking": None},
}

ACCOUNTS_DB = {
    "jane@example.com": {"account_id": "ACC-001", "name": "Jane Doe"},
}

ACCOUNT_ORDERS = {
    "ACC-001": [
        {"order_id": "ORD-12345", "date": "2026-05-01", "total": "$84.99"},
        {"order_id": "ORD-12100", "date": "2026-04-22", "total": "$42.00"},
    ],
}

def run_tool(name: str, args: dict) -&amp;gt; dict:
    if name == "get_order_status":
        order = ORDERS_DB.get(args["order_id"].upper())
        if not order:
            return {"error": "order_not_found", "order_id": args["order_id"]}
        return order

    if name == "lookup_account_by_email":
        account = ACCOUNTS_DB.get(args["email"].lower())
        if not account:
            return {"error": "account_not_found"}
        return account

    if name == "list_recent_orders":
        orders = ACCOUNT_ORDERS.get(args["account_id"], [])
        return {"orders": orders[: args.get("limit", 5)]}

    if name == "transfer_to_human":
        # In production: trigger your call routing / queue handoff here
        return {"transferred": True, "queue": "support-tier-2"}

    return {"error": f"unknown_tool: {name}"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The error-shape pattern matters. When get_order_status can't find an order, it returns a structured error rather than throwing — that gives the LLM the context it needs to apologize and ask the customer to verify the ID, instead of crashing the conversation.&lt;/p&gt;
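
&lt;p&gt;The same idea extends to backends that raise instead of returning None. The sketch below is one possible wrapper, not part of the Voice Agent API: find_order is a hypothetical stand-in for a real backend call, and the error codes are made up for illustration.&lt;/p&gt;

```python
def find_order(order_id):
    """Hypothetical backend call that raises KeyError for unknown IDs."""
    orders = {"ORD-12345": {"status": "shipped"}}
    return orders[order_id]


def safe_tool_call(func, **kwargs):
    """Run a tool and turn any exception into a structured error dict."""
    try:
        return func(**kwargs)
    except KeyError:
        # A missing record becomes data the LLM can react to.
        return {"error": "not_found", "args": kwargs}
    except Exception as exc:
        # Last-resort guard so a backend bug never crashes the call.
        return {"error": "tool_failed", "detail": str(exc)}


if __name__ == "__main__":
    print(safe_tool_call(find_order, order_id="ORD-12345"))
    print(safe_tool_call(find_order, order_id="ORD-00000"))
```

&lt;p&gt;Because every outcome is a plain dict, the result can always be serialized and sent back as a tool result, and the LLM decides how to phrase the apology.&lt;/p&gt;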

&lt;h2&gt;
  
  
  Step 2: Write the system prompt
&lt;/h2&gt;

&lt;p&gt;The system prompt is where you encode the agent's behavior. For support, you want a few things every time: identity and tone, when to ask for verification before sharing details, when to use which tool, when to transfer to a human, and specific phrasing for transition moments (the "let me check that" line).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SYSTEM_PROMPT = """
You are Avery, a customer support agent for Acme Corp. Your goal is to help
customers quickly and accurately. You have access to tools that let you look up
orders and accounts.

Behavior rules:
- Greet warmly and ask how you can help.
- For order questions, ask for the order ID first if the customer hasn't given it.
- If a customer gives an email address, use lookup_account_by_email to verify
  the account.
- Read order status, ETA, and tracking number clearly. Don't read raw timestamps —
  say dates naturally (e.g., "Friday, May 9th").
- When you need to call a tool, say a brief transition like "Let me check on that"
  or "One moment while I pull that up."
- If the customer asks for a human, sounds frustrated, or has a complex issue
  (refund disputes, damaged product, billing errors), use transfer_to_human and
  include a short summary.
- Never make up an order ID, status, or tracking number. If a tool returns an
  error, apologize, ask the customer to verify the ID, and try again.
- Keep replies short and conversational. This is a phone call, not an email.
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The "never make up" line is the most important sentence in the prompt. Without it, LLMs sometimes invent plausible-sounding tracking numbers when the lookup fails. With it, they ask for clarification instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Connect to the Voice Agent API
&lt;/h2&gt;

&lt;p&gt;Now the WebSocket connection. The pattern is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open wss://agents.assemblyai.com/v1/ws with your API key&lt;/li&gt;
&lt;li&gt;Send session.update with the system prompt, tools, voice, and greeting&lt;/li&gt;
&lt;li&gt;Wait for session.ready, then start streaming microphone audio&lt;/li&gt;
&lt;li&gt;Handle incoming events: tool.call, reply.audio, transcript.user, reply.done&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio
import base64
import json
import os

import pyaudio
import websockets
from dotenv import load_dotenv

load_dotenv()  # read ASSEMBLYAI_API_KEY from the .env file created in Setup

API_KEY = os.getenv("ASSEMBLYAI_API_KEY")
WS_URL = "wss://agents.assemblyai.com/v1/ws"
SAMPLE_RATE = 24000

async def run_agent():
    async with websockets.connect(
        WS_URL,
        additional_headers={"Authorization": f"Bearer {API_KEY}"},
    ) as ws:
        # Configure the agent
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "system_prompt": SYSTEM_PROMPT,
                "greeting": "Hi, this is Avery from Acme support. How can I help?",
                "output": {"voice": "ivy"},
                "tools": TOOLS,
            },
        }))

        # Set up microphone capture and speaker playback
        pa = pyaudio.PyAudio()
        mic = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                      input=True, frames_per_buffer=1024)
        speaker = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                          output=True)

        ready = asyncio.Event()
        pending_tools = []

        async def send_audio():
            # Hold microphone audio until the server confirms the session
            await ready.wait()
            while True:
                # mic.read blocks, so run it in a worker thread to keep the
                # event loop free for incoming messages
                audio = await asyncio.to_thread(
                    mic.read, 1024, exception_on_overflow=False
                )
                await ws.send(json.dumps({
                    "type": "input.audio",
                    "audio": base64.b64encode(audio).decode(),
                }))

        async def handle_messages():
            async for raw in ws:
                event = json.loads(raw)
                t = event.get("type")

                if t == "session.ready":
                    ready.set()
                    print("Agent ready. Start speaking.")

                elif t == "transcript.user":
                    print(f"\nUser: {event['text']}")

                elif t == "transcript.agent":
                    print(f"Agent: {event['text']}")

                elif t == "reply.audio":
                    speaker.write(base64.b64decode(event["data"]))

                elif t == "tool.call":
                    name = event["name"]
                    args = event.get("arguments", {})
                    print(f"  [tool] {name}({args})")
                    # Run the tool now, but hold the result until reply.done
                    result = run_tool(name, args)
                    pending_tools.append({"call_id": event["call_id"], "result": result})

                elif t == "reply.done":
                    if event.get("status") == "interrupted":
                        pending_tools.clear()
                    elif pending_tools:
                        for tool in pending_tools:
                            await ws.send(json.dumps({
                                "type": "tool.result",
                                "call_id": tool["call_id"],
                                "result": json.dumps(tool["result"]),
                            }))
                        pending_tools.clear()

        await asyncio.gather(send_audio(), handle_messages())

if __name__ == "__main__":
    asyncio.run(run_agent())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A few details that the docs flag and you'd otherwise debug for an hour:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't send tool.result immediately&lt;/strong&gt; when you receive tool.call. Accumulate results and send them inside the reply.done handler. Sending too early causes timing issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discard pending tool results on interruption.&lt;/strong&gt; If the user speaks while the agent is generating a transition phrase, you'll get reply.done with status: "interrupted" — clear the buffer and wait for the next turn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice names are case-sensitive.&lt;/strong&gt; Use lowercase: ivy, claire, dawn. An invalid voice returns session.error.&lt;/li&gt;
&lt;/ul&gt;
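
&lt;p&gt;The first two rules can be isolated in a small helper, which makes them easy to unit-test without a live connection. This is a sketch of the same buffering logic used in handle_messages above. ToolResultBuffer is a hypothetical class name, and the event fields mirror the ones already shown in this tutorial.&lt;/p&gt;

```python
import json


class ToolResultBuffer:
    """Collects tool results and releases them only when a reply finishes."""

    def __init__(self):
        self._pending = []

    def add(self, call_id, result):
        """Store a finished tool result until the next reply.done event."""
        self._pending.append({"call_id": call_id, "result": result})

    def on_reply_done(self, event):
        """Return the tool.result messages to send for this reply.done event."""
        pending, self._pending = self._pending, []
        if event.get("status") == "interrupted":
            # The user spoke over the agent, so stale results are dropped.
            return []
        return [
            {
                "type": "tool.result",
                "call_id": item["call_id"],
                "result": json.dumps(item["result"]),
            }
            for item in pending
        ]


if __name__ == "__main__":
    buffer = ToolResultBuffer()
    buffer.add("call-1", {"status": "shipped"})
    print(buffer.on_reply_done({"type": "reply.done"}))
```

&lt;p&gt;In the message loop, tool.call would call add() and reply.done would send every message returned by on_reply_done().&lt;/p&gt;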

&lt;h2&gt;
  
  
  Step 4: Test the three workflows
&lt;/h2&gt;

&lt;p&gt;Run the script and walk through each support scenario. You should hear:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow 1 — Order lookup:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: "Hi, I'm trying to check on order O-R-D 1-2-3-4-5"
Agent: "Sure, let me check on that... I see order ORD-12345. It shipped and is
        on its way — you should have it by Friday, May 9th. The tracking number
        is 1Z999AA10123456784."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Workflow 2 — Email-based account lookup:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: "I forgot my order ID. Can you look me up by email?"
Agent: "Of course. What's the email on the account?"
You: "It's jane at example dot com."
Agent: "One moment... Got it, you're Jane Doe. I see two recent orders:
        ORD-12345 from May 1st for $84.99, and ORD-12100 from April 22nd
        for $42.00. Which one are you asking about?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Workflow 3 — Human transfer:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You: "I just want to talk to a person."
Agent: "I understand. Let me get you over to a teammate now."
[tool.call: transfer_to_human({"reason": "user requested human", "summary": "..."})]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Speak the order ID with hesitation, mumbles, accents, and natural disfluencies — that's where Universal-3 Pro Streaming earns its keep. The agent should still extract the ID correctly because it's tuned for the alphanumeric tokens that voice agents act on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Take it to the phone
&lt;/h2&gt;

&lt;p&gt;The script above runs locally through your computer's microphone, but real customer support runs on phones. Twilio Media Streams is the standard bridge: your server accepts the inbound call from Twilio and opens a parallel connection to the Voice Agent API, forwarding audio in both directions.&lt;/p&gt;

&lt;p&gt;The Voice Agent API supports audio/pcmu (G.711 μ-law at 8 kHz) natively, which matches Twilio's codec exactly. No transcoding, no resampling. The &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/connect-to-twilio" rel="noopener noreferrer"&gt;Twilio integration guide&lt;/a&gt; walks through the full bridge in about 100 lines of TypeScript.&lt;/p&gt;
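
&lt;p&gt;As a rough Python illustration of why no transcoding is needed, the sketch below maps an inbound Twilio Media Streams message onto the input.audio event used earlier in this tutorial. It assumes the session has already been configured for audio/pcmu as described in the integration guide (that configuration is not shown here), and twilio_to_agent and frame_duration_ms are hypothetical helper names.&lt;/p&gt;

```python
import base64
import json

MULAW_BYTES_PER_SECOND = 8000  # G.711 mu-law: 8,000 one-byte samples per second


def twilio_to_agent(raw_message):
    """Map a Twilio Media Streams message to an input.audio event, or None."""
    message = json.loads(raw_message)
    if message.get("event") != "media":
        return None  # connected, start, stop and mark events carry no audio
    # Both sides carry base64 audio, so the payload is forwarded untouched.
    return {"type": "input.audio", "audio": message["media"]["payload"]}


def frame_duration_ms(payload_b64):
    """Duration of a base64 mu-law payload in milliseconds."""
    return len(base64.b64decode(payload_b64)) * 1000 / MULAW_BYTES_PER_SECOND


if __name__ == "__main__":
    payload = base64.b64encode(bytes(160)).decode()
    frame = json.dumps({"event": "media", "media": {"payload": payload}})
    print(twilio_to_agent(frame)["type"], frame_duration_ms(payload))
```

&lt;p&gt;The payload never leaves its base64 mu-law form, so the bridge only re-wraps JSON and adds no audio processing to the latency budget.&lt;/p&gt;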

&lt;h2&gt;
  
  
  What to harden before production
&lt;/h2&gt;

&lt;p&gt;Three things you'll want to nail down before pointing this at real customers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replace the in-memory mocks&lt;/strong&gt; with calls to your actual CRM or order management system. Add timeouts and error handling so a slow backend doesn't kill the conversation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log everything.&lt;/strong&gt; Save user transcripts, tool calls, results, and the agent's responses tied to a session ID. Conversation logs are your debugging tool when something goes wrong on call #4,712.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tune turn detection for your acoustic environment.&lt;/strong&gt; The defaults work for most use cases. For phone audio with background noise, you may want to raise min_end_of_turn_silence_ms slightly so the agent doesn't cut off thoughtful pauses.&lt;/li&gt;
&lt;/ul&gt;
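
&lt;p&gt;For the timeout advice in the first item, one common approach is asyncio.wait_for, sketched below. It assumes your backend client is async. slow_lookup and fast_lookup are hypothetical stand-ins for real CRM calls, and the 2-second default is an arbitrary starting point, not a recommendation from the API docs.&lt;/p&gt;

```python
import asyncio


async def call_with_timeout(lookup, timeout_s=2.0):
    """Await a backend lookup, returning a structured error if it is too slow."""
    try:
        return await asyncio.wait_for(lookup, timeout=timeout_s)
    except asyncio.TimeoutError:
        # Same error-shape pattern as run_tool: data, not an exception.
        return {"error": "backend_timeout", "timeout_s": timeout_s}


async def slow_lookup():
    """Hypothetical stand-in for a slow CRM call."""
    await asyncio.sleep(0.2)
    return {"status": "shipped"}


async def fast_lookup():
    """Hypothetical stand-in for a healthy backend."""
    return {"status": "processing"}


if __name__ == "__main__":
    print(asyncio.run(call_with_timeout(slow_lookup(), timeout_s=0.05)))
    print(asyncio.run(call_with_timeout(fast_lookup())))
```

&lt;p&gt;Returning a backend_timeout error instead of raising lets the agent tell the customer the lookup is taking too long and offer a transfer.&lt;/p&gt;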

&lt;h2&gt;
  
  
  Where to go from there
&lt;/h2&gt;

&lt;p&gt;Once the basic order-lookup loop works, the same tool-calling pattern extends to every other support workflow you have: cancel an order, update a shipping address, request a refund, schedule a callback, fetch FAQ answers from a knowledge base. Add the function, describe it in the system prompt, and the agent picks it up — no new infrastructure.&lt;/p&gt;
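
&lt;p&gt;As an illustration of how little is involved, here is what a cancel-order tool might look like in the same schema used for TOOLS. The tool name, its error codes, and the in-memory ORDERS data are all hypothetical. A real version would call your order system.&lt;/p&gt;

```python
CANCEL_ORDER_TOOL = {
    "type": "function",
    "name": "cancel_order",
    "description": "Cancel an order that has not shipped yet.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "The order ID to cancel."},
        },
        "required": ["order_id"],
    },
}

# Hypothetical in-memory data standing in for a real order system
ORDERS = {
    "ORD-67890": {"status": "processing"},
    "ORD-12345": {"status": "shipped"},
}


def cancel_order(order_id):
    """Cancel an order if it exists and has not shipped yet."""
    key = order_id.upper()
    order = ORDERS.get(key)
    if order is None:
        return {"error": "order_not_found", "order_id": order_id}
    if order["status"] == "shipped":
        return {"error": "already_shipped", "order_id": key}
    order["status"] = "cancelled"
    return {"cancelled": True, "order_id": key}


if __name__ == "__main__":
    print(cancel_order("ord-67890"))
    print(cancel_order("ORD-12345"))
```

&lt;p&gt;To wire it in, append CANCEL_ORDER_TOOL to TOOLS, add a matching branch to run_tool, and mention in the system prompt when cancellation is allowed.&lt;/p&gt;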

&lt;p&gt;The compounding win: every conversation goes through the same Voice Agent API connection, the same transcription model, the same billing relationship. You're not assembling a new vendor stack; you're adding tools to an agent that already works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;Try the Voice Agent API live&lt;/a&gt; on the product page — it's the same API you'd ship with — or &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;grab a free API key with $50 in starter credits&lt;/a&gt; and have your first agent answering calls by end of day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do I build an AI voice agent for customer support that can look up orders?
&lt;/h3&gt;

&lt;p&gt;Build it on AssemblyAI's Voice Agent API, register a get_order_status function as a tool with JSON Schema, and connect to the WebSocket at wss://agents.assemblyai.com/v1/ws. The agent transcribes the customer's speech, decides when to call your function, executes it through your backend, and speaks the result back — all on a single connection. Most developers ship a working agent in an afternoon because there's no SDK to learn and no separate STT, LLM, or TTS providers to wire together.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does speech-to-text accuracy matter so much for support voice agents?
&lt;/h3&gt;

&lt;p&gt;Support agents constantly need to capture alphanumeric tokens — order IDs, account numbers, email addresses, phone numbers — and a single transcription error breaks the workflow. If the STT layer mishears "ORD-12345" as "or 12 three 45," your get_order_status function gets a garbled ID and returns nothing. AssemblyAI's Voice Agent API is built on Universal-3 Pro Streaming, which has a 16.7% mixed-entity error rate vs. 23–25% for competing models — that's the difference between tool calls that succeed and tool calls that silently fail.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does tool calling work with the AssemblyAI Voice Agent API?
&lt;/h3&gt;

&lt;p&gt;You register tools by passing an array of function definitions in session.tools on a session.update event. When the agent decides to call a tool, it emits a tool.call event with the function name and arguments. You execute the function and accumulate results, then send tool.result events inside your reply.done handler — not immediately on tool.call. While the tool runs, the agent speaks a brief transition phrase like "let me check that for you" so the conversation never goes silent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I connect AssemblyAI's Voice Agent API to phone calls with Twilio?
&lt;/h3&gt;

&lt;p&gt;Yes — the Voice Agent API supports audio/pcmu (G.711 μ-law at 8 kHz) natively, which matches Twilio's codec exactly with no transcoding needed. You set up a server that accepts the inbound Twilio Media Streams call, opens a parallel WebSocket to the Voice Agent API, and forwards audio in both directions. The official &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/connect-to-twilio" rel="noopener noreferrer"&gt;Twilio integration guide&lt;/a&gt; walks through inbound and outbound calling in about 100 lines of TypeScript.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the best way to handle escalation to a human in a customer support voice agent?
&lt;/h3&gt;

&lt;p&gt;Register a transfer_to_human tool with parameters for reason and summary, and instruct the agent in the system prompt to call it when the customer asks for a person, sounds frustrated, or has a complex issue (refund disputes, billing errors, damaged products). The agent generates a short summary of the conversation that you forward to your human queue, so the receiving agent doesn't have to ask the customer to repeat themselves. This is one of the most important workflows to design well — a poor handoff feels worse than no AI at all.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does it cost to run a customer support voice agent on AssemblyAI?
&lt;/h3&gt;

&lt;p&gt;The Voice Agent API is $4.50/hr flat — covering speech understanding, LLM reasoning, voice generation, turn detection, and tool calling all in one bill. There are no per-token surcharges, no concurrency caps, and no separate invoices for STT, LLM, and TTS providers. Pricing is billed by the minute on actual conversation duration, and a free tier is available for testing.&lt;/p&gt;
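
&lt;p&gt;For a back-of-the-envelope budget, the flat rate turns into simple arithmetic. The sketch below assumes cost scales linearly with total conversation minutes and ignores any per-call rounding, free-tier credits, and telephony charges. The call volume in the example is made up.&lt;/p&gt;

```python
HOURLY_RATE_USD = 4.50  # flat Voice Agent API rate quoted above


def monthly_cost(calls_per_month, avg_minutes_per_call):
    """Estimate the monthly bill from call volume and average call length."""
    total_minutes = calls_per_month * avg_minutes_per_call
    return round(total_minutes * HOURLY_RATE_USD / 60, 2)


if __name__ == "__main__":
    # 1,000 calls averaging 4 minutes each is 4,000 minutes of conversation
    print(monthly_cost(1000, 4))  # 300.0
```

&lt;p&gt;At $4.50 per hour, each conversation minute costs 7.5 cents, so 4,000 minutes comes to $300.&lt;/p&gt;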

&lt;h3&gt;
  
  
  Do voice agents built with AssemblyAI work with healthcare workflows subject to HIPAA?
&lt;/h3&gt;

&lt;p&gt;Yes — AssemblyAI offers a Business Associate Addendum (BAA) for customers processing protected health information (PHI) and is SOC 2 Type 2, ISO 27001:2022, and PCI DSS v4.0 certified. For clinical use cases (medical front-office voice agents, healthcare contact centers), enable Medical Mode with domain="medical-v1" to improve transcription accuracy on medication names, procedures, conditions, and dosages. Do not point the agent at real PHI without a signed BAA in place.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>customersupport</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Build a real-time voice AI agent in Python with the AssemblyAI Voice Agent API</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 19 May 2026 19:26:37 +0000</pubDate>
      <link>https://dev.to/martschweiger/build-a-real-time-voice-ai-agent-in-python-with-the-assemblyai-voice-agent-api-4477</link>
      <guid>https://dev.to/martschweiger/build-a-real-time-voice-ai-agent-in-python-with-the-assemblyai-voice-agent-api-4477</guid>
      <description>&lt;p&gt;You can build a working real-time voice agent in Python in well under 100 lines of code if you use the right primitive. This tutorial walks through building one on the &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;AssemblyAI Voice Agent API&lt;/a&gt; — a single WebSocket that wraps streaming speech-to-text, an LLM, text-to-speech, turn detection, and tool calling at $4.50/hour flat. No three-provider pipeline to wire up, no separate STT WebSocket plus LLM HTTP plus TTS stream to coordinate. Audio in, audio out, tool calls in between.&lt;/p&gt;

&lt;p&gt;By the end of this guide, you'll have a runnable Python voice agent that listens through your microphone, holds a real conversation, and calls Python functions to take actions. The companion repository is linked at the end. If you'd rather chain streaming STT, an LLM, and a TTS provider yourself, our &lt;a href="https://www.assemblyai.com/blog/python-voice-agent-tutorial" rel="noopener noreferrer"&gt;Python voice agent tutorial&lt;/a&gt; covers that path, or see the &lt;a href="https://www.assemblyai.com/blog/build-a-voice-agent-5-minutes-voice-agent-api" rel="noopener noreferrer"&gt;5-minute Voice Agent API quickstart&lt;/a&gt; for an even faster path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use the Voice Agent API for a Python voice agent
&lt;/h2&gt;

&lt;p&gt;The traditional "voice agent in Python" tutorial wires together a streaming STT API, an LLM HTTP endpoint, and a TTS streaming connection — three providers, three sets of credentials, three sets of latency variables to tune, and your own turn detection logic to write. The result works, but it's a lot of plumbing.&lt;/p&gt;

&lt;p&gt;The Voice Agent API replaces all of that with one WebSocket. You connect once, send audio frames, and receive both audio output and tool call events on the same stream. Three properties make it useful for production Python voice agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One bill, one set of logs.&lt;/strong&gt; $4.50/hour of session time covers STT, LLM inference, TTS, turn detection, and tool calling. You're not pasting three invoices into a cost spreadsheet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech accuracy that works on real audio.&lt;/strong&gt; &lt;a href="https://www.assemblyai.com/universal-3-pro-streaming" rel="noopener noreferrer"&gt;Universal-3 Pro Streaming&lt;/a&gt; sits underneath — 307ms P50 latency, immutable transcripts, native 8kHz mulaw support for telephony, and 21% fewer alphanumeric errors than the previous generation of &lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;streaming STT&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling that maps to Python functions cleanly.&lt;/strong&gt; Define tools as JSON schemas; the LLM calls them, and results stream back into the conversation. No separate function-calling API or LLM provider to manage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Microphone
     │  PCM16 24kHz mono
     ▼
  Your Python script
     │  WebSocket: input.audio frames
     ▼
  AssemblyAI Voice Agent API
   ┌────────────────────────────────┐
   │  STT + Turn detection           │
   │      ↓                          │
   │  LLM + tool calling             │
   │      ↓                          │
   │  TTS                            │
   └────────────────────────────────┘
     │
     │  WebSocket: reply.audio + tool.call events
     ▼
  Your Python script
     ├─► Speaker playback
     └─► Dispatch tool calls back to LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Audio flows in two directions on the same WebSocket. Your script captures mic audio, base64-encodes it, and sends it as input.audio events. The API returns audio playback chunks as reply.audio events and structured tool.call events when the LLM decides to invoke one of your tools. You dispatch the tool, send back a tool.result, and the conversation continues.&lt;/p&gt;
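
&lt;p&gt;As a rough sketch of that framing, the helpers below wrap a PCM chunk in an input.audio event and unwrap a reply.audio event. The helper names are hypothetical, and the "audio" and "data" field names are assumed to match the event shapes used in this tutorial's agent code.&lt;/p&gt;

```python
import base64
import json

def encode_input_audio(chunk):
    """Wrap raw PCM16 bytes in an input.audio event (JSON text)."""
    return json.dumps({
        "type": "input.audio",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })

def decode_reply_audio(raw):
    """Return PCM bytes from a reply.audio event, or None for other events."""
    event = json.loads(raw)
    if event.get("type") != "reply.audio":
        return None
    return base64.b64decode(event["data"])

# 50 ms of silence at 24 kHz, 16-bit mono: 1200 frames * 2 bytes each.
silence = b"\x00\x00" * 1200
message = encode_input_audio(silence)
```

&lt;p&gt;Base64 keeps the binary audio safe inside a JSON text frame, at the cost of roughly a third more bytes on the wire.&lt;/p&gt;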

&lt;h2&gt;
  
  
  Before you start
&lt;/h2&gt;

&lt;p&gt;You'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI account&lt;/a&gt; with Voice Agent API access&lt;/li&gt;
&lt;li&gt;Python 3.11+&lt;/li&gt;
&lt;li&gt;A working microphone and speakers (use &lt;strong&gt;headphones&lt;/strong&gt; for clean barge-in — desktop mics pick up the agent's own voice and cause it to interrupt itself)&lt;/li&gt;
&lt;li&gt;portaudio installed system-wide (brew install portaudio on macOS, apt install portaudio19-dev on Debian/Ubuntu)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install the dependencies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install "websockets&amp;gt;=14" python-dotenv pyaudio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Drop your API key into a .env file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ASSEMBLYAI_API_KEY=your_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 1: Capture microphone audio
&lt;/h2&gt;

&lt;p&gt;PyAudio captures raw PCM audio. The Voice Agent API's default audio/pcm encoding is &lt;strong&gt;24 kHz, 16-bit, mono&lt;/strong&gt; — the audio format docs recommend ~50 ms chunks for low latency.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# audio.py
import threading
from queue import Queue
import pyaudio

SAMPLE_RATE = 24000
CHUNK_SIZE = 1200  # 50ms at 24kHz 16-bit mono

class Mic:
    def __init__(self):
        self._pa = pyaudio.PyAudio()
        self.queue = Queue()
        self._running = False

    def start(self):
        self._running = True
        self._stream = self._pa.open(
            format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
            input=True, frames_per_buffer=CHUNK_SIZE,
        )
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while self._running:
            self.queue.put(
                self._stream.read(CHUNK_SIZE, exception_on_overflow=False)
            )

    def stop(self):
        self._running = False
        self._stream.stop_stream(); self._stream.close()
        self._pa.terminate()

class Speaker:
    def __init__(self):
        self._pa = pyaudio.PyAudio()
        self._stream = self._open()

    def _open(self):
        return self._pa.open(
            format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE, output=True,
        )

    def play(self, audio_bytes):
        self._stream.write(audio_bytes)

    def flush_and_restart(self):
        # Called on barge-in: drop any queued speech and reopen the stream.
        try:
            self._stream.stop_stream(); self._stream.close()
        except Exception:
            pass
        self._stream = self._open()

    def close(self):
        self._stream.stop_stream(); self._stream.close()
        self._pa.terminate()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
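
&lt;p&gt;The CHUNK_SIZE value follows from simple arithmetic. PyAudio counts frames rather than bytes, so 50 ms at 24 kHz is 1200 frames, or 2400 bytes of 16-bit mono audio. A small sketch with hypothetical helper names:&lt;/p&gt;

```python
BYTES_PER_SAMPLE = 2  # 16-bit PCM

def chunk_frames(sample_rate, chunk_ms):
    """Frames per chunk for a given sample rate and chunk duration."""
    return sample_rate * chunk_ms // 1000

def chunk_bytes(sample_rate, chunk_ms, channels=1):
    """Bytes per chunk of 16-bit PCM audio."""
    return chunk_frames(sample_rate, chunk_ms) * BYTES_PER_SAMPLE * channels

print(chunk_frames(24000, 50))  # 1200
print(chunk_bytes(24000, 50))   # 2400
```

&lt;p&gt;If you change SAMPLE_RATE, recompute CHUNK_SIZE the same way so each chunk still covers about 50 ms.&lt;/p&gt;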
&lt;h2&gt;
  
  
  Step 2: Open the Voice Agent API session
&lt;/h2&gt;

&lt;p&gt;The Voice Agent API connection starts with a session.update message that declares your system prompt, the tools you want available, the agent's voice, and an opening greeting. The API picks audio/pcm (24 kHz) by default, so you don't need to specify input/output format explicitly.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# agent.py
import asyncio, base64, json, os
import websockets
from dotenv import load_dotenv

from audio import Mic, Speaker
from tools import TOOLS, dispatch_tool

load_dotenv()

VOICE_AGENT_WS = "wss://agents.assemblyai.com/v1/ws"

SYSTEM_PROMPT = """You are a helpful voice assistant.
Keep replies short and conversational — one or two sentences.
Use the available tools to answer questions when relevant."""

async def open_session(ws):
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "system_prompt": SYSTEM_PROMPT,
            "greeting": "Hi! How can I help?",
            "tools": TOOLS,
            "output": {"voice": "ivy"},
        },
    }))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A few details worth flagging up front, because they're the easy ones to get wrong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The auth header for the Voice Agent API uses &lt;strong&gt;Authorization: Bearer YOUR_KEY&lt;/strong&gt; — note the Bearer prefix. This is different from every other AssemblyAI endpoint, which accepts the raw API key with no prefix.&lt;/li&gt;
&lt;li&gt;The first message you send is session.update, not session.start. All config nests under a session object.&lt;/li&gt;
&lt;li&gt;The voice field is a named voice from the Voice Agent API catalog (e.g. ivy, james, sophie) — not an ElevenLabs voice ID. See the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices" rel="noopener noreferrer"&gt;voices reference&lt;/a&gt; for the full list.&lt;/li&gt;
&lt;li&gt;You must wait for the server's session.ready event before sending any audio.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 3: Pump audio in, route events out
&lt;/h2&gt;

&lt;p&gt;Two coroutines run concurrently: one sends mic chunks once the session is ready, the other reads events as they arrive.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async def run_agent():
    mic = Mic()
    speaker = Speaker()

    async with websockets.connect(
        VOICE_AGENT_WS,
        additional_headers={
            "Authorization": f"Bearer {os.environ['ASSEMBLYAI_API_KEY']}"
        },
    ) as ws:
        await open_session(ws)

        ready = asyncio.Event()
        pending_tools = []
        loop = asyncio.get_running_loop()

        async def send_audio():
            await ready.wait()
            mic.start()
            while True:
                chunk = await loop.run_in_executor(None, mic.queue.get)
                await ws.send(json.dumps({
                    "type": "input.audio",
                    "audio": base64.b64encode(chunk).decode(),
                }))

        async def receive():
            async for raw in ws:
                event = json.loads(raw)
                kind = event["type"]

                if kind == "session.ready":
                    ready.set()
                    print(f"Session ready: {event.get('session_id')}")

                elif kind == "reply.audio":
                    speaker.play(base64.b64decode(event["data"]))

                elif kind == "tool.call":
                    # Accumulate — flush on reply.done, not now.
                    result = dispatch_tool(
                        event["name"], event.get("arguments", {})
                    )
                    pending_tools.append(
                        {"call_id": event["call_id"], "result": result}
                    )

                elif kind == "reply.done":
                    if event.get("status") == "interrupted":
                        pending_tools.clear()
                        speaker.flush_and_restart()
                    elif pending_tools:
                        for tool in pending_tools:
                            value = tool["result"]
                            if not isinstance(value, str):
                                value = json.dumps(value)
                            await ws.send(json.dumps({
                                "type": "tool.result",
                                "call_id": tool["call_id"],
                                "result": value,
                            }))
                        pending_tools.clear()

                elif kind == "transcript.user":
                    print(f"You:   {event['text']}")

                elif kind == "transcript.agent":
                    print(f"Agent: {event['text']}")

        await asyncio.gather(send_audio(), receive())

if __name__ == "__main__":
    asyncio.run(run_agent())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's the entire voice agent loop. The Voice Agent API handles every layer of the pipeline (STT, LLM, TTS, turn detection) inside the WebSocket. Your job is to feed it audio, play what comes back, and dispatch tool calls.&lt;/p&gt;

&lt;p&gt;Two more easy-to-miss details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool result timing.&lt;/strong&gt; Per the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/tool-calling" rel="noopener noreferrer"&gt;tool calling docs&lt;/a&gt;, accumulate tool results when tool.call fires and send them inside the reply.done handler — not immediately. The agent generates a short transition phrase ("let me check on that") while the tools run; sending results too early can cause timing issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interruption handling.&lt;/strong&gt; When the user barges in, the server sends reply.done with status: "interrupted". Drop any queued tool results and flush the speaker so the caller doesn't keep hearing the previous reply.&lt;/li&gt;
&lt;/ul&gt;
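
&lt;p&gt;One way to keep both rules in one place is a small buffer object, sketched below. The class and method names are hypothetical. It only mirrors the accumulate-then-flush logic from the receive() loop so that logic can be tested without a live connection.&lt;/p&gt;

```python
import json

class ToolResultBuffer:
    """Holds tool results until the reply that requested them is done."""

    def __init__(self):
        self._pending = []

    def add(self, call_id, result):
        # tool.result carries a string here, so serialize anything else.
        if not isinstance(result, str):
            result = json.dumps(result)
        self._pending.append({"call_id": call_id, "result": result})

    def on_reply_done(self, status=None):
        """Return the tool.result events to send, then reset the buffer."""
        pending, self._pending = self._pending, []
        if status == "interrupted":
            return []  # user barged in: discard stale results
        return [{"type": "tool.result", **item} for item in pending]

buf = ToolResultBuffer()
buf.add("call_1", {"temp_f": 68})
events = buf.on_reply_done()
```

&lt;p&gt;In the agent, each returned event would be passed through json.dumps() and sent over the WebSocket.&lt;/p&gt;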

&lt;h2&gt;
  
  
  Step 4: Implement the tools
&lt;/h2&gt;

&lt;p&gt;The dispatch_tool function is where your agent does real work. The Voice Agent API delivers tool.call events with the arguments as a JSON object, so once you decode the event they are already a Python dict. No second json.loads() on the arguments is needed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tools.py
TOOLS = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    {
        "type": "function",
        "name": "remember",
        "description": "Save something the user wants you to remember.",
        "parameters": {
            "type": "object",
            "properties": {"fact": {"type": "string"}},
            "required": ["fact"],
        },
    },
]

_memory = []

def dispatch_tool(name, args):
    if name == "get_weather":
        # In production: call a real weather API.
        return f"It's 68°F and partly cloudy in {args['city']}."
    if name == "remember":
        _memory.append(args["fact"])
        return f"Got it. I'll remember: {args['fact']}"
    return f"Unknown tool: {name}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The "type": "function" field on each tool is required. Forget it and the API will reject the session.update with a validation error.&lt;/p&gt;

&lt;p&gt;In production, replace the stubs with calls to a real weather API, your CRM, a database, or whatever your application actually does. The tool dispatcher is pure Python — anything you can do from a Python function, the voice agent can do.&lt;/p&gt;
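
&lt;p&gt;As the tool list grows, an if-chain gets unwieldy. A registry keyed by tool name is one possible alternative, sketched here. The decorator and the TOOL_HANDLERS dict are illustrative names, not part of the Voice Agent API, and the weather handler is still a stub.&lt;/p&gt;

```python
TOOL_HANDLERS = {}

def tool(name):
    """Decorator that registers a handler under a tool name."""
    def register(fn):
        TOOL_HANDLERS[name] = fn
        return fn
    return register

@tool("get_weather")
def get_weather(city):
    # Stub, as in tools.py; swap in a real weather API call.
    return f"It's 68°F and partly cloudy in {city}."

def dispatch_tool(name, args):
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return f"Unknown tool: {name}"
    try:
        return handler(**args)
    except Exception as exc:  # bad arguments or a failing backend
        return f"Tool {name} failed: {exc}"
```

&lt;p&gt;Returning an error string instead of raising keeps the receive loop alive and gives the LLM something to say to the user.&lt;/p&gt;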

&lt;h2&gt;
  
  
  Step 5: Run it
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python agent.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The agent greets you. Try:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"What's the weather in San Francisco?"&lt;/li&gt;
&lt;li&gt;"Remember that my passport expires in March."&lt;/li&gt;
&lt;li&gt;"What did I just ask you to remember?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full flow: your speech → STT → LLM (with tools available) → tool call (if applicable) → tool result → LLM continues → TTS → speaker. All in under a second, on one WebSocket.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency: getting under 500ms perceived
&lt;/h2&gt;

&lt;p&gt;A natural-feeling voice agent responds in under 800ms from when you stop talking to when you hear the reply. Best-in-class teams target sub-500ms. Where your milliseconds go on the Voice Agent API:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Typical latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mic chunk → server&lt;/td&gt;
&lt;td&gt;~50–100ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End-of-turn detection&lt;/td&gt;
&lt;td&gt;~100–200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM first-token&lt;/td&gt;
&lt;td&gt;~200–400ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTS first-byte → speaker&lt;/td&gt;
&lt;td&gt;~100–250ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Perceived total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~450–950ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
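
&lt;p&gt;The total row is the sum of the stage ranges. Adding them up shows that only the low end of every stage combined clears the sub-500ms target, and the worst case overshoots the 800ms mark. The figures below are copied from the table and are typical values, not guarantees.&lt;/p&gt;

```python
# Stage ranges in milliseconds, copied from the table above.
STAGES = {
    "mic chunk to server": (50, 100),
    "end-of-turn detection": (100, 200),
    "LLM first token": (200, 400),
    "TTS first byte to speaker": (100, 250),
}

best = sum(low for low, _ in STAGES.values())
worst = sum(high for _, high in STAGES.values())
print(f"perceived total: {best}-{worst} ms")  # perceived total: 450-950 ms

# Headroom against an 800 ms budget in the worst case.
print(800 - worst)  # -150
```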

&lt;p&gt;The Voice Agent API streams audio output as it's generated, so the user hears the first word of the reply while the rest is still being synthesized. The biggest latency wins on the Python side:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't buffer mic audio.&lt;/strong&gt; Send 50ms chunks as they arrive — that's what the audio.py example does.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't block in the tool dispatcher.&lt;/strong&gt; If a tool call takes more than 500ms, the silence becomes audible. Cache hot data, set aggressive timeouts, and consider returning a placeholder ("Let me check on that") while the real call resolves.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the streaming audio output.&lt;/strong&gt; Play reply.audio chunks as they arrive; never wait for the full response.&lt;/li&gt;
&lt;/ul&gt;
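
&lt;p&gt;For the tool dispatcher point, a minimal sketch of a timeout wrapper follows. It assumes tools are blocking Python functions and runs them in a worker thread with asyncio.to_thread. The function names, the 500 ms default, and the fallback text are all illustrative. The worker thread is not killed on timeout; it finishes in the background and its result is discarded.&lt;/p&gt;

```python
import asyncio
import time

FALLBACK = "Sorry, that lookup is taking too long."

async def call_with_timeout(fn, args, timeout_s=0.5, fallback=FALLBACK):
    """Run a blocking tool in a worker thread; return fallback on timeout."""
    try:
        return await asyncio.wait_for(asyncio.to_thread(fn, **args), timeout_s)
    except asyncio.TimeoutError:
        return fallback

def slow_lookup(order_id):
    time.sleep(0.2)  # stand-in for a slow backend
    return f"Order {order_id} shipped."

fast = asyncio.run(call_with_timeout(slow_lookup, {"order_id": "A1"}, timeout_s=1.0))
late = asyncio.run(call_with_timeout(slow_lookup, {"order_id": "A1"}, timeout_s=0.05))
print(fast)  # Order A1 shipped.
print(late)  # Sorry, that lookup is taking too long.
```

&lt;p&gt;Running the tool off the event loop also keeps audio flowing while the backend call is in flight.&lt;/p&gt;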

&lt;h2&gt;
  
  
  Handling interruptions
&lt;/h2&gt;

&lt;p&gt;Real conversations include interruptions. The user changes their mind, asks a follow-up while the agent is still talking, or says "wait, no, the other one." The Voice Agent API handles this server-side: barge-in is semantic — back-channels like "uh-huh" don't trigger an interruption, but "wait, stop" does.&lt;/p&gt;

&lt;p&gt;When the user actually interrupts, the server sends reply.done with status: "interrupted" (and transcript.agent with interrupted: true and the trimmed text). Your client should flush any queued speaker audio and drop any pending tool results, exactly as shown in the receive() loop above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going to production
&lt;/h2&gt;

&lt;p&gt;The agent above runs against your local microphone. To deploy it, swap the audio transport:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phone calls (PSTN)&lt;/strong&gt;: Bridge through Twilio Media Streams. The Voice Agent API supports audio/pcmu (G.711 μ-law at 8 kHz) natively, so phone audio stays in μ-law end-to-end with no resampling. See our &lt;a href="https://www.assemblyai.com/blog/build-voice-agent-livekit" rel="noopener noreferrer"&gt;LiveKit voice agent guide&lt;/a&gt; if you'd rather use an orchestrator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web apps&lt;/strong&gt;: Capture audio in the browser with AudioWorklet, then stream it to the Voice Agent API. See &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/browser-integration" rel="noopener noreferrer"&gt;Browser integration&lt;/a&gt; for the temporary-token flow that keeps your API key off the client.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile&lt;/strong&gt;: Same pattern. The native audio capture APIs (iOS AVAudioEngine, Android AudioRecord) emit PCM you can forward through your server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For all production deployments, add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session persistence (save the session_id from session.ready and use session.resume to reconnect within 30 seconds without losing context)&lt;/li&gt;
&lt;li&gt;Per-session structured logs (user transcript, agent transcript, tool calls, tool results)&lt;/li&gt;
&lt;li&gt;PII redaction on transcripts before they hit your warehouse&lt;/li&gt;
&lt;li&gt;A timeout-and-retry policy for tool calls so a slow backend doesn't kill the call&lt;/li&gt;
&lt;/ul&gt;
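
&lt;p&gt;For the structured-logs item, one JSON line per event is a common shape that most log pipelines can ingest. The sketch below is an assumption about how you might do it. The log_event helper and its record fields are hypothetical, and the session ID is a made-up placeholder.&lt;/p&gt;

```python
import io
import json
import time

def log_event(sink, session_id, kind, payload):
    """Write one JSON line describing a conversation event."""
    record = {
        "ts": round(time.time(), 3),
        "session_id": session_id,
        "kind": kind,  # e.g. transcript.user, tool.call, tool.result
        "payload": payload,
    }
    sink.write(json.dumps(record) + "\n")
    return record

# In production the sink would be a file or log shipper; StringIO keeps
# this sketch self-contained.
sink = io.StringIO()
log_event(sink, "sess_123", "transcript.user", {"text": "What's the weather?"})
log_event(sink, "sess_123", "tool.call",
          {"name": "get_weather", "arguments": {"city": "Oslo"}})
lines = sink.getvalue().splitlines()
```

&lt;p&gt;Run any PII redaction on the payload before it reaches log_event so raw transcripts never land in storage.&lt;/p&gt;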

&lt;h2&gt;
  
  
  The complete repository
&lt;/h2&gt;

&lt;p&gt;Fork the runnable Python repo at &lt;a href="https://github.com/kelsey-aai/python-voice-agent-api" rel="noopener noreferrer"&gt;github.com/kelsey-aai/python-voice-agent-api&lt;/a&gt;. It includes mic capture, speaker playback, the WebSocket loop, the tool dispatcher, and example tools you can swap for your own. Around 200 lines of Python end-to-end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do I build a real-time voice agent in Python?
&lt;/h3&gt;

&lt;p&gt;The fastest way to build a real-time &lt;a href="https://www.assemblyai.com/blog/ai-voice-agents" rel="noopener noreferrer"&gt;voice agent in Python&lt;/a&gt; in 2026 is to open a WebSocket to the AssemblyAI Voice Agent API at wss://agents.assemblyai.com/v1/ws, stream microphone audio in as input.audio events, and play the reply.audio events you get back. The Voice Agent API handles streaming speech-to-text, the LLM, text-to-speech, turn detection, and tool calling on a single connection at $4.50/hour, so you don't need to wire up three separate providers. With PyAudio for microphone access and the websockets library, the entire agent fits in well under 100 lines of Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the difference between the Voice Agent API and chaining STT-LLM-TTS in Python?
&lt;/h3&gt;

&lt;p&gt;The chained approach uses three providers: a streaming STT API like AssemblyAI Universal-3 Pro Streaming, an LLM like GPT-4o, and a streaming TTS like ElevenLabs. You write the WebSocket bridge, turn detection logic, and retry handling yourself. The Voice Agent API replaces all of that with a single WebSocket — one provider, one bill, one set of logs. Chained pipelines give you finer control over each layer; the Voice Agent API is faster to ship and easier to operate at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I add tool calling to a Python voice agent?
&lt;/h3&gt;

&lt;p&gt;Define tools as JSON schemas in the tools field of your session.update message — each tool needs "type": "function", a name, a description, and a parameter schema. When the LLM decides to call a tool, the Voice Agent API emits a tool.call event on the WebSocket with the tool name, arguments (as a Python dict), and a call_id. Your Python dispatcher runs the actual function, then you send back a tool.result event with that call_id and the result. Send tool results inside your reply.done handler, not immediately on tool.call — the agent speaks a transition phrase while the tools run.&lt;/p&gt;

&lt;h3&gt;
  
  
  How low can latency go on a Python voice agent?
&lt;/h3&gt;

&lt;p&gt;A well-tuned Python voice agent on the Voice Agent API typically lands at 450–950ms perceived latency from end-of-turn to first audio out. The biggest wins are: (1) keep mic chunks small (~50ms) so end-of-turn detection fires fast, (2) don't block in your tool dispatcher — cache and timeout aggressively, and (3) play reply.audio chunks as they arrive instead of buffering. Universal-3 Pro Streaming alone hits 307ms P50 for transcription, which is the floor for the STT layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use a different LLM with the Voice Agent API?
&lt;/h3&gt;

&lt;p&gt;The Voice Agent API ships with frontier-quality LLMs under the hood, selected for low-latency conversational performance. If you specifically need a model that isn't available through the Voice Agent API, you can fall back to a chained architecture where you use AssemblyAI Universal-3 Pro Streaming for the STT layer and bring your own LLM and TTS. Most teams find the Voice Agent API model selection meets their needs and prefer the simpler architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I handle interruptions in a Python voice agent?
&lt;/h3&gt;

&lt;p&gt;The Voice Agent API detects barge-in semantically: back-channels like "uh-huh" don't interrupt, but "wait, stop" does. When the user actually interrupts, the server emits reply.done with status: "interrupted" and transcript.agent with interrupted: true. Your Python client should flush the speaker buffer (close and reopen the PyAudio output stream, or call abort() on the output stream if you use sounddevice), drop any pending tool results, and continue listening for the user's new turn. This is what makes interruptions feel natural: the agent stops talking immediately instead of waiting for the previous reply to finish.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to create an AI cold-calling agent with the Voice Agent API</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 19 May 2026 19:26:28 +0000</pubDate>
      <link>https://dev.to/martschweiger/how-to-create-an-ai-cold-calling-agent-with-the-voice-agent-api-70p</link>
      <guid>https://dev.to/martschweiger/how-to-create-an-ai-cold-calling-agent-with-the-voice-agent-api-70p</guid>
      <description>&lt;p&gt;An &lt;a href="https://www.assemblyai.com/solutions/voice-agents" rel="noopener noreferrer"&gt;AI cold-calling agent&lt;/a&gt; placed correctly does 500 lead-qualification calls in parallel for the cost of a single SDR. Placed poorly, it sounds like a robocall and gets hung up on in five seconds. The difference between the two isn't the LLM or the TTS — it's the speech accuracy on phone audio, the turn-taking that decides whether the agent interrupts a hesitant prospect, and the compliance layer that keeps you out of TCPA trouble.&lt;/p&gt;

&lt;p&gt;This tutorial walks through building an AI cold-calling agent on the &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;AssemblyAI Voice Agent API&lt;/a&gt; for the conversation layer, with Twilio for outbound dialing. The Voice Agent API gives you one WebSocket for STT, LLM, TTS, turn detection, and tool calling — you don't wire three providers together. You write the outbound dialer, the compliance gate, and the function dispatcher. The companion repository is linked at the end.&lt;/p&gt;

&lt;p&gt;If you're looking for the chained STT + LLM + TTS architecture instead, our &lt;a href="https://www.assemblyai.com/blog/how-to-create-ai-cold-calling-agent" rel="noopener noreferrer"&gt;original AI cold-calling agent guide&lt;/a&gt; covers that path with Universal-3 Pro Streaming directly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What an AI cold-calling agent does
&lt;/h2&gt;

&lt;p&gt;An AI cold-calling agent is an outbound voice AI system that dials a prospect, delivers a pitch in natural conversation, adapts in real time based on what the prospect says, and books qualified meetings or gathers disposition data. Unlike a robocall (one-way recorded message) or a power dialer with a human rep, it conducts a two-way conversation autonomously.&lt;/p&gt;

&lt;p&gt;The use cases where AI cold-calling agents work well today share three traits — high volume, structured pitch, and concrete success criteria (see our &lt;a href="https://www.assemblyai.com/blog/build-voice-agent-outbound-call-assemblyai" rel="noopener noreferrer"&gt;outbound calls walkthrough&lt;/a&gt; for the simpler "agent dials a single number" pattern):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Outbound SDR prospecting&lt;/strong&gt;: open with a relevant hook, qualify BANT, book a demo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Appointment setting&lt;/strong&gt; for field sales, financial advisors, home services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-engagement of lapsed leads&lt;/strong&gt; in a CRM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Survey and research calls&lt;/strong&gt; at scale&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Event follow-up and RSVP confirmation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Renewal and upsell motions&lt;/strong&gt; for existing customers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread: one script, thousands of conversations, a measurable booking rate or disposition. That's where the Voice Agent API's combination of speech accuracy, tool calling, and flat-rate pricing pays for itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  CRM / lead list (Salesforce, HubSpot, CSV)
       │
       ▼
  dialer.py
       │  compliance_gate()  ← TCPA, DNC, state laws, time windows
       ▼
  Twilio outbound dial
       │  TwiML → open Media Stream
       ▼
  bridge_server.py
       │  Twilio Media Stream ↔ Voice Agent API WebSocket
       ▼
  AssemblyAI Voice Agent API
   ┌──────────────────────────────────┐
   │  STT + Turn detection             │
   │      ↓                            │
   │  LLM with sales prompt + tools    │
   │      ↓                            │
   │  TTS                              │
   └──────────────────────────────────┘
       │
       │  tool calls
       ▼
  - book_meeting    (calendar API)
  - log_disposition (CRM update)
  - honor_dnc       (suppression list)
  - mark_callback   (scheduling)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Voice Agent API handles the conversation. Your code handles three things outside the conversation: the &lt;strong&gt;dialer&lt;/strong&gt; (who to call, when, at what concurrency), the &lt;strong&gt;compliance gate&lt;/strong&gt; (TCPA, DNC, state consent), and the &lt;strong&gt;tool dispatcher&lt;/strong&gt; (book a meeting, update the CRM, honor a do-not-call request).&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use the Voice Agent API for cold-calling
&lt;/h2&gt;

&lt;p&gt;Three things make the Voice Agent API a strong fit for outbound voice agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speech accuracy on phone audio.&lt;/strong&gt; Cold calls capture emails, phone numbers, company names, and job titles — "five one five, nine eight two, four zero zero zero," "J at acme dot io," "director of rev ops." &lt;a href="https://www.assemblyai.com/universal-3-pro-streaming" rel="noopener noreferrer"&gt;Universal-3 Pro Streaming&lt;/a&gt; (the STT layer under the Voice Agent API) delivers 21% fewer alphanumeric errors and 28% better accuracy on consecutive numbers than the previous generation. That's the difference between a booked meeting in your calendar and a typo you never catch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling that maps to the booking moment.&lt;/strong&gt; When a prospect says "yes, Tuesday at 2pm works," the agent has to fire book_meeting immediately, not in the next turn. The Voice Agent API emits tool calls as reliably structured output, which matters when a single missed booking defeats the purpose of the call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flat $4.50/hour pricing.&lt;/strong&gt; Outbound is bursty by nature. You don't want per-token surprises when the dialer fires 500 simultaneous calls. The Voice Agent API's flat hourly rate covers STT, LLM, TTS, and tool calls all-in.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Before you start
&lt;/h2&gt;

&lt;p&gt;You'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI account&lt;/a&gt; with Voice Agent API access&lt;/li&gt;
&lt;li&gt;A Twilio account with an outbound-capable phone number (and a verified caller ID if your trial requires it)&lt;/li&gt;
&lt;li&gt;A list of leads with consent to be contacted (CSV is fine for testing — production should integrate your real CRM)&lt;/li&gt;
&lt;li&gt;Python 3.11+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install fastapi uvicorn "websockets&amp;gt;=14" python-dotenv twilio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Step 1: Build the compliance gate first
&lt;/h2&gt;

&lt;p&gt;Compliance is where AI cold-calling teams burn the most money — TCPA fines run $500–$1,500 per violating call. Build the gate before you write a line of dialer code.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# compliance.py
from datetime import datetime
from zoneinfo import ZoneInfo

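# Internal DNC list: whitespace-separated phone numbers in suppression.txt.
with open("suppression.txt") as f:
    DNC_LIST = set(f.read().split())


def on_federal_dnc(phone):
    """Placeholder (not a real API): swap in your DNC scrub provider's lookup."""
    # Failing loudly keeps the gate closed until a provider is wired in.
    raise NotImplementedError("federal DNC lookup is not configured")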

def compliance_gate(lead):
    # 1. Internal suppression (previous DNC requests, unsubscribes)
    if lead["phone"] in DNC_LIST:
        return False, "internal DNC"

    # 2. Federal DNC registry — integrate a real provider in production
    if on_federal_dnc(lead["phone"]):
        return False, "federal DNC"

    # 3. Time window — TCPA bans calls before 8am or after 9pm local
    local_tz = ZoneInfo(lead.get("timezone", "America/New_York"))
    local_hour = datetime.now(local_tz).hour
    if local_hour &amp;lt; 8 or local_hour &amp;gt;= 21:
        return False, f"outside TCPA window ({local_hour}:00 local)"

    # 4. Recording consent: all-party ("two-party") consent states (partial list)
    if lead.get("state") in {"CA", "FL", "PA", "WA", "IL", "MD", "MT", "NH"}:
        # Agent must disclose recording at the top of the call.
        lead["needs_recording_disclosure"] = True

    return True, "ok"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Build this as a hard gate. No call goes out if any check fails.&lt;/p&gt;
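
&lt;p&gt;The time-window check is the easiest part of the gate to get subtly wrong, and it is hard to test when it reads the clock directly. A minimal sketch of the same rule with the current time passed in as an argument; the function name within_calling_window and the ALLOWED_HOURS constant are illustrative, not part of the tutorial's repo:&lt;/p&gt;

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Calls are allowed from 8:00 up to, but not including, 21:00 local time.
ALLOWED_HOURS = range(8, 21)


def within_calling_window(tz_name, now_utc):
    """Return True if now_utc falls inside the lead's local calling window."""
    local = now_utc.astimezone(ZoneInfo(tz_name))
    return local.hour in ALLOWED_HOURS


# 14:00 UTC on a January day is 09:00 in New York but 06:00 in Los Angeles.
moment = datetime(2026, 1, 15, 14, 0, tzinfo=timezone.utc)
print(within_calling_window("America/New_York", moment))     # True
print(within_calling_window("America/Los_Angeles", moment))  # False
```

&lt;p&gt;Passing the timestamp in lets you unit-test the boundary hours without waiting for 9pm.&lt;/p&gt;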

&lt;h2&gt;
  
  
  Step 2: Define the agent's tools
&lt;/h2&gt;

&lt;p&gt;The agent can call four tools mid-conversation. In production, replace the stubs with real CRM, calendar, and DNC API calls. Each tool needs "type": "function" at the top level; the Voice Agent API validates this on session.update.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tools.py
TOOLS = [
    {
        "type": "function",
        "name": "book_meeting",
        "description": "Book a meeting on the rep's calendar.",
        "parameters": {
            "type": "object",
            "properties": {
                "lead_id": {"type": "string"},
                "preferred_time": {"type": "string"},
                "email": {"type": "string"},
            },
            "required": ["lead_id", "preferred_time", "email"],
        },
    },
    {
        "type": "function",
        "name": "log_disposition",
        "description": "Record the call outcome in the CRM.",
        "parameters": {
            "type": "object",
            "properties": {
                "lead_id": {"type": "string"},
                "disposition": {
                    "type": "string",
                    "enum": ["booked", "not_now", "not_interested",
                             "wrong_person", "left_voicemail", "dnc"],
                },
                "notes": {"type": "string"},
            },
            "required": ["lead_id", "disposition"],
        },
    },
    {
        "type": "function",
        "name": "honor_dnc",
        "description": "Add the prospect to the do-not-call list immediately.",
        "parameters": {
            "type": "object",
            "properties": {
                "lead_id": {"type": "string"},
                "phone": {"type": "string"},
            },
            "required": ["lead_id", "phone"],
        },
    },
    {
        "type": "function",
        "name": "mark_callback",
        "description": "Schedule a callback at the prospect's preferred time.",
        "parameters": {
            "type": "object",
            "properties": {
                "lead_id": {"type": "string"},
                "preferred_time": {"type": "string"},
            },
            "required": ["lead_id", "preferred_time"],
        },
    },
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The honor_dnc tool is the most important one. If the prospect says anything that sounds like a do-not-call request ("take me off your list," "don't call me again," "remove me"), the agent must call this tool &lt;strong&gt;immediately&lt;/strong&gt;, acknowledge, and end the call politely. No upselling, no "can I just ask one question." TCPA violations on DNC requests are the most expensive mistake a cold-calling agent can make.&lt;/p&gt;
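
&lt;p&gt;The bridge server in Step 5 imports a dispatch_tool function alongside TOOLS, but its body is not shown in this tutorial. One way it could look is a name-to-handler registry. In the sketch below, the handler bodies, the SUPPRESSED set, and the DISPOSITIONS list are hypothetical in-memory stand-ins for your real CRM and suppression list:&lt;/p&gt;

```python
# Hypothetical in-memory stand-ins for the CRM and the suppression list.
SUPPRESSED = set()
DISPOSITIONS = []


def honor_dnc(lead_id, phone):
    SUPPRESSED.add(phone)
    DISPOSITIONS.append((lead_id, "dnc"))
    return {"status": "removed", "phone": phone}


def log_disposition(lead_id, disposition, notes=""):
    DISPOSITIONS.append((lead_id, disposition))
    return {"status": "logged"}


HANDLERS = {
    "honor_dnc": honor_dnc,
    "log_disposition": log_disposition,
}


def dispatch_tool(name, arguments):
    """Route a tool call to its handler; return an error dict instead of raising."""
    handler = HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return handler(**arguments)
    except TypeError as exc:  # missing or unexpected arguments
        return {"error": str(exc)}


print(dispatch_tool("honor_dnc", {"lead_id": "L1", "phone": "+15155550100"}))
```

&lt;p&gt;Returning an error dict rather than raising keeps one bad tool call from tearing down a live phone conversation; the agent can apologize and move on.&lt;/p&gt;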

&lt;h2&gt;
  
  
  Step 3: Write the system prompt
&lt;/h2&gt;

&lt;p&gt;The system prompt is where the script lives. The prompt below has eight sections: disclosure, opener, discovery, pitch, CTA, objection map, DNC handling, and style.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# prompts.py
SYSTEM_PROMPT = """You are an AI sales development representative for Datafold.
You are calling {prospect_name}, {prospect_title} at {prospect_company}.

DISCLOSURE (required):
- Open every call by stating: "Hi {first_name}, this is an AI assistant calling
  on behalf of Datafold. This call is being recorded."
- This is non-negotiable and legally required in CA, FL, TX, and several other states.

OPENER (15 seconds):
- "I'm reaching out because we help data teams catch breaking changes before
  they hit production. Do you have 30 seconds for me to explain why I'm calling?"
- If yes, continue. If no, ask when's better and call mark_callback.

DISCOVERY (ask only 2 questions, max):
1. "How is your team handling data quality today — manual review, dbt tests,
   or something else?"
2. "How often does a broken model make it to production?"

PITCH (one sentence):
- "Datafold gives data teams CI for their pipelines. Customers like Patreon
  and Faire catch 90% of regressions before they ship."

CTA:
- Offer two specific times in the prospect's time zone.
- Call book_meeting with their email when they accept.

OBJECTION MAP:
- "How did you get my number?" → "You opted in on our website last month."
- "Send me an email" → "Happy to. What's the best address?" (call mark_callback)
- "Not the right person" → "Who handles data quality on your team?"
- "We already use [X]" → "Got it. Most of our customers use [X] alongside Datafold."
- "Not interested" → "No problem. Mind if I ask why?" (then call log_disposition)

DNC HANDLING (highest priority):
- If the prospect says ANYTHING like "take me off your list," "don't call me
  again," "remove me," "stop calling": call honor_dnc IMMEDIATELY, say "Of
  course, you're removed from our list. Sorry to bother you. Have a good day,"
  and end the call. Do NOT try to recover the conversation.

STYLE:
- One or two sentences per turn. Conversational, not formal.
- Listen for tone. If they sound annoyed, wrap up gracefully.
- Never claim to be human. If asked, confirm you're AI.
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That prompt is the entire sales playbook. The Voice Agent API will follow it turn by turn, calling tools when the conversation hits the right moments.&lt;/p&gt;
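
&lt;p&gt;The prompt uses placeholders such as {first_name} and {prospect_company}, and the bridge fills them with SYSTEM_PROMPT.format(**lead). A lead row that lacks one of those keys would raise a KeyError in the middle of a live call. A small sketch of a pre-flight check; render_prompt and required_fields are hypothetical helper names, and deriving first_name from prospect_name is an assumption about your lead data:&lt;/p&gt;

```python
from string import Formatter


def required_fields(template):
    """Return the set of {placeholder} names a format string expects."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def render_prompt(template, lead):
    """Fill the template, deriving first_name and failing early on gaps."""
    context = dict(lead)
    if "first_name" not in context and context.get("prospect_name"):
        context["first_name"] = context["prospect_name"].split()[0]
    missing = required_fields(template) - set(context)
    if missing:
        raise ValueError(f"lead is missing fields: {sorted(missing)}")
    return template.format(**context)


template = "Hi {first_name}, calling for {prospect_name} at {prospect_company}."
lead = {"prospect_name": "Alex Rivera", "prospect_company": "Acme"}
print(render_prompt(template, lead))
```

&lt;p&gt;Running this check in the dialer, before the call is placed, means a malformed lead is skipped instead of producing a broken greeting.&lt;/p&gt;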

&lt;h2&gt;
  
  
  Step 4: Wire up the dialer
&lt;/h2&gt;

&lt;p&gt;The dialer pulls leads from your list, runs each through the compliance gate, and places Twilio calls. It controls concurrency and respects time-of-day rules.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# dialer.py
import asyncio
import csv
import os
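from twilio.rest import Client

from compliance import compliance_gate
# Assumption: log_disposition is a CRM helper that lives in tools.py with the tool stubs.
from tools import log_disposition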

twilio = Client(os.environ["TWILIO_SID"], os.environ["TWILIO_TOKEN"])

async def dial_lead(lead, callback_url):
    ok, reason = compliance_gate(lead)
    if not ok:
        log_disposition(lead["lead_id"], "skipped", notes=reason)
        return

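    # The Twilio client is synchronous, so run it in a worker thread
    # to keep the event loop free for the other dials.
    call = await asyncio.to_thread(
        twilio.calls.create,
        to=lead["phone"],
        from_=os.environ["TWILIO_FROM"],
        url=f"{callback_url}/twilio/voice?lead_id={lead['lead_id']}",
        machine_detection="Enable",  # Adds AnsweredBy to the webhook request
        record=True,                  # Required for compliance/QA
    )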
    print(f"Dialing {lead['lead_id']}: {call.sid}")

async def run_dialer(leads_csv, max_concurrent=10):
    sem = asyncio.Semaphore(max_concurrent)
    with open(leads_csv) as f:
        leads = list(csv.DictReader(f))

    async def with_limit(lead):
        async with sem:
            await dial_lead(lead, os.environ["PUBLIC_URL"])
            await asyncio.sleep(2)  # pace
    await asyncio.gather(*(with_limit(l) for l in leads))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

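&lt;p&gt;The machine_detection="Enable" flag tells Twilio to run answering machine detection and report the result in the AnsweredBy parameter of the webhook request. Hang up when it reports a machine rather than spending a Voice Agent API session on voicemail. Important: never leave a recorded message; that's a TCPA violation in most contexts.&lt;/p&gt;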
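
&lt;p&gt;Twilio reports the detection result as an AnsweredBy form field on the webhook request, with values such as human, machine_start, fax, and unknown. A minimal sketch of the routing decision follows. The function name route_answered_by is hypothetical, and treating unknown as a live person is a judgment call you may want to reverse:&lt;/p&gt;

```python
# AnsweredBy values, other than the machine_* family, that mean no live human.
NON_HUMAN = {"fax"}


def route_answered_by(answered_by):
    """Return "hangup" for machines and faxes, "connect" otherwise."""
    value = (answered_by or "unknown").lower()
    if value in NON_HUMAN or value.startswith("machine"):
        return "hangup"
    # "human" and "unknown" both connect; dropping unknowns would lose real people.
    return "connect"


for sample in ("human", "machine_start", "fax", "unknown"):
    print(sample, route_answered_by(sample))
```

&lt;p&gt;In the /twilio/voice handler you would read AnsweredBy from the posted form and return a TwiML response containing only a Hangup verb whenever this function returns "hangup".&lt;/p&gt;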

&lt;h2&gt;
  
  
  Step 5: Bridge Twilio Media Streams to the Voice Agent API
&lt;/h2&gt;

&lt;p&gt;The bridge server is what connects Twilio's outbound call audio to the Voice Agent API WebSocket. Twilio sends G.711 μ-law at 8 kHz; the Voice Agent API accepts it natively when you set the encoding to audio/pcmu.&lt;/p&gt;

&lt;p&gt;A few details that are easy to get wrong on this endpoint specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The auth header is Authorization: Bearer YOUR_KEY — note the &lt;strong&gt;Bearer&lt;/strong&gt; prefix. This is unique to the Voice Agent API; the rest of AssemblyAI accepts the raw key.&lt;/li&gt;
&lt;li&gt;The first WebSocket message is a session.update event with all config nested under a session object. There is no session.start.&lt;/li&gt;
&lt;li&gt;The agent's voice is a named voice from the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices" rel="noopener noreferrer"&gt;Voice Agent API catalog&lt;/a&gt; (ivy, james, sophie, etc.) — not an ElevenLabs voice ID.&lt;/li&gt;
&lt;li&gt;The telephony audio encoding is audio/pcmu (G.711 μ-law). Sample rate is implicit (8 kHz). Don't pass pcm_mulaw or a sample_rate field — the API ignores them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You must wait for session.ready before sending any input.audio frames.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# bridge_server.py
import asyncio, json, os
import websockets
from fastapi import FastAPI, Query, Request, WebSocket
from fastapi.responses import Response

from prompts import SYSTEM_PROMPT
from tools import TOOLS, dispatch_tool

VOICE_AGENT_WS = "wss://agents.assemblyai.com/v1/ws"
ASSEMBLYAI_KEY = os.environ["ASSEMBLYAI_API_KEY"]

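app = FastAPI()

# Maps lead_id to the lead dict used to fill the system prompt.
# Assumption: populated at startup from the same lead source the dialer uses.
LEAD_CACHE = {}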

@app.post("/twilio/voice")
async def twilio_voice(request: Request, lead_id: str = Query(...)):
    host = request.url.hostname
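    twiml = f"""&amp;lt;?xml version="1.0" encoding="UTF-8"?&amp;gt;
&amp;lt;Response&amp;gt;
  &amp;lt;Connect&amp;gt;
    &amp;lt;Stream url="wss://{host}/media-stream"&amp;gt;
      &amp;lt;Parameter name="lead_id" value="{lead_id}" /&amp;gt;
    &amp;lt;/Stream&amp;gt;
  &amp;lt;/Connect&amp;gt;
&amp;lt;/Response&amp;gt;"""
    return Response(content=twiml, media_type="application/xml")

@app.websocket("/media-stream")
async def media_stream(twilio_ws: WebSocket):
    await twilio_ws.accept()
    stream_sid = {"value": None}

    # Twilio's Stream url does not carry query strings, so lead_id travels as a
    # custom parameter on the "start" event, which arrives before any media.
    lead_id = None
    async for raw in twilio_ws.iter_text():
        event = json.loads(raw)
        if event.get("event") == "start":
            stream_sid["value"] = event["start"]["streamSid"]
            lead_id = event["start"]["customParameters"]["lead_id"]
            break
    if lead_id is None:
        return  # Twilio disconnected before the stream started
    lead = LEAD_CACHE[lead_id]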

    session_config = {
        "type": "session.update",
        "session": {
            "system_prompt": SYSTEM_PROMPT.format(**lead),
            "tools": TOOLS,
            "input": {"format": {"encoding": "audio/pcmu"}},
            "output": {
                "voice": "ivy",
                "format": {"encoding": "audio/pcmu"},
            },
        },
    }

    async with websockets.connect(
        VOICE_AGENT_WS,
        additional_headers={"Authorization": f"Bearer {ASSEMBLYAI_KEY}"},
    ) as va_ws:
        await va_ws.send(json.dumps(session_config))

        ready = asyncio.Event()
        pending_tools = []

        async def pump_twilio_to_va():
            async for raw in twilio_ws.iter_text():
                event = json.loads(raw)
                kind = event.get("event")
                if kind == "start":
                    stream_sid["value"] = event["start"]["streamSid"]
                elif kind == "media":
                    if not ready.is_set():
                        continue
                    # Twilio sends base64 mulaw; AAI accepts it directly.
                    await va_ws.send(json.dumps({
                        "type": "input.audio",
                        "audio": event["media"]["payload"],
                    }))
                elif kind == "stop":
                    return

        async def pump_va_to_twilio():
            async for raw in va_ws:
                event = json.loads(raw)
                t = event.get("type")

                if t == "session.ready":
                    ready.set()

                elif t == "reply.audio" and stream_sid["value"]:
                    await twilio_ws.send_text(json.dumps({
                        "event": "media",
                        "streamSid": stream_sid["value"],
                        "media": {"payload": event["data"]},
                    }))

                elif t == "tool.call":
                    result = dispatch_tool(event["name"], event.get("arguments", {}))
                    pending_tools.append({"call_id": event["call_id"], "result": result})

                elif t == "reply.done":
                    if event.get("status") == "interrupted":
                        pending_tools.clear()
                    else:
                        for tool in pending_tools:
                            value = tool["result"]
                            if not isinstance(value, str):
                                value = json.dumps(value)
                            await va_ws.send(json.dumps({
                                "type": "tool.result",
                                "call_id": tool["call_id"],
                                "result": value,
                            }))
                        pending_tools.clear()

                elif t == "transcript.user":
                    print(f"[{lead_id}] User: {event['text']}")
                elif t == "transcript.agent":
                    print(f"[{lead_id}] Agent: {event['text']}")

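        # Run both pumps; when either side ends, cancel the other so the
        # Voice Agent API session and the Twilio stream close together.
        tasks = [
            asyncio.create_task(pump_twilio_to_va()),
            asyncio.create_task(pump_va_to_twilio()),
        ]
        _, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)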
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Two subtleties worth understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tool result timing.&lt;/strong&gt; Per the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/tool-calling" rel="noopener noreferrer"&gt;tool calling docs&lt;/a&gt;, accumulate tool results when tool.call fires and send them when reply.done arrives, not immediately. The agent speaks a transition phrase ("let me check") while the tools run; sending results before reply.done causes timing issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audio pass-through.&lt;/strong&gt; Twilio's media.payload and AssemblyAI's input.audio.audio (and reply.audio.data) are all base64-encoded μ-law strings, so the bridge moves bytes through without any decode/re-encode step.&lt;/li&gt;
&lt;/ul&gt;
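
&lt;p&gt;The buffering rule from the first point can be pulled out of the WebSocket loop and tested on its own. The sketch below mirrors the bridge's behavior (JSON-encode non-string results, send on reply.done, drop on interruption). The class name ToolResultBuffer is illustrative and not part of the Voice Agent API:&lt;/p&gt;

```python
import json


class ToolResultBuffer:
    """Hold tool results until the current reply finishes."""

    def __init__(self):
        self._pending = []

    def add(self, call_id, result):
        # Mirror the bridge: anything that is not a string is JSON-encoded.
        if not isinstance(result, str):
            result = json.dumps(result)
        self._pending.append(
            {"type": "tool.result", "call_id": call_id, "result": result}
        )

    def flush(self, status=None):
        """Return the messages to send on reply.done; drop them if interrupted."""
        messages = [] if status == "interrupted" else list(self._pending)
        self._pending.clear()
        return messages


buffer = ToolResultBuffer()
buffer.add("call-1", {"status": "booked"})
for message in buffer.flush(status="completed"):
    print(json.dumps(message))
```

&lt;p&gt;Inside pump_va_to_twilio, the tool.call branch would call add and the reply.done branch would send each message returned by flush.&lt;/p&gt;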

&lt;h2&gt;
  
  
  Compliance: the part most teams underweight
&lt;/h2&gt;

&lt;p&gt;Three things separate a working AI cold-calling agent from a $50,000 TCPA settlement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scrub against the federal DNC registry&lt;/strong&gt; before every call. Integrate a real provider — DNC.gov has a paid programmatic feed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Honor state DNC lists.&lt;/strong&gt; Several states maintain their own — California, Pennsylvania, Indiana, Tennessee. Your scrub vendor should cover these.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-party consent disclosure.&lt;/strong&gt; In CA, FL, PA, WA, and several other states, you must disclose at the top of the call that the call is being recorded and that the caller is AI. Your system prompt's DISCLOSURE section is doing this work — never remove it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build all three as hard gates. If any check fails, the call doesn't go out. Log every disposition with a timestamp so you can prove compliance during an audit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring success
&lt;/h2&gt;

&lt;p&gt;Three numbers tell you whether your AI cold-calling agent is working (see our broader &lt;a href="https://www.assemblyai.com/blog/ai-voice-agents" rel="noopener noreferrer"&gt;AI voice agents guide&lt;/a&gt; for context on conversion metrics across use cases):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connection rate&lt;/strong&gt;: percentage of calls that reach a live human. Healthy: 30–50% with a local-presence dialer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation rate&lt;/strong&gt;: percentage of connected calls that last more than 30 seconds. Healthy: 25–40%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Book rate&lt;/strong&gt;: percentage of conversations that end in a booked meeting. Healthy: 5–15% for warm/intent leads, 1–3% for cold lists.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read every transcript for the first 500 calls. You'll catch prompt failures, silently wrong transcriptions on company names, and tool-call timing issues that you'd never notice listening to the audio.&lt;/p&gt;
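
&lt;p&gt;Each of the three rates uses the previous stage as its denominator, which is easy to get wrong in a dashboard. A short sketch with made-up daily counts; funnel_rates is a hypothetical helper, not something from the repo:&lt;/p&gt;

```python
def funnel_rates(dials, connected, conversations, booked):
    """Return (connection, conversation, book) rates as fractions."""
    def rate(part, whole):
        return part / whole if whole else 0.0

    return (
        rate(connected, dials),           # live humans per dial
        rate(conversations, connected),   # calls past 30 s per live human
        rate(booked, conversations),      # meetings per conversation
    )


# Hypothetical day: 1,000 dials, 400 live answers, 120 calls past 30 s, 6 meetings.
connection, conversation, book = funnel_rates(1000, 400, 120, 6)
print(f"{connection:.0%} connect, {conversation:.0%} converse, {book:.0%} book")
```

&lt;p&gt;With these counts the script reports 40% connect, 30% converse, and 5% book, all inside the healthy ranges listed above for warm leads.&lt;/p&gt;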

&lt;h2&gt;
  
  
  The complete repository
&lt;/h2&gt;

&lt;p&gt;Fork the runnable repo at &lt;a href="https://github.com/kelsey-aai/cold-calling-voice-agent-api" rel="noopener noreferrer"&gt;github.com/kelsey-aai/cold-calling-voice-agent-api&lt;/a&gt;. It includes the dialer, the compliance gate, the bridge server, the tool dispatcher, the system prompt, and a sample leads.csv. Around 400 lines of Python total.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do I create an AI cold-calling agent with the Voice Agent API?
&lt;/h3&gt;

&lt;p&gt;To create an AI cold-calling agent with the AssemblyAI Voice Agent API, build four pieces: a dialer that pulls leads from your CRM and places outbound Twilio calls, a compliance gate that scrubs against DNC registries and TCPA time windows, a bridge server that connects Twilio Media Streams to the Voice Agent API WebSocket at wss://agents.assemblyai.com/v1/ws, and a tool dispatcher with book_meeting, log_disposition, honor_dnc, and mark_callback. Define a sales-specific system prompt with disclosure, opener, discovery, pitch, CTA, objection map, and DNC handling rules. The Voice Agent API handles the conversation — your code handles dialing, compliance, and integrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is AI cold-calling legal?
&lt;/h3&gt;

&lt;p&gt;AI cold-calling is legal in most U.S. jurisdictions if you comply with TCPA (federal), state-level consent laws, and disclose that the caller is AI. Specifically: scrub against the federal DNC registry before every call, respect TCPA calling windows (no calls before 8am or after 9pm in the recipient's local time), get two-party consent for recording in states that require it (CA, FL, PA, WA, and others), and disclose AI identity at the top of the call. The cost of getting this wrong is steep — $500–$1,500 per violating call. Build the compliance gate as a hard barrier and consult legal counsel before scaling.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does it cost to run an AI cold-calling agent?
&lt;/h3&gt;

&lt;p&gt;On the AssemblyAI Voice Agent API, you pay $4.50/hour of session time — STT, LLM, TTS, turn detection, and tool calls included. Twilio outbound voice adds a few cents per minute. A typical 90-second qualification call costs roughly $0.12–$0.18 all-in. At the typical 30–50% connection rate, the cost per actual conversation is closer to $0.30. Compare against a human SDR at fully-loaded $70–100/hour and the unit economics generally favor the agent for high-volume top-of-funnel motions.&lt;/p&gt;
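
&lt;p&gt;The arithmetic behind those figures is short enough to check yourself. In this sketch the $4.50/hour rate comes from the answer above, while the Twilio per-minute rate is an assumed placeholder that you should replace with your own. Dividing by the connection rate treats every dial as costing a full call, so the result is an upper bound:&lt;/p&gt;

```python
VOICE_AGENT_PER_HOUR = 4.50   # session rate quoted above
TWILIO_PER_MINUTE = 0.014     # assumption: substitute your actual Twilio rate


def cost_per_call(seconds):
    """All-in cost of one call of the given length, in dollars."""
    agent = VOICE_AGENT_PER_HOUR * seconds / 3600
    telephony = TWILIO_PER_MINUTE * seconds / 60
    return agent + telephony


def cost_per_conversation(call_cost, connection_rate):
    """Upper bound: assumes unanswered dials cost as much as answered ones."""
    return call_cost / connection_rate


call = cost_per_call(90)
print(f"per call: ${call:.2f}")
print(f"per conversation at 40% connect: ${cost_per_conversation(call, 0.40):.2f}")
```

&lt;p&gt;Under these assumptions a 90-second call comes to about $0.13 and a conversation to about $0.33. Hanging up early on voicemail pulls the real number below that bound.&lt;/p&gt;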

&lt;h3&gt;
  
  
  What speech-to-text accuracy do I need for cold-calling?
&lt;/h3&gt;

&lt;p&gt;The accuracy that matters for cold-calling is &lt;strong&gt;alphanumeric accuracy on phone audio&lt;/strong&gt; — capturing emails, phone numbers, company names, and job titles correctly the first time. Universal-3 Pro Streaming, which is the STT layer under the Voice Agent API, delivers 21% fewer alphanumeric errors and 28% better accuracy on consecutive numbers than the previous generation. That accuracy is the difference between booking a meeting in the rep's calendar (&lt;a href="mailto:alex@acme.io"&gt;alex@acme.io&lt;/a&gt;) and a typo your CRM never catches (&lt;a href="mailto:alec@akme.io"&gt;alec@akme.io&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Can the Voice Agent API place outbound calls directly?
&lt;/h3&gt;

&lt;p&gt;Today, you use Twilio (or another telephony provider) for the outbound dial, and bridge the resulting Media Stream into the Voice Agent API WebSocket. The Voice Agent API handles the conversation; Twilio handles the PSTN connection and the audio transport. Native outbound dialing through the Voice Agent API is on the roadmap — the bridge pattern in this tutorial is the standard path today, and the code in the companion repo handles it cleanly in about 100 lines.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>telephony</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Multi-language voice agents: Building agents that speak to anyone</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 19 May 2026 19:25:44 +0000</pubDate>
      <link>https://dev.to/martschweiger/multi-language-voice-agents-building-agents-that-speak-to-anyone-40fk</link>
      <guid>https://dev.to/martschweiger/multi-language-voice-agents-building-agents-that-speak-to-anyone-40fk</guid>
      <description>&lt;p&gt;Building multilingual &lt;a href="https://www.assemblyai.com/blog/ai-voice-agents" rel="noopener noreferrer"&gt;voice agents&lt;/a&gt; requires coordinating four critical components—speech-to-text, language models, text-to-speech, and orchestration software—all working together within strict timing constraints to maintain natural conversation flow. The challenge isn't just connecting these pieces; each component must handle multiple languages, accents, and real-time language switching while keeping responses under one second.&lt;/p&gt;

&lt;p&gt;This guide walks you through the &lt;a href="https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents" rel="noopener noreferrer"&gt;technical architecture&lt;/a&gt;, performance requirements, and implementation considerations for production multilingual voice agents. You'll learn how to handle automatic language detection, manage code-switching scenarios where users mix languages mid-sentence, and build systems that maintain conversation context across language transitions—essential knowledge for creating voice experiences that truly work for global audiences.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the core components of a multilingual voice agent?
&lt;/h2&gt;

&lt;p&gt;A multilingual voice agent is an AI system that listens to speech in multiple languages, understands what you're saying, and responds back in natural conversation. This means it can handle a customer service call where someone starts speaking Spanish, switches to English for technical terms, then back to Spanish—all in real-time.&lt;/p&gt;

&lt;p&gt;You need four components working together: speech-to-text converts your voice to text, language models understand and generate responses, text-to-speech converts responses back to speech, and orchestration software coordinates everything within milliseconds.&lt;/p&gt;

&lt;p&gt;The challenge isn't just connecting these pieces. Each component must handle multiple languages while keeping the conversation feeling natural and fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  Speech-to-text for multilingual support
&lt;/h3&gt;

&lt;p&gt;Speech-to-text (STT) is the foundation that converts spoken words into text that AI models can understand. This means turning "¿Puedes ayudarme?" into text that the system can process, regardless of accent or speaking speed.&lt;/p&gt;

&lt;p&gt;You have two main processing options: &lt;a href="https://www.assemblyai.com/blog/introducing-multilingual-universal-streaming" rel="noopener noreferrer"&gt;streaming transcription&lt;/a&gt; that processes speech as you speak, and batch processing that waits for the complete recording. Voice agents need streaming transcription because users expect a response the moment they finish talking.&lt;/p&gt;

&lt;p&gt;Here's what makes multilingual STT challenging:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Language detection:&lt;/strong&gt; The system must identify which language you're speaking within seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accent handling:&lt;/strong&gt; Spanish from Mexico sounds different from Spanish from Argentina&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code-switching:&lt;/strong&gt; When you mix languages mid-sentence like "Can you check mi cuenta"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your speech-to-text gets "schedule appointment" wrong as "cancel appointment," even perfect AI models downstream can't fix that error.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language models and multilingual reasoning
&lt;/h3&gt;

&lt;p&gt;Language models take the transcribed text and figure out what you actually want, then generate appropriate responses. Large Language Models (LLMs) handle multiple languages through two approaches: translating everything to one language internally, or processing multiple languages directly.&lt;/p&gt;

&lt;p&gt;Direct multilingual processing works better because it keeps cultural context intact. "How can I help you?" and "¿En qué puedo ayudarle?" aren't just translations—they carry different levels of formality that matter for customer experience.&lt;/p&gt;

&lt;p&gt;Your language model also needs to remember context when you switch languages. If you start in Spanish, use English technical terms, then return to Spanish, the model must follow along without losing track of what you're trying to accomplish.&lt;/p&gt;

&lt;h3&gt;
  
  
  Text-to-speech synthesis across languages
&lt;/h3&gt;

&lt;p&gt;Text-to-speech (TTS) turns the AI's written response back into natural speech. This isn't just pronunciation—it's matching the rhythm, emotion, and cultural tone appropriate for each language.&lt;/p&gt;

&lt;p&gt;Modern TTS systems offer multiple voice options per language:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Demographics:&lt;/strong&gt; Different ages, genders, and speaking styles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regional accents:&lt;/strong&gt; British vs American English, European vs Latin American Spanish&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tone matching:&lt;/strong&gt; Professional for banking, casual for shopping, empathetic for support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some languages create unique challenges. Mandarin uses pitch to change word meaning, while Arabic connects words in complex ways that affect pronunciation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time orchestration and coordination
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/blog/orchestration-tools-ai-voice-agents" rel="noopener noreferrer"&gt;Orchestration software&lt;/a&gt; acts like air traffic control for your voice agent. This means managing timing between components, handling interruptions when users start speaking again, and keeping conversation state—all while staying under one second response time.&lt;/p&gt;

&lt;p&gt;Think of orchestration as the conductor making sure your voice agent doesn't talk over users, doesn't lose context, and recovers gracefully from errors.&lt;/p&gt;

&lt;p&gt;Key responsibilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline management:&lt;/strong&gt; Moving data smoothly between STT, LLM, and TTS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interruption handling:&lt;/strong&gt; Stopping playback when users interrupt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State tracking:&lt;/strong&gt; Remembering conversation history and language preferences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error recovery:&lt;/strong&gt; Handling network issues without breaking the conversation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What are the performance requirements for multilingual voice agents?
&lt;/h2&gt;

&lt;p&gt;Users expect voice agents to respond within one second of finishing their sentence. Anything longer makes conversations feel awkward and unnatural.&lt;/p&gt;

&lt;p&gt;Here's where that crucial second gets spent:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Time used&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Speech-to-text&lt;/td&gt;
&lt;td&gt;200–400ms&lt;/td&gt;
&lt;td&gt;Converting your speech to text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM processing&lt;/td&gt;
&lt;td&gt;100–300ms&lt;/td&gt;
&lt;td&gt;Understanding and generating response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text-to-speech&lt;/td&gt;
&lt;td&gt;300–600ms&lt;/td&gt;
&lt;td&gt;Converting response to speech&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network overhead&lt;/td&gt;
&lt;td&gt;50–100ms&lt;/td&gt;
&lt;td&gt;Data moving between systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total target&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Under 1000ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Must stay under one second&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Multilingual support makes these targets harder to hit. Language detection adds time, some languages process slower than others, and translation (when needed) creates additional delays.&lt;/p&gt;
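
&lt;p&gt;Summing the ranges in the table shows how little slack there is. The sketch below only adds up the table's own numbers; the dictionary keys and the budget_range helper are illustrative names:&lt;/p&gt;

```python
# (low, high) estimates in milliseconds, taken from the table above.
BUDGET_MS = {
    "speech_to_text": (200, 400),
    "llm": (100, 300),
    "text_to_speech": (300, 600),
    "network": (50, 100),
}
TARGET_MS = 1000


def budget_range(budget):
    """Sum the best-case and worst-case latency across all components."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


low, high = budget_range(BUDGET_MS)
print(f"best case {low} ms, worst case {high} ms, target {TARGET_MS} ms")
print("worst case over target by", max(0, high - TARGET_MS), "ms")
```

&lt;p&gt;The best case lands at 650 ms, but the worst case reaches 1400 ms, so every component has to run near the low end of its range to stay under one second.&lt;/p&gt;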

&lt;h3&gt;
  
  
  Latency requirements for conversational quality
&lt;/h3&gt;

&lt;p&gt;The one-second rule comes from natural human conversation patterns. People typically pause 200–500ms before responding, so a voice agent responding in 800ms feels natural while 1500ms creates awkward silence.&lt;/p&gt;

&lt;p&gt;But perceived speed matters more than actual speed. If your agent starts responding quickly—even with "Let me check that for you"—users perceive faster service than an agent that stays silent for 800ms then gives a complete answer.&lt;/p&gt;

&lt;p&gt;Streaming helps here. Instead of waiting for complete responses, you can start speaking as soon as the first few words are ready. This cuts perceived latency by 30–40% while keeping the same actual processing time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy requirements across languages and accents
&lt;/h3&gt;

&lt;p&gt;You need at least 90% word accuracy across all supported languages for reliable voice agents. The challenge? That 90% must work for English speakers from Boston, Spanish speakers from Mexico, and Mandarin speakers from Beijing—not just clear, neutral accents.&lt;/p&gt;

&lt;p&gt;Errors compound through your pipeline. If speech-to-text achieves 85% accuracy and your language model correctly interprets 90% of that text, you're down to about 76.5% end-to-end accuracy (0.85 × 0.90 = 0.765). At that rate, nearly one interaction in four goes wrong somewhere along the way.&lt;/p&gt;
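
&lt;p&gt;The compounding is plain multiplication, under the simplifying assumption that each stage fails independently of the others. A tiny sketch that makes it easy to try other stage accuracies (end_to_end is an illustrative name):&lt;/p&gt;

```python
from math import prod


def end_to_end(stage_accuracies):
    """Multiply per-stage accuracies, assuming stage errors are independent."""
    return prod(stage_accuracies)


# The example from the text, then the same pipeline with better speech-to-text.
print(f"{end_to_end([0.85, 0.90]):.1%}")
print(f"{end_to_end([0.95, 0.90]):.1%}")
```

&lt;p&gt;Raising speech-to-text accuracy from 85% to 95% lifts the end-to-end figure by about nine points without touching the language model.&lt;/p&gt;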

&lt;p&gt;Critical accuracy areas include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Names and addresses:&lt;/strong&gt; Personal information must be captured exactly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Numbers:&lt;/strong&gt; Account numbers, phone numbers, and dollar amounts can't have errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent preservation:&lt;/strong&gt; The core request must survive even if some words are wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;High-quality speech-to-text models like AssemblyAI's Universal-2 model support 99 languages with industry-leading accuracy, creating a reliable foundation when errors can't be tolerated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key implementation considerations
&lt;/h2&gt;

&lt;p&gt;Moving from prototype to production means solving practical challenges that don't show up in demos. These details often determine whether your voice agent delights users or frustrates them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language detection and real-time switching
&lt;/h3&gt;

&lt;p&gt;Automatic language detection sounds straightforward—identify the language and proceed. Real conversations are messier. Users greet in one language then switch to another, use technical English terms while speaking Spanish, or have accents that confuse detection.&lt;/p&gt;

&lt;p&gt;Most successful systems use a hybrid approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial detection:&lt;/strong&gt; Identify language from the first 2–3 seconds of speech&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence scoring:&lt;/strong&gt; Avoid false switches when detection isn't certain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context clues:&lt;/strong&gt; Use user profiles or phone number regions as hints&lt;/li&gt;
&lt;/ul&gt;
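
&lt;p&gt;The confidence-scoring idea can be sketched as a small state machine. This is a hypothetical illustration: the LanguageTracker class, the 0.8 threshold, and the two-detection streak are assumptions to tune against real calls, and the detected language and confidence values are presumed to come from whichever speech-to-text service you use.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class LanguageTracker:
    """Tracks the active language and switches only on confident detections."""

    active: str                    # starting hint, e.g. from a user profile
    switch_threshold: float = 0.8  # hypothetical cutoff; tune on real calls
    required_streak: int = 2       # consecutive confident detections needed
    _candidate: str = ""
    _streak: int = 0

    def observe(self, detected: str, confidence: float) -> str:
        """Feed one detection result and return the language to use next."""
        is_confident = confidence >= self.switch_threshold
        if detected == self.active or not is_confident:
            # Same language, or too uncertain to act on: drop any pending switch.
            self._candidate, self._streak = "", 0
            return self.active
        if detected == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = detected, 1
        if self._streak >= self.required_streak:
            self.active = detected
            self._candidate, self._streak = "", 0
        return self.active


tracker = LanguageTracker(active="en")
for lang, conf in [("es", 0.6), ("es", 0.9), ("es", 0.95)]:
    print(lang, conf, "->", tracker.observe(lang, conf))
```

&lt;p&gt;Requiring a streak trades a slightly slower switch for fewer false switches caused by a single borrowed word or an accent that confuses detection.&lt;/p&gt;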

&lt;p&gt;The trickiest scenario? &lt;a href="https://www.assemblyai.com/blog/real-time-transcription-code-switches-multilingual-speakers" rel="noopener noreferrer"&gt;Code-switching&lt;/a&gt; where users naturally mix languages mid-sentence. "Can you check mi cuenta, I think there's a problem" requires handling English and Spanish simultaneously without breaking conversation flow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing multilingual voice agent accuracy
&lt;/h3&gt;

&lt;p&gt;Testing multilingual voice agents requires systematic validation across language combinations, not just individual languages. A system perfect in English and Spanish separately might fail when users switch between them.&lt;/p&gt;

&lt;p&gt;Start with single-language testing using native speakers with various accents and natural speaking styles. Record actual conversations, not scripted readings—natural speech includes hesitations, corrections, and informal phrases that scripts miss.&lt;/p&gt;

&lt;p&gt;Then test language transitions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mixed conversations:&lt;/strong&gt; Spanish speakers using English product names&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical explanations:&lt;/strong&gt; Users switching languages to explain complex issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cultural context:&lt;/strong&gt; Different communication styles across cultures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Essential testing scenarios include accent variations across regions, background noise from realistic environments, different speaking speeds, and code-switching patterns common in your user base.&lt;/p&gt;
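
&lt;p&gt;One common way to score these test recordings is word error rate (WER): word-level edit distance divided by the number of reference words. The sketch below is a plain-Python version for illustration. It lowercases and splits on whitespace only, so punctuation handling and per-language tokenization are left as assumptions for you to adapt.&lt;/p&gt;

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # previous[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# A code-switched utterance where one Spanish word was misrecognized.
wer = word_error_rate("can you check mi cuenta", "can you check my cuenta")
print(f"WER: {wer:.0%}")
```

&lt;p&gt;Tracking WER separately for each language, accent group, and code-switching scenario shows where the system is weakest, which a single blended number would hide.&lt;/p&gt;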

&lt;h2&gt;
  
  
  Common use cases for multilingual voice agents
&lt;/h2&gt;

&lt;p&gt;Multilingual voice agents excel where businesses need to serve diverse populations efficiently. Here are three high-impact applications you're likely to encounter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customer support automation
&lt;/h3&gt;

&lt;p&gt;Customer support represents the biggest deployment of multilingual voice agents today. These systems handle routine requests—password resets, balance checks, order tracking—in dozens of languages without requiring multilingual human agents for every shift.&lt;/p&gt;

&lt;p&gt;Success depends on seamless escalation to humans. When the voice agent can't resolve an issue, it must transfer the caller to a human agent while preserving conversation context and language preference. Nobody wants to repeat their problem in a different language.&lt;/p&gt;

&lt;p&gt;Integration with existing systems matters here. The voice agent needs access to account information and the ability to update records in real time. This means a Spanish-speaking customer can check order status, update delivery addresses, and receive confirmation without waiting for a Spanish-speaking human agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Voice assistants for global applications
&lt;/h3&gt;

&lt;p&gt;Consumer apps use multilingual voice assistants to reach global markets. Think banking apps that let you check balances, transfer money, or report lost cards through voice commands in your preferred language.&lt;/p&gt;

&lt;p&gt;These applications need cultural adaptation beyond translation. A voice assistant in Japan should understand indirect communication styles, while one in New York can be more direct. The same request gets phrased completely differently based on cultural expectations.&lt;/p&gt;

&lt;p&gt;Privacy becomes critical with sensitive financial or personal information. Your voice agent must handle this data across different regulatory environments while maintaining consistent service quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contact center automation
&lt;/h3&gt;

&lt;p&gt;Enterprise &lt;a href="https://www.assemblyai.com/solutions/contact-centers" rel="noopener noreferrer"&gt;contact centers&lt;/a&gt; deploy multilingual voice agents to handle peak call volumes and provide 24/7 coverage. Instead of staffing overnight shifts with multilingual agents, you deploy voice agents that handle routine calls in any supported language.&lt;/p&gt;

&lt;p&gt;The business case is clear: one multilingual voice agent replaces dozens of language-specific phone menu systems while providing better service. Callers get natural conversation instead of pressing buttons through complex menus.&lt;/p&gt;

&lt;p&gt;Compliance considerations vary by industry and caller location. Your voice agent must adapt its behavior for call recording requirements, data retention rules, and disclosure obligations based on applicable regulations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final words
&lt;/h2&gt;

&lt;p&gt;Building reliable multilingual voice agents requires coordinating speech-to-text, language models, text-to-speech, and orchestration—all working within tight timing constraints that keep conversations natural. Your foundation starts with accurate speech recognition, because transcription errors cascade through every step, turning helpful interactions into frustrated customers.&lt;/p&gt;

&lt;p&gt;The implementation challenges we've covered show why thoughtful architecture matters more than raw technology. With accurate transcription as your starting point, you can build voice agents that truly communicate with anyone, anywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What components do I need to build a multilingual voice agent?
&lt;/h3&gt;

&lt;p&gt;You need four integrated components: speech-to-text for transcribing the caller's audio, language models for understanding and generating responses, text-to-speech for voice synthesis, and orchestration software to coordinate everything in real time, within one second end to end.&lt;/p&gt;

&lt;h3&gt;
  
  
  How quickly do multilingual voice agents need to respond?
&lt;/h3&gt;

&lt;p&gt;Target under 1000ms end-to-end latency for natural conversation flow. This includes 200–400ms for speech-to-text, 100–300ms for language model processing, and 300–600ms for text-to-speech synthesis.&lt;/p&gt;
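
&lt;p&gt;A quick budget check makes the constraint concrete. The sketch below sums the per-stage ranges quoted above; the function and key names are illustrative. It assumes the stages run strictly one after another, which streaming pipelines avoid by overlapping them.&lt;/p&gt;

```python
# Per-stage latency ranges in milliseconds, taken from the answer above.
STAGE_RANGES_MS = {
    "speech_to_text": (200, 400),
    "language_model": (100, 300),
    "text_to_speech": (300, 600),
}
TARGET_MS = 1000


def budget_summary(ranges, target_ms):
    """Sum best-case and worst-case stage latencies against a target."""
    best = sum(low for low, _ in ranges.values())
    worst = sum(high for _, high in ranges.values())
    return {"best_ms": best, "worst_ms": worst, "headroom_ms": target_ms - worst}


def within_budget(measured_ms, target_ms=TARGET_MS):
    """Check one measured turn (stage name mapped to milliseconds)."""
    return target_ms >= sum(measured_ms.values())


print(budget_summary(STAGE_RANGES_MS, TARGET_MS))
```

&lt;p&gt;The best case totals 600ms and the worst case 1,300ms, so the stages cannot all run at their upper bounds and still meet the target. At least one stage has to sit near its lower bound, or the stages have to overlap through streaming.&lt;/p&gt;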

&lt;h3&gt;
  
  
  Can voice agents detect language automatically during conversations?
&lt;/h3&gt;

&lt;p&gt;Yes, modern speech-to-text models detect language within the first 2–3 seconds of speech and can handle language switches mid-conversation. The system maintains conversation context across language changes without requiring users to specify their language preference.&lt;/p&gt;

&lt;h3&gt;
  
  
  What speech accuracy do I need for multilingual voice agents?
&lt;/h3&gt;

&lt;p&gt;Aim for at least 90% word accuracy across all supported languages and accents. Lower accuracy causes errors to compound through the pipeline, reducing end-to-end reliability below acceptable thresholds for production deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I test multilingual voice agent performance before launch?
&lt;/h3&gt;

&lt;p&gt;Test systematically with native speakers across regional accents, speaking speeds, and background noise conditions. Validate both single-language accuracy and language-switching scenarios, measuring word error rates, intent recognition, and task completion rates.&lt;/p&gt;

&lt;h3&gt;
  
  
  What infrastructure supports multilingual voice agents at scale?
&lt;/h3&gt;

&lt;p&gt;You need &lt;a href="https://www.assemblyai.com/blog/choosing-a-stt-api-for-voice-agents" rel="noopener noreferrer"&gt;streaming speech-to-text APIs&lt;/a&gt;, multilingual language model services, text-to-speech capabilities, and orchestration platforms that handle concurrent conversations. The infrastructure must scale horizontally without degrading response times.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do multilingual voice agents handle mixed-language conversations?
&lt;/h3&gt;

&lt;p&gt;Advanced speech-to-text models can transcribe code-switching where speakers mix languages mid-sentence. Success depends on training data that includes natural bilingual speech patterns and systems designed to maintain context across language transitions.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>i18n</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Build a voice agent for telehealth triage</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 19 May 2026 19:25:34 +0000</pubDate>
      <link>https://dev.to/martschweiger/build-a-voice-agent-for-telehealth-triage-33j</link>
      <guid>https://dev.to/martschweiger/build-a-voice-agent-for-telehealth-triage-33j</guid>
      <description>

&lt;p&gt;A telehealth triage voice agent answers a patient's call, captures symptoms in their own words, scores severity against a defined protocol, and routes the patient to the right care level — emergency, urgent care, virtual visit, or self-care guidance. It doesn't diagnose, doesn't prescribe, and doesn't decide; it triages, in the same way an experienced nurse on a phone line would, then hands off with structured notes attached.&lt;/p&gt;

&lt;p&gt;This tutorial walks through building one on the AssemblyAI Voice Agent API with a clinical-specialty prompt and the architectural controls HIPAA requires — encrypted audio, BAA-backed deployment, PII redaction, and audit logging. We'll cover the triage protocol, symptom capture, severity scoring with tool calls, and the handoff that gets the patient to the right next step. The companion repository is linked at the end.&lt;/p&gt;

&lt;p&gt;This is a triage agent, not a clinical decision-maker. Everything in this guide assumes a human clinician makes the final call — the voice agent's job is to capture the data, run the protocol, and route the patient.&lt;/p&gt;

&lt;h2&gt;
  
  
  What telehealth triage looks like as a voice agent
&lt;/h2&gt;

&lt;p&gt;A triage call follows a predictable structure. The agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Greets the patient and confirms identity (name, date of birth)&lt;/li&gt;
&lt;li&gt;Asks for the chief complaint in the patient's own words&lt;/li&gt;
&lt;li&gt;Walks through a symptom protocol (when did it start, severity, associated symptoms)&lt;/li&gt;
&lt;li&gt;Captures red-flag symptoms that escalate severity&lt;/li&gt;
&lt;li&gt;Calls a score_severity tool that runs the captured symptoms through a triage algorithm&lt;/li&gt;
&lt;li&gt;Routes the patient — ER (911), urgent care, scheduled visit, or self-care&lt;/li&gt;
&lt;li&gt;Logs structured notes to the EHR for the receiving clinician&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This pattern works for telehealth voice agents because it has a defined protocol, concrete success criteria (was the patient routed correctly?), and a clear failure mode (escalate to a human nurse if anything is unclear). It's not asking the voice agent to diagnose.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use the Voice Agent API for telehealth triage
&lt;/h2&gt;

&lt;p&gt;Three properties matter specifically for healthcare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speech accuracy on medical terminology.&lt;/strong&gt; Patients say "metoprolol" and "lisinopril" and "I have a history of A-fib." A model that mishears any of these creates a downstream safety issue. &lt;a href="https://www.assemblyai.com/universal-3-pro-streaming" rel="noopener noreferrer"&gt;Universal-3 Pro Streaming&lt;/a&gt;, the STT layer under the &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;Voice Agent API&lt;/a&gt;, performs strongly on medical conversations; for post-call note generation and billing-grade documentation, AssemblyAI's &lt;a href="https://www.assemblyai.com/docs/pre-recorded-audio/medical-mode" rel="noopener noreferrer"&gt;Medical Mode&lt;/a&gt; async API is purpose-built for clinical terminology.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BAA-backed deployment for processing PHI.&lt;/strong&gt; AssemblyAI enables covered entities and their business associates subject to HIPAA to use AssemblyAI services to process protected health information (PHI), and offers a Business Associate Addendum (BAA) required under HIPAA. Without a BAA you legally cannot route PHI through the service, regardless of how good the model is. &lt;a href="https://www.assemblyai.com/contact/sales" rel="noopener noreferrer"&gt;Contact our sales team&lt;/a&gt; to execute a BAA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling for protocolized triage.&lt;/strong&gt; The triage protocol lives in tool calls — score_severity, route_to_care_level, schedule_callback, escalate_to_nurse. The agent calls tools rather than generating free-form clinical guidance, which is what keeps the system inside the bounds of triage and out of the bounds of diagnosis.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Patient call (PSTN via Twilio, or telehealth app)
        │
        ▼
  Voice Agent API (one WebSocket)
   ┌────────────────────────────────────┐
   │  Universal-3 Pro Streaming (STT)    │
   │     ↓                               │
   │  LLM with triage protocol           │
   │     ↓                               │
   │  TTS                                │
   └────────────────────────────────────┘
        │
        │  tool calls
        ▼
   Tool dispatcher
    - capture_symptom         (structured)
    - score_severity          (runs triage algorithm)
    - route_to_care_level     (ER / urgent / scheduled / self-care)
    - escalate_to_nurse       (live RN handoff)
    - log_to_ehr              (encrypted PHI write)

  (post-call)
        │
        ▼
   Async Medical Mode API
   - billing-grade SOAP note
   - ICD-10 candidate codes
   - quality review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Voice Agent API runs the patient-facing conversation. The protocol logic lives in your tools. Post-call documentation goes through the async Medical Mode API for clinical-quality notes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before you start
&lt;/h2&gt;

&lt;p&gt;You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;AssemblyAI account&lt;/a&gt; — for healthcare deployments, &lt;a href="https://www.assemblyai.com/contact/sales" rel="noopener noreferrer"&gt;contact our sales team&lt;/a&gt; to execute a BAA before processing any PHI&lt;/li&gt;
&lt;li&gt;A defined triage protocol from your clinical team. This guide uses a simplified version for illustration; your real protocol should come from licensed clinicians and be reviewed against ESI (Emergency Severity Index) or your organization's equivalent&lt;/li&gt;
&lt;li&gt;An EHR integration target (Epic, Cerner, athena, custom)&lt;/li&gt;
&lt;li&gt;A licensed RN available for live escalations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Don't deploy a telehealth triage agent into production without (1) a BAA executed with AssemblyAI, (2) clinical review of every prompt and tool, (3) an always-available escalation path to a human nurse, and (4) IRB or compliance review per your organization's policies. The agent in this tutorial is a working starter — not a production-ready clinical system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Define the triage protocol in the system prompt
&lt;/h2&gt;

&lt;p&gt;The system prompt is where the protocol lives. Its CRITICAL RULES section is what makes the difference between a triage agent and a chatbot:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SYSTEM_PROMPT = """You are an AI telehealth triage assistant for ACME Health.

You are NOT a doctor. You do NOT diagnose. You do NOT prescribe. Your job is
to capture symptoms, run a triage protocol, and route the patient to the
right care level. A licensed clinician makes the final decision.

CALL FLOW:
1. Greet the patient. Confirm name and date of birth.
2. Ask the chief complaint in their own words. Capture it verbatim using
   capture_symptom(category='chief_complaint', detail=...).
3. Walk through the OPQRST protocol:
   - Onset (when did it start?)
   - Provocation/Palliation (what makes it worse or better?)
   - Quality (sharp, dull, throbbing?)
   - Region/Radiation (where, does it spread?)
   - Severity (1–10)
   - Timing (constant, intermittent?)
   Call capture_symptom for each.
4. Screen for red flags relevant to the complaint:
   - Chest pain / shortness of breath / arm pain → cardiac red flags
   - Severe headache / vision changes / weakness → stroke red flags
   - High fever / stiff neck → meningitis red flags
   - Severe abdominal pain / blood → surgical red flags
   - Suicidal ideation → mental health red flags
   If ANY red flag is present, call escalate_to_nurse IMMEDIATELY and
   say: "These symptoms need immediate attention. I'm connecting you to
   our on-call nurse right now."
5. Call score_severity with all captured symptoms.
6. Based on the result, call route_to_care_level with the recommendation.

CRITICAL RULES:
- Never tell the patient what they have. Use "your symptoms suggest..." not
  "you have...".
- Never recommend medication or dosage changes.
- If the patient asks medical questions outside triage, say:
  "I can't answer that. Let me connect you with our nurse line."
  and call escalate_to_nurse.
- If you're uncertain at any point, escalate.

STYLE:
- Speak calmly. One or two sentences per turn.
- Use plain language, not medical jargon. "Pressure in your chest" not
  "thoracic discomfort".
- Confirm critical details back: "You said the pain started Tuesday — is
  that right?"
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The escalate-on-uncertainty rule is the most important. A triage agent that confidently routes a heart attack to "schedule a visit" is dangerous. One that escalates to a human nurse the moment red flags appear is safe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Define the tools
&lt;/h2&gt;

&lt;p&gt;Each tool needs "type": "function" at the top level — the Voice Agent API validates this on session.update.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TOOLS = [
    {
        "type": "function",
        "name": "capture_symptom",
        "description": "Record a symptom or piece of OPQRST data.",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["chief_complaint", "onset", "provocation",
                             "quality", "region", "severity",
                             "timing", "red_flag"],
                },
                "detail": {"type": "string"},
            },
            "required": ["category", "detail"],
        },
    },
    {
        "type": "function",
        "name": "score_severity",
        "description": (
            "Score the patient's severity based on captured symptoms. "
            "Returns an ESI-style level (1=critical, 5=non-urgent)."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "symptoms": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["symptoms"],
        },
    },
    {
        "type": "function",
        "name": "route_to_care_level",
        "description": "Route the patient to the appropriate care level.",
        "parameters": {
            "type": "object",
            "properties": {
                "level": {
                    "type": "string",
                    "enum": ["emergency", "urgent_care", "scheduled_visit",
                             "self_care"],
                },
                "reason": {"type": "string"},
            },
            "required": ["level", "reason"],
        },
    },
    {
        "type": "function",
        "name": "escalate_to_nurse",
        "description": (
            "Connect the patient to a live registered nurse immediately. "
            "Call this for any red-flag symptom or any time the protocol "
            "is unclear."
        ),
        "parameters": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        },
    },
    {
        "type": "function",
        "name": "log_to_ehr",
        "description": "Write structured triage notes to the EHR.",
        "parameters": {
            "type": "object",
            "properties": {
                "patient_id": {"type": "string"},
                "symptoms": {"type": "object"},
                "severity": {"type": "integer"},
                "disposition": {"type": "string"},
            },
            "required": ["patient_id", "symptoms", "severity", "disposition"],
        },
    },
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The score_severity tool is where your clinical algorithm lives. In the repo, it's a simple rule-based scorer for demonstration; in production, this is the function your clinical team reviews and signs off on.&lt;/p&gt;
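
&lt;p&gt;On the application side, each tool call has to be routed to a handler. Below is a minimal dispatcher sketch under the assumption that tool arguments arrive as a JSON string. That shape, the handler stubs, and the in-memory list are hypothetical, so check the Voice Agent API reference for the actual event format before relying on it.&lt;/p&gt;

```python
import json

captured = []  # in-memory only for this sketch; PHI needs encrypted storage


def capture_symptom(category, detail):
    captured.append({"category": category, "detail": detail})
    return {"ok": True, "count": len(captured)}


def escalate_to_nurse(reason):
    # Stub: a real handler would bridge the call to the on-call RN.
    return {"escalated": True, "reason": reason}


HANDLERS = {
    "capture_symptom": capture_symptom,
    "escalate_to_nurse": escalate_to_nurse,
}


def dispatch(name, arguments_json):
    """Run one tool call; fall back to nurse escalation on any failure."""
    handler = HANDLERS.get(name)
    if handler is None:
        return escalate_to_nurse(f"unknown tool: {name}")
    try:
        return handler(**json.loads(arguments_json))
    except (TypeError, ValueError) as exc:
        # Malformed JSON or wrong argument names: escalate instead of guessing.
        return escalate_to_nurse(f"bad arguments for {name}: {exc}")


print(dispatch("order_pizza", "{}"))
```

&lt;p&gt;Failing toward escalate_to_nurse mirrors the prompt's escalate-on-uncertainty rule: a dispatcher bug should hand the patient to a human, not drop the call silently.&lt;/p&gt;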

&lt;h2&gt;
  
  
  Step 3: Severity scoring logic
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RED_FLAG_KEYWORDS = {
    "cardiac": ["chest pain", "pressure", "tight", "shortness of breath",
                "arm pain", "jaw pain", "sweating"],
    "stroke":  ["face drooping", "weakness", "slurred speech", "vision",
                "confusion"],
    "surgical":["severe abdominal", "blood in stool", "vomiting blood",
                "rigid abdomen"],
    "sepsis":  ["high fever", "stiff neck", "altered mental"],
    "mental":  ["suicidal", "self-harm", "kill myself"],
}

def score_severity(symptoms):
    text = " ".join(s.lower() for s in symptoms)
    for category, keywords in RED_FLAG_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return {"level": 1, "category": category, "route": "emergency"}
    if any(kw in text for kw in ["severe pain", "9/10", "10/10", "can't breathe"]):
        return {"level": 2, "route": "emergency"}
    if any(kw in text for kw in ["moderate pain", "7/10", "8/10", "fever 101", "fever 102"]):
        return {"level": 3, "route": "urgent_care"}
    if any(kw in text for kw in ["mild pain", "5/10", "6/10"]):
        return {"level": 4, "route": "scheduled_visit"}
    return {"level": 5, "route": "self_care"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is illustrative only. Real telehealth triage uses validated scoring (ESI, the Manchester Triage System, or organization-specific protocols) developed and reviewed by clinical staff. Don't ship anything to production without that review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Audit logging and PHI controls
&lt;/h2&gt;

&lt;p&gt;Every transcript event from the Voice Agent API is PHI. Treat it as such:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encrypt at rest.&lt;/strong&gt; Use envelope encryption (KMS) for any persisted audio or transcripts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encrypt in transit.&lt;/strong&gt; The Voice Agent API WebSocket is TLS — no additional work there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit log every access.&lt;/strong&gt; Who read which call, when, from where.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply PII redaction to anything that leaves your VPC.&lt;/strong&gt; Phone numbers, addresses, SSNs, names should be redacted before transcripts hit analytics warehouses or training pipelines.&lt;/li&gt;
&lt;li&gt;
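&lt;strong&gt;Set retention policies.&lt;/strong&gt; Retention requirements vary by state and record type; many healthcare orgs keep triage call transcripts for around 7 years, so confirm the period with your compliance team and configure your storage accordingly.&lt;/li&gt;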
&lt;/ul&gt;
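
&lt;p&gt;As a rough illustration of the redaction step, the sketch below masks US-style phone numbers and SSNs with regular expressions before a transcript leaves your environment. The patterns and the redact helper are assumptions for demonstration only. Names and addresses cannot be caught reliably this way and need a dedicated redaction service or model.&lt;/p&gt;

```python
import re

# Illustrative US-centric patterns; order matters because each pass rewrites the text.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("Call me at 555-123-4567, SSN 123-45-6789."))
```

&lt;p&gt;Treat a regex pass like this as a last-line safety net, not the primary control. Run it on copies bound for analytics, and keep the unredacted record only in the encrypted, access-logged store.&lt;/p&gt;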

&lt;p&gt;The Voice Agent API's events (transcript.user, transcript.agent, tool.call, tool.result) are exactly what you'd write to the EHR. Build the log_to_ehr tool to flush a structured record at the end of every call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Test against representative cases
&lt;/h2&gt;

&lt;p&gt;Before any patient calls the agent, run it against a clinical test suite:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Case&lt;/th&gt;
&lt;th&gt;Expected route&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"I have crushing chest pain and my left arm is numb"&lt;/td&gt;
&lt;td&gt;emergency (cardiac red flag)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"I have a fever of 102 and a stiff neck"&lt;/td&gt;
&lt;td&gt;emergency (sepsis red flag)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"I sprained my ankle yesterday, pain is 5 out of 10"&lt;/td&gt;
&lt;td&gt;urgent_care or scheduled_visit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"I have a runny nose and slight cough for two days"&lt;/td&gt;
&lt;td&gt;self_care&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"I'm having thoughts of hurting myself"&lt;/td&gt;
&lt;td&gt;escalate_to_nurse (mental health red flag)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Run at least 200 cases through the agent with clinician review of every disposition. The cost of a missed escalation is a clinical safety event; the cost of an over-escalation is overuse of the nurse line. Tune until both are within your organization's tolerance.&lt;/p&gt;
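
&lt;p&gt;A small harness can separate the two failure types described above. The sketch below is hypothetical: toy_router stands in for your real agent, the urgency ranking is an assumption, and only the table rows whose expected result is a care level are included.&lt;/p&gt;

```python
# Acceptable routes per utterance, adapted from the table above.
CASES = [
    ("I have crushing chest pain and my left arm is numb", {"emergency"}),
    ("I have a runny nose and slight cough for two days", {"self_care"}),
    ("I sprained my ankle yesterday, pain is 5 out of 10",
     {"urgent_care", "scheduled_visit"}),
]

# Higher rank means more urgent.
URGENCY = {"self_care": 0, "scheduled_visit": 1, "urgent_care": 2, "emergency": 3}


def evaluate(route_fn, cases):
    """Count correct routes and list under-triaged and over-triaged cases."""
    report = {"correct": 0, "missed_escalation": [], "over_escalation": []}
    for utterance, acceptable in cases:
        got = route_fn(utterance)
        if got in acceptable:
            report["correct"] += 1
        elif URGENCY[got] > max(URGENCY[r] for r in acceptable):
            report["over_escalation"].append(utterance)
        else:
            report["missed_escalation"].append(utterance)
    return report


def toy_router(utterance):
    # Stand-in for the real agent; replace with a call into your system.
    return "emergency" if "chest pain" in utterance.lower() else "self_care"


print(evaluate(toy_router, CASES))
```

&lt;p&gt;Keeping the two lists separate lets clinicians review every missed escalation individually while treating over-escalation as a rate to tune.&lt;/p&gt;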

&lt;h2&gt;
  
  
  Step 6: Post-call documentation with Medical Mode
&lt;/h2&gt;

&lt;p&gt;After the call, run the captured audio through AssemblyAI's &lt;a href="https://www.assemblyai.com/docs/pre-recorded-audio/medical-mode" rel="noopener noreferrer"&gt;Medical Mode&lt;/a&gt; async API for billing-grade clinical documentation. Enable it with the domain="medical-v1" parameter on a standard pre-recorded transcript request:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    domain="medical-v1",       # enables Medical Mode
    speaker_labels=True,        # provider/patient separation
    keyterms_prompt=["Lispro", "Humalog", "metoprolol"],
)
call_audio_url = "https://example.com/triage-call.wav"  # placeholder: URL of your recorded call
transcript = aai.Transcriber().transcribe(call_audio_url, config)
# Then send transcript.text through the LLM Gateway for SOAP generation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Medical Mode is purpose-built for medication names, procedures, conditions, and dosages — it's billed as a separate add-on (see &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;pricing&lt;/a&gt;). Combine it with &lt;a href="https://www.assemblyai.com/docs/guides/soap-note-generation" rel="noopener noreferrer"&gt;LLM Gateway SOAP generation&lt;/a&gt; to produce structured chart entries from the transcript.&lt;/p&gt;

&lt;h2&gt;
  
  
  The complete repository
&lt;/h2&gt;

&lt;p&gt;Fork the runnable repo at &lt;a href="https://github.com/kelsey-aai/telehealth-triage-voice-agent" rel="noopener noreferrer"&gt;github.com/kelsey-aai/telehealth-triage-voice-agent&lt;/a&gt;. It includes the triage agent loop, the OPQRST protocol prompt, the red-flag scorer, the routing logic, and a sample EHR adapter stub. Around 350 lines of Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do I build a voice agent for telehealth triage?
&lt;/h3&gt;

&lt;p&gt;To build a voice agent for telehealth triage, open an AssemblyAI Voice Agent API session with a clinical-specialty system prompt that walks the patient through an OPQRST symptom protocol, screens for red flags, and routes via tool calls. The agent should never diagnose or prescribe — it captures symptoms with capture_symptom, scores severity with score_severity (your clinical algorithm), routes via route_to_care_level, and escalates to a live RN through escalate_to_nurse whenever red flags appear or the protocol is unclear. All of this runs inside one WebSocket at wss://agents.assemblyai.com/v1/ws, with audit logging, encrypted transcripts, and a BAA executed with AssemblyAI before any PHI is processed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use the Voice Agent API for healthcare workflows subject to HIPAA?
&lt;/h3&gt;

&lt;p&gt;AssemblyAI is considered a business associate under HIPAA and offers a standard Business Associate Addendum (BAA) for customers processing PHI. Before processing any PHI you need to execute the BAA with AssemblyAI — &lt;a href="https://www.assemblyai.com/contact/sales" rel="noopener noreferrer"&gt;contact our sales team&lt;/a&gt;. The Voice Agent API uses TLS for transit, supports PII redaction, and provides per-session audit logs. Your application also needs its own architecture aligned to HIPAA — encryption at rest, role-based access controls, audit logging, retention policies — to meet your obligations end-to-end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can a telehealth voice agent diagnose patients?
&lt;/h3&gt;

&lt;p&gt;No. A telehealth triage voice agent should never diagnose, prescribe, or provide clinical decisions. Its role is to capture symptoms, run a defined triage protocol developed by licensed clinicians, score severity, and route the patient to the appropriate care level — emergency, urgent care, scheduled visit, or self-care. A human clinician (nurse, physician, NP) makes the final clinical decision. The system prompt should explicitly forbid diagnostic statements ("you have..." — never; "your symptoms suggest..." — only when leading into a routing decision).&lt;/p&gt;

&lt;h3&gt;
  
  
  How does the Voice Agent API handle medical terminology?
&lt;/h3&gt;

&lt;p&gt;The STT layer under the Voice Agent API is Universal-3 Pro Streaming, which performs well on conversational medical terminology like medication names and common conditions. For billing-grade clinical documentation — SOAP notes, ICD-10 candidate coding, structured chart entries — AssemblyAI's separate &lt;a href="https://www.assemblyai.com/docs/async-stt/medical-mode" rel="noopener noreferrer"&gt;Medical Mode&lt;/a&gt; async API is purpose-built for clinical accuracy. Enable it with domain="medical-v1" on a pre-recorded transcript request. The common architecture is: real-time triage on the Voice Agent API, post-call documentation through Medical Mode async, both under the same BAA.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens when the agent encounters a red flag?
&lt;/h3&gt;

&lt;p&gt;When the agent detects a red flag — cardiac symptoms (chest pain, arm pain, shortness of breath), stroke symptoms (facial drooping, slurred speech, weakness), surgical symptoms (severe abdominal pain), sepsis indicators (high fever with stiff neck), or mental health emergencies (suicidal ideation) — it should immediately call escalate_to_nurse with the reason, tell the patient "These symptoms need immediate attention. I'm connecting you to our on-call nurse right now," and hand off the call along with the captured symptoms. Red-flag escalation must be automatic, not conditional. Never let the agent continue triaging after a red flag is captured.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the difference between this and a healthcare scheduling voice agent?
&lt;/h3&gt;

&lt;p&gt;A healthcare scheduling voice agent books appointments, verifies insurance, and handles prescription refills — administrative tasks where the worst-case error is a rescheduled appointment. A telehealth triage voice agent captures clinical symptoms and routes to care levels — clinical tasks where the worst-case error is a missed cardiac event. The two have different risk profiles, different prompts, different tools, and different review processes. A team building both should keep them as separate agents with separate audit trails. Our &lt;a href="https://www.assemblyai.com/blog/voice-agents-healthcare" rel="noopener noreferrer"&gt;healthcare voice agents guide&lt;/a&gt; covers the scheduling/administrative side.&lt;/p&gt;

</description>
      <category>voiceai</category>
      <category>ai</category>
      <category>healthcare</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
