<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mart Schweiger</title>
    <description>The latest articles on DEV Community by Mart Schweiger (@martschweiger).</description>
    <link>https://dev.to/martschweiger</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3802221%2Fcdb4c7a2-d4f4-444d-908e-30d6ea3bd1a7.png</url>
      <title>DEV Community: Mart Schweiger</title>
      <link>https://dev.to/martschweiger</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/martschweiger"/>
    <language>en</language>
    <item>
      <title>Why Streaming Transcription Drifts to English — and How to Fix It</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:26:42 +0000</pubDate>
      <link>https://dev.to/martschweiger/why-streaming-transcription-drifts-to-english-and-how-to-fix-it-30ll</link>
      <guid>https://dev.to/martschweiger/why-streaming-transcription-drifts-to-english-and-how-to-fix-it-30ll</guid>
      <description>&lt;p&gt;You built a multilingual voice product, tested it on Spanish audio, and it worked. Then it hit production traffic and started handing back English. A caller says "necesito ayuda con mi factura" and the transcript reads "I need help with my invoice" — translated, not transcribed — or worse, a phonetic English mush that means nothing downstream.&lt;/p&gt;

&lt;p&gt;If you're evaluating streaming speech-to-text for a multilingual product, this is the failure mode that slips past clean test audio and quietly hurts you in production. It's not random, and it's not unfixable. Here's why streaming models drift to English, and how to steer them back.&lt;/p&gt;

&lt;h2&gt;
  
  
  The drift is a confidence problem, not a language problem
&lt;/h2&gt;

&lt;p&gt;Most multilingual streaming models can transcribe your target languages. The trouble is what they do when they're &lt;em&gt;unsure&lt;/em&gt; — and streaming makes them unsure far more often than batch processing does.&lt;/p&gt;

&lt;p&gt;A pre-recorded model reads the whole file before committing. A streaming model has to decide what it heard within a few hundred milliseconds, working from a short rolling window of audio. Less evidence per decision means more uncertainty, and uncertainty is exactly where the default kicks in.&lt;/p&gt;

&lt;p&gt;That default is usually English. Most ASR training data skews heavily English, so when a model can't confidently place a sound in Spanish or German, the safest bet — statistically — is English. The model isn't broken. It's guessing, and its prior says English.&lt;/p&gt;

&lt;p&gt;Three situations push a streaming model into that low-confidence zone over and over:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short utterances.&lt;/strong&gt; "Sí." "Vale." "Mmhmm." A one-word turn gives the detector almost nothing to work with, so it falls back to its prior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code-switching boundaries.&lt;/strong&gt; When a speaker drops an English word into a Spanish sentence — "mándame el invoice" — a model that detects language per turn can latch onto the English token and flip the whole turn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noise and accents.&lt;/strong&gt; 8kHz phone audio, background chatter, or an underrepresented regional accent all lower confidence, and low confidence trends toward English.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the drift isn't your audio being "too multilingual." It's the model hitting uncertainty and resolving it the wrong way.&lt;/p&gt;
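
&lt;p&gt;To make the "prior wins under uncertainty" idea concrete, here is a toy sketch using Bayes' rule. It is an illustration only, not how any particular ASR model is implemented: the 70/30 prior and the likelihood values are made-up numbers, and language_posterior is a hypothetical helper.&lt;/p&gt;

```python
def language_posterior(likelihoods, priors):
    """Combine acoustic evidence with a language prior using Bayes' rule."""
    unnormalized = {lang: likelihoods[lang] * priors[lang] for lang in priors}
    total = sum(unnormalized.values())
    return {lang: score / total for lang, score in unnormalized.items()}


# Hypothetical prior reflecting English-heavy training data.
PRIORS = {"en": 0.7, "es": 0.3}

# A long Spanish utterance: the audio strongly favors Spanish.
long_turn = language_posterior({"en": 0.05, "es": 0.95}, PRIORS)

# A one-word turn such as "Sí": the audio barely separates the two languages.
short_turn = language_posterior({"en": 0.45, "es": 0.55}, PRIORS)

print(max(long_turn, key=long_turn.get))    # es
print(max(short_turn, key=short_turn.get))  # en
```

&lt;p&gt;With strong acoustic evidence the Spanish reading wins despite the prior. With a one-word turn the evidence is nearly flat, so the English-heavy prior decides the outcome.&lt;/p&gt;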

&lt;h2&gt;
  
  
  Why "just force the language" usually backfires
&lt;/h2&gt;

&lt;p&gt;The obvious instinct is to pin the language down — hard-code one language and be done with it. Sometimes that helps. Often it makes things worse, for one big reason: hard-coding one language breaks the moment a real conversation code-switches. Your Spanish caller says "my tracking number is..." and a single-language lock either drops it or mangles it. Real bilingual speech doesn't stay in one lane, so forcing one lane fights the audio.&lt;/p&gt;

&lt;p&gt;The nuance worth getting right: there's a difference between &lt;em&gt;blindly&lt;/em&gt; forcing a language and &lt;em&gt;correctly telling the model what you already know.&lt;/em&gt; If your audio genuinely is monolingual — a support line in Osaka that runs in Japanese, a clinic intake in Madrid that runs in Spanish — committing the model to that language is now the recommended way to steer (more on the language_code parameter below). The mistake is forcing a single language onto audio that actually mixes languages. For that, you steer with context and let the model code-switch.&lt;/p&gt;

&lt;p&gt;The fix isn't a bigger hammer. It's matching the model to your language reality and then giving its detector the signals it needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to fix language steering
&lt;/h2&gt;

&lt;p&gt;Think of language steering as five levers, in rough order of impact.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Match the model to how your users actually speak
&lt;/h3&gt;

&lt;p&gt;This is the highest-leverage decision, and it's where most drift gets solved before it starts. AssemblyAI gives you streaming paths that behave differently on multilingual audio:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.assemblyai.com/blog/universal-3-5-pro-realtime" rel="noopener noreferrer"&gt;&lt;strong&gt;Universal-3.5 Pro Realtime&lt;/strong&gt;&lt;/a&gt; (universal-3-5-pro) — native code-switching across &lt;strong&gt;18 languages&lt;/strong&gt; in a single stream: English, Spanish, French, German, Italian, Portuguese, Arabic, Danish, Dutch, Hebrew, Hindi, Japanese, Mandarin, Vietnamese, Finnish, Norwegian, Swedish, and Turkish. It treats a mid-sentence switch — Hinglish included — as ordinary speech instead of a language to re-detect, which is exactly the behavior that prevents drift on bilingual calls. This is the model that holds the line, and it's the new default for realtime transcription.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Universal-Streaming Multilingual&lt;/strong&gt; (universal-streaming-multilingual) — a cost-effective path that covers a smaller set of languages with per-turn language detection. Per-turn detection is fine when speakers change languages &lt;em&gt;between&lt;/em&gt; turns, but it's more prone to flipping on intra-sentence switches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper-Streaming&lt;/strong&gt; (whisper-rt) — the 99+ language fallback when your languages fall outside the core set. Automatic language detection is built in and mandatory here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Picking a per-turn model for an intra-sentence code-switching product is the single most common cause of drift we see. Match the model first.&lt;/p&gt;
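
&lt;p&gt;The decision above can be written down as a small routing helper. This is a sketch under stated assumptions: the model identifiers come from the list above, but choose_speech_model, the ISO 639-1 codes in CORE_LANGUAGES, and the budget_path_languages argument are hypothetical. The exact language set for Universal-Streaming Multilingual is not given here, so the caller has to supply it.&lt;/p&gt;

```python
# Assumed ISO 639-1 codes for the 18 languages listed above (the codes are an assumption).
CORE_LANGUAGES = frozenset({
    "en", "es", "fr", "de", "it", "pt", "ar", "da", "nl",
    "he", "hi", "ja", "zh", "vi", "fi", "no", "sv", "tr",
})


def choose_speech_model(languages, mid_sentence_switching, budget_path_languages=frozenset()):
    """Pick a streaming model identifier from the expected language mix."""
    langs = set(languages)
    # Anything outside the core set goes to the broad-coverage fallback.
    if not langs.issubset(CORE_LANGUAGES):
        return "whisper-rt"
    # Per-turn detection is only a fit when switches happen between turns.
    if langs and not mid_sentence_switching and langs.issubset(budget_path_languages):
        return "universal-streaming-multilingual"
    # Default: native code-switching in a single stream.
    return "universal-3-5-pro"


print(choose_speech_model({"es", "en"}, mid_sentence_switching=True))
```

&lt;p&gt;The ordering is the design choice: coverage is checked first, then switching behavior, so a product with intra-sentence switching never lands on the per-turn path by accident.&lt;/p&gt;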

&lt;h3&gt;
  
  
  2. Tell the model the language when you actually know it
&lt;/h3&gt;

&lt;p&gt;Universal-3.5 Pro Realtime runs in multilingual mode by default, which is the right behavior when you don't know what's coming. But most production calls aren't a guessing game. When you know the session is monolingual, pass the new language_code connection parameter. It commits the model to one language instead of asking it to detect one, which is the cleanest way to head off the wrong-language slips that creep in on short or ambiguous audio. This is now the recommended way to bias toward a single language.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;StreamingParameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;sample_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;speech_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;universal-3-5-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;language_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;es&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# commit to Spanish when you know the call is monolingual
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Omit language_code and you keep full multilingual code-switching. For calls that genuinely mix languages, leave it off and steer with context instead — that's what the next lever is for.&lt;/p&gt;
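
&lt;p&gt;One way to keep that rule out of ad hoc call sites is to derive the connection settings from what you know about the session. A minimal sketch, assuming the parameter names shown in the snippet above; build_streaming_kwargs is a hypothetical helper, and it returns a plain dict rather than calling any SDK.&lt;/p&gt;

```python
def build_streaming_kwargs(expected_languages, sample_rate=16000):
    """Set language_code only when exactly one language is expected."""
    kwargs = {"sample_rate": sample_rate, "speech_model": "universal-3-5-pro"}
    unique = sorted(set(expected_languages))
    if len(unique) == 1:
        # Monolingual session: commit the model to that language.
        kwargs["language_code"] = unique[0]
    # Unknown or mixed languages: leave language_code off and keep code-switching.
    return kwargs


print(build_streaming_kwargs(["es"]))
print(build_streaming_kwargs(["es", "en"]))
```

&lt;p&gt;The resulting dict could then be unpacked into the parameters object, for example StreamingParameters(**build_streaming_kwargs(["es"])), if your SDK version accepts those fields.&lt;/p&gt;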

&lt;h3&gt;
  
  
  3. Give the model the conversation as context
&lt;/h3&gt;

&lt;p&gt;A big share of low-confidence drift comes from the model hearing each moment cold, with no sense of what came before. Universal-3.5 Pro Realtime addresses that in two ways. First, it keeps a short, rolling memory of the conversation (Context Carryover) and uses it as context for whatever comes next; this is on by default, with nothing to configure. Second, for voice agents, you can pass the agent's own question in with agent_context, so a mumbled or short reply is interpreted in light of what was just asked. More context per decision means fewer of the uncertain moments that resolve toward English. You can also describe the audio with the prompt parameter to prime the model for a noisy line or a specific domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Anchor vocabulary with keyterms — and hear the speaker, not the room
&lt;/h3&gt;

&lt;p&gt;Drift often shows up first on the words that matter most: names, product terms, account types, medication names. Universal-3.5 Pro Realtime includes keyterms prompting at no extra cost, and it applies across all supported languages at once. Feeding the model your domain vocabulary keeps those terms anchored in the right language instead of getting "Englished" the moment confidence dips. And because background speech — a TV, a passenger — gets transcribed as phantom words that drag confidence down, voice_focus isolates the primary speaker and suppresses the rest (use near_field for headsets and phones, far_field for rooms and kiosks).&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Give the detector room to work
&lt;/h3&gt;

&lt;p&gt;A lot of self-inflicted drift comes from starving the model of context. If you've cranked silence thresholds down to chase latency, you're chopping audio into fragments too short to place a language confidently. Leave enough audio per turn for the model to commit, and you'll see fewer fallbacks — especially on the short acknowledgments that trip every system. Our &lt;a href="https://www.assemblyai.com/blog/real-time-transcription-python" rel="noopener noreferrer"&gt;real-time transcription guide&lt;/a&gt; shows a working streaming setup you can adapt to test these settings.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "fixed" looks like in production
&lt;/h2&gt;

&lt;p&gt;When the model matches your audio and has the signals it needs, the drift stops being a coin flip. A bilingual support call stays bilingual in the transcript. The same speaker keeps producing Spanish when they speak Spanish and English when they speak English — including mid-sentence — and your downstream intent detection, routing, and analytics stop choking on phantom English.&lt;/p&gt;

&lt;p&gt;Universal-3.5 Pro Realtime does this without trading away responsiveness: end-of-turn detection reads tonality, pacing, and rhythm rather than silence alone and completes in around 300ms, and the model posts a market-leading 6.99% pooled word error rate on Pipecat's open benchmark of real agent conversations. You don't have to choose between "stays in the right language" and "fast enough for a voice agent." If you're benchmarking options, our guide on &lt;a href="https://www.assemblyai.com/blog/how-to-evaluate-speech-recognition-models" rel="noopener noreferrer"&gt;how to evaluate speech recognition models&lt;/a&gt; walks through testing this properly, and the &lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;Streaming Speech-to-Text API docs&lt;/a&gt; cover the configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;English drift isn't a quirk of your audio — it's a model resolving uncertainty toward its English-heavy prior, and streaming creates that uncertainty constantly. You fix it by steering, not guessing: match the model to your real language mix, tell it the language with language_code when you actually know it, feed it conversation context, anchor your vocabulary with keyterms, and give the detector enough audio to commit. Blindly pin a single language onto mixed audio and you'll win the demo and lose production. Steer instead, and the messy, code-switched, real-world call stays in the language it was actually spoken in.&lt;/p&gt;

&lt;p&gt;If multilingual reliability is make-or-break for your product, test with the audio that actually breaks things: short turns, accents, mid-sentence switches, and phone-quality lines. That's the audio that tells you whether a model steers or drifts.&lt;/p&gt;
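
&lt;p&gt;A simple way to run that test is to measure how often a non-English turn comes back as English, split by turn length. The sketch below assumes you already have per-turn records with an expected language, a detected language (hand-labeled or produced by a language-ID tool of your choice), and a duration in seconds. The record shape, the one-second threshold, and drift_rates are all hypothetical.&lt;/p&gt;

```python
from collections import defaultdict


def drift_rates(turns, short_turn_seconds=1.0):
    """Return the share of non-English turns transcribed as English, per length bucket."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [drifted, total]
    for turn in turns:
        bucket = "long" if turn["duration"] >= short_turn_seconds else "short"
        drifted = turn["expected"] != "en" and turn["detected"] == "en"
        counts[bucket][0] += int(drifted)
        counts[bucket][1] += 1
    return {bucket: drifted / total for bucket, (drifted, total) in counts.items()}


sample_turns = [
    {"expected": "es", "detected": "en", "duration": 0.4},
    {"expected": "es", "detected": "es", "duration": 0.6},
    {"expected": "es", "detected": "es", "duration": 3.2},
    {"expected": "de", "detected": "de", "duration": 4.1},
]
print(drift_rates(sample_turns))  # {'short': 0.5, 'long': 0.0}
```

&lt;p&gt;Splitting by duration matters because an overall average can look healthy while the short acknowledgments drift badly.&lt;/p&gt;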

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why does my streaming transcription keep switching to English on non-English audio?&lt;/strong&gt; Streaming models work from short audio windows, so they hit low-confidence moments often — short utterances, code-switching boundaries, noise, and accents. When confidence drops, most models fall back to an English-heavy statistical prior because that's what dominates ASR training data. The fix is to match the model to your language mix and feed it better signals — the language when you know it, plus conversation context — rather than guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should I force a single language to stop the drift?&lt;/strong&gt; Only when your audio really is monolingual. In that case, pass language_code to commit Universal-3.5 Pro Realtime to one language — that's the recommended way to steer. But hard-coding one language onto audio that actually code-switches breaks the moment a speaker mixes languages, so for genuinely bilingual calls you leave it off and steer with context instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which AssemblyAI model is best for multilingual audio that code-switches mid-sentence?&lt;/strong&gt; Universal-3.5 Pro Realtime (universal-3-5-pro). It supports native code-switching across 18 languages in one stream — including Hinglish and other mixed-language speech — and treats mid-sentence switches as ordinary speech rather than re-detecting language each turn, which is what prevents English drift. For languages outside those 18, Whisper-Streaming (whisper-rt) covers 99+ languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the difference between native code-switching and per-turn language detection?&lt;/strong&gt; Native code-switching transcribes language changes as they happen, including inside a sentence. Per-turn detection only identifies the language at a turn boundary, so a single English word inside a Spanish sentence can flip the whole turn to English. For natural bilingual conversation, native code-switching is the one that resists drift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I reduce language drift in production?&lt;/strong&gt; Pick the model that matches your audio (Universal-3.5 Pro Realtime for speech that code-switches mid-sentence), set language_code for monolingual sessions, and lean on the model's rolling conversation memory and agent_context so each turn is decided with context rather than cold. Pair that with keyterms prompting to keep domain vocabulary anchored in the correct language, and voice_focus to stop background speech from dragging confidence down.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>speechtotext</category>
      <category>python</category>
    </item>
    <item>
      <title>Keyterm Prompting: Boost Names, Jargon &amp; Product Terms</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:26:17 +0000</pubDate>
      <link>https://dev.to/martschweiger/keyterm-prompting-boost-names-jargon-product-terms-j34</link>
      <guid>https://dev.to/martschweiger/keyterm-prompting-boost-names-jargon-product-terms-j34</guid>
      <description>&lt;p&gt;Your voice agent nails 98% of the conversation and then fumbles the one word that actually mattered — the caller's last name, your product SKU, the medication. "Metoprolol" comes back as "metoprolal." "Byrne-Donoghue" becomes "Byrne Donahue." The transcript looks great in the demo and falls apart on the exact tokens your downstream logic depends on.&lt;/p&gt;

&lt;p&gt;If you're evaluating real-time speech-to-text, this is the accuracy problem that decides whether the model is usable in production. And it's the one generic benchmarks hide, because they average it away. Here's the lever that fixes it — keyterm prompting — how it actually works, and how to use it without making things worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the words that matter are the hardest to get right
&lt;/h2&gt;

&lt;p&gt;It's not random which words a model misses. Domain-specific terms fail at roughly 3–5x the rate of general speech in off-the-shelf models, and the reason is training data. Common English words show up millions of times; your company name, your SKUs, a clinician's surname, an alphanumeric policy ID — those barely appear, if at all. The model has no strong prior for them, so the moment audio gets noisy or a speaker has an accent, it falls back to a more common word that sounds similar.&lt;/p&gt;

&lt;p&gt;That's the trap: the rare, high-value terms are exactly the ones most exposed to error, and they're the ones a transcript can least afford to get wrong. A misheard filler word costs nothing. A misheard account number breaks the call.&lt;/p&gt;

&lt;h2&gt;
  
  
  What keyterm prompting does
&lt;/h2&gt;

&lt;p&gt;Keyterm prompting is the most direct accuracy lever AssemblyAI gives you for this. You pass a list of the words and phrases that matter (names, brands, jargon, product terms) through the keyterms_prompt parameter, and the model biases toward recognizing them. It's an array of strings, and it works on both &lt;a href="https://www.assemblyai.com/products/speech-to-text" rel="noopener noreferrer"&gt;pre-recorded&lt;/a&gt; and &lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;streaming&lt;/a&gt; transcription. It's part of the broader promptable interface AssemblyAI introduced with &lt;a href="https://www.assemblyai.com/blog/introducing-universal-3-pro" rel="noopener noreferrer"&gt;Universal-3 Pro&lt;/a&gt; and carried into the &lt;a href="https://www.assemblyai.com/blog/universal-3-5-pro-realtime" rel="noopener noreferrer"&gt;Universal-3.5 Pro Realtime&lt;/a&gt; flagship.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
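&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;keyterms_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Kelly Byrne-Donoghue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metoprolol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Universal-3.5 Pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;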
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the whole interface. The interesting part is what happens behind it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it actually works: two boosting stages
&lt;/h2&gt;

&lt;p&gt;This is the part that comes up constantly and rarely gets explained. Streaming keyterm prompting isn't one mechanism — it's two, working in sequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Word-level boosting&lt;/strong&gt; happens live, during inference. The model is biased toward your keyterms as words are emitted, so recognition improves in real time as the audio streams in. This stage is on by default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn-level boosting&lt;/strong&gt; happens after each turn completes. A second pass re-examines the full turn against your keyterms list using &lt;strong&gt;metaphone-based&lt;/strong&gt; matching — phonetic matching, not exact spelling. Metaphone encodes how a word &lt;em&gt;sounds&lt;/em&gt;, so when the model hears "Byrne Donahue" and your keyterm is "Byrne-Donoghue," the phonetic codes line up and the term gets corrected to the spelling you specified. That's why keyterms fix names that are pronounced one way and spelled another. On Universal-3.5 Pro Realtime (universal-3-5-pro), turn-level boosting is always active; on Universal-Streaming English and Multilingual it kicks in when format_turns=true.&lt;/p&gt;

&lt;p&gt;The two stages stack: word-level catches the term as it's spoken, turn-level cleans up anything that slipped through using sound-alike matching. Together they target the precise failure mode — a term the model heard &lt;em&gt;almost&lt;/em&gt; right.&lt;/p&gt;
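
&lt;p&gt;The turn-level idea can be sketched in a few lines. To stay self-contained, this example uses a crude phonetic key (drop vowels and silent "gh", collapse repeated consonants) as a stand-in for Metaphone. It is not the real Metaphone algorithm and not AssemblyAI's implementation; phonetic_key and correct_turn are hypothetical helpers that only show the shape of sound-alike correction.&lt;/p&gt;

```python
import re


def phonetic_key(text):
    """Crude sound-alike key: a simplified stand-in for Metaphone."""
    letters = re.sub(r"[^a-z]", "", text.lower())
    letters = letters.replace("gh", "")  # treat "gh" as silent
    if not letters:
        return ""
    # Keep the first letter, drop vowels and weak consonants from the rest.
    key = letters[0] + re.sub(r"[aeiouyhw]", "", letters[1:])
    return re.sub(r"(.)\1+", r"\1", key)  # collapse repeated letters


def correct_turn(turn, keyterms, max_words=3):
    """Replace the first word span that sounds like each keyterm with its exact spelling."""
    words = turn.split()
    for term in keyterms:
        target = phonetic_key(term)
        match = next(
            (
                (start, size)
                for size in range(max_words, 0, -1)
                for start in range(len(words) - size + 1)
                if phonetic_key(" ".join(words[start:start + size])) == target
            ),
            None,
        )
        if match is not None:
            start, size = match
            words[start:start + size] = [term]
    return " ".join(words)


heard = "this is Kelly Byrne Donahue calling about metoprolal"
print(correct_turn(heard, ["Byrne-Donoghue", "metoprolol"]))
# this is Kelly Byrne-Donoghue calling about metoprolol
```

&lt;p&gt;Both misheard forms reduce to the same key as the keyterm, so they are rewritten to the supplied spelling. A key this coarse would also produce false matches on real traffic, which is one reason to keep keyterm lists short.&lt;/p&gt;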

&lt;h2&gt;
  
  
  Using it in a streaming session
&lt;/h2&gt;

&lt;p&gt;On streaming, keyterms_prompt is set when you open the WebSocket and can be changed mid-stream. Here's the shape with the Python SDK:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;StreamingParameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;sample_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;speech_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;universal-3-5-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;keyterms_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Keanu Reeves&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AssemblyAI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Universal-3.5 Pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few limits worth knowing before you load it up, because they're easy to trip over:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming caps at 100 keyterms per session&lt;/strong&gt;, and each term must be 50 characters or fewer. Exceed the 100-term cap and the request errors; terms longer than 50 characters are ignored. (Pre-recorded is far more generous: up to 1,000 words or phrases, with a maximum of six words each.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On Universal-3.5 Pro Realtime, keyterms are included&lt;/strong&gt; at no extra cost, alongside the model's conversation context. On Universal-Streaming English and Multilingual, keyterm boosting is an add-on.&lt;/li&gt;
&lt;li&gt;You can combine keyterms_prompt with a prompt in the same request — keyterms get appended to your prompt automatically. (For more on shaping output with prompts, see our &lt;a href="https://www.assemblyai.com/blog/universal-3-pro-prompt-engineering" rel="noopener noreferrer"&gt;prompt engineering guide&lt;/a&gt;.)&lt;/li&gt;
&lt;/ul&gt;
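
&lt;p&gt;Since one limit fails loudly and the other fails silently, it can help to check a list on the client before opening the session. A minimal sketch that mirrors the streaming limits stated above; prepare_keyterms is a hypothetical helper, and the deduplication step is an extra convenience rather than documented behavior.&lt;/p&gt;

```python
MAX_TERMS = 100       # streaming cap per session, as stated above
MAX_TERM_CHARS = 50   # per-term length limit, as stated above


def prepare_keyterms(terms):
    """Return (kept, dropped): usable keyterms and terms that exceed the length limit."""
    # Trim whitespace, skip empties, and dedupe while preserving order.
    unique = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    kept, dropped = [], []
    for term in unique:
        (dropped if len(term) > MAX_TERM_CHARS else kept).append(term)
    if len(kept) > MAX_TERMS:
        raise ValueError(f"{len(kept)} keyterms exceeds the limit of {MAX_TERMS}")
    return kept, dropped


kept, dropped = prepare_keyterms(["metoprolol", " metoprolol ", "x" * 51])
print(kept)          # ['metoprolol']
print(len(dropped))  # 1
```

&lt;p&gt;Surfacing the dropped terms is the point: a term that would be silently ignored by the service is easy to miss otherwise.&lt;/p&gt;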

&lt;p&gt;The real unlock for voice agents is &lt;strong&gt;dynamic keyterms&lt;/strong&gt;. You don't have to commit to one list for the whole call. Send an UpdateConfiguration message and swap the terms as the conversation moves:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UpdateConfiguration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"keyterms_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"cardiology"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="s2"&gt;"echocardiogram"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Dr. Patel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"metoprolol"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So when your agent reaches the identity-verification step, you load names and date-of-birth terms; when it moves to a medical intake, you swap in clinical vocabulary. You're priming the model for exactly what it's about to hear, stage by stage — which is one of the most effective ways to lift mid-call accuracy. (Our &lt;a href="https://www.assemblyai.com/blog/real-time-transcription-python" rel="noopener noreferrer"&gt;real-time transcription guide&lt;/a&gt; shows a full working setup, and the &lt;a href="https://www.assemblyai.com/blog/universal-3-5-pro-realtime" rel="noopener noreferrer"&gt;Universal-3.5 Pro Realtime release post&lt;/a&gt; covers dynamic prompting in context.)&lt;/p&gt;
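
&lt;p&gt;Stage-by-stage swapping can be kept in one place with a lookup table. In this sketch the message shape follows the UpdateConfiguration example above, while the stage names, the terms per stage, and update_configuration_message are hypothetical. Actually sending the string over the open WebSocket is left out because it depends on your client.&lt;/p&gt;

```python
import json

# Hypothetical mapping from agent stage to the vocabulary expected next.
STAGE_KEYTERMS = {
    "identity_verification": ["Kelly Byrne-Donoghue", "date of birth"],
    "medical_intake": ["cardiology", "echocardiogram", "Dr. Patel", "metoprolol"],
}


def update_configuration_message(stage):
    """Serialize the UpdateConfiguration payload for the given agent stage."""
    return json.dumps({
        "type": "UpdateConfiguration",
        "keyterms_prompt": STAGE_KEYTERMS[stage],
    })


print(update_configuration_message("medical_intake"))
```

&lt;p&gt;An unknown stage raises KeyError here, which is deliberate: failing at the transition is easier to debug than silently keeping stale terms.&lt;/p&gt;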

&lt;h2&gt;
  
  
  Keyterms aren't your only context lever anymore
&lt;/h2&gt;

&lt;p&gt;Keyterm prompting tells the model which &lt;em&gt;words&lt;/em&gt; to expect. Universal-3.5 Pro Realtime adds two more ways to give it context, and they stack with keyterms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conversation context (Context Carryover)&lt;/strong&gt; is on by default. The model keeps a short, rolling memory of the call and uses it as context for the next turn, so it's no longer deciding each utterance cold. Across a 20,000-file voice agent benchmark, passing context cut word error rate by 10.2%, concentrated exactly where agents hurt: short utterances, names, and entities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;agent_context&lt;/strong&gt; lets a voice agent pass in the question it just asked. Prime the model with "What's your email address?" and a mumbled reply resolves to &lt;a href="mailto:user@assemblyai.com"&gt;user@assemblyai.com&lt;/a&gt; instead of "user at assembly a i dot com." Spelled-out account IDs, addresses, and one-word confirmations — the short utterances that wreck most realtime models — finally have the context to come out right.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use keyterms for the specific high-value vocabulary your application can't lose, and use context to sharpen everything around it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use keyterms without making accuracy worse
&lt;/h2&gt;

&lt;p&gt;More keyterms are not better. Overloading the list, or stuffing it with common words, causes overcorrection and hallucinations, where the model forces a keyterm onto audio that didn't contain it. The guidance that holds up in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start with none.&lt;/strong&gt; Run your real audio first, find the terms the model consistently misses, and add only those. Don't pre-load a dictionary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Match spelling and capitalization exactly&lt;/strong&gt; to the output you want. Keyterms double as a spelling instruction — "Byrne-Donoghue," not "byrne donoghue."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip common words.&lt;/strong&gt; "Information," "account," "today" — the model already handles these, and boosting them just invites false matches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep the list tight.&lt;/strong&gt; A focused set of genuinely hard, genuinely important terms beats a long list every time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it as a scalpel for the specific words your application can't afford to lose, not a blanket over the whole transcript.&lt;/p&gt;
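
&lt;p&gt;The "start with none" advice can be turned into a small selection step over a test set. This sketch assumes you have pairs of reference and model transcripts plus a list of candidate terms; consistently_missed and the min_misses threshold are hypothetical, and matching is plain case-sensitive substring search.&lt;/p&gt;

```python
from collections import Counter


def consistently_missed(candidates, results, min_misses=2):
    """Keep candidate terms that appear in references but are repeatedly absent from model output."""
    misses = Counter()
    for reference, hypothesis in results:
        for term in candidates:
            if term in reference and term not in hypothesis:
                misses[term] += 1
    return [term for term in candidates if misses[term] >= min_misses]


results = [
    ("Kelly Byrne-Donoghue takes metoprolol", "Kelly Byrne Donahue takes metoprolol"),
    ("Ms. Byrne-Donoghue called about her account", "Ms. Byrne Donahue called about her account"),
    ("refill metoprolol today", "refill metoprolal today"),
]
print(consistently_missed(["Byrne-Donoghue", "metoprolol", "account"], results))
# ['Byrne-Donoghue']
```

&lt;p&gt;Here only the surname crosses the threshold. The medication was missed once and the common word never, so neither earns a slot on the list yet.&lt;/p&gt;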

&lt;h2&gt;
  
  
  Where accents fit in
&lt;/h2&gt;

&lt;p&gt;Keyterm prompting is also one of your best levers for accented speech — a question that comes up constantly, especially for heavier accents like Irish, Scottish, or strong regional English. Accents lower the model's confidence on individual words, and low confidence is exactly when a name or technical term slips toward a more common sound-alike. Because turn-level boosting matches phonetically, keyterms are well suited to catch those: the accented pronunciation still maps to the metaphone code of the term you specified.&lt;/p&gt;

&lt;p&gt;That said, keyterms aren't the whole answer for accents. Pair them with the right model — Universal-3.5 Pro Realtime shows consistent improvements on accented speech, and its voice_focus option isolates the primary speaker so a noisy room or a second voice doesn't drag accuracy down further — and, critically, &lt;strong&gt;test with audio that matches your actual users&lt;/strong&gt;. A model's accuracy on a clean American-English benchmark tells you almost nothing about how it handles an Irish caller on a noisy phone line. Boost the names and terms that matter, run your real accents through it, and measure what you actually get. Our guide on &lt;a href="https://www.assemblyai.com/blog/how-to-evaluate-speech-recognition-models" rel="noopener noreferrer"&gt;how to evaluate speech recognition models&lt;/a&gt; covers building that kind of test set, including Missed Entity Rate — the metric that actually captures whether your high-value terms survive.&lt;/p&gt;

&lt;p&gt;Keyterms also aren't your only prompting lever. When the vocabulary you need to boost is unknown or varies call to call, open-field &lt;a href="https://www.assemblyai.com/blog/speech-to-text-prompting-assemblyai-universal-3-pro" rel="noopener noreferrer"&gt;speech-to-text prompting&lt;/a&gt; gives you behavioral control — formatting, verbatim vs. clean, domain context — that a term list can't. On streaming you can run a prompt and keyterms_prompt together; on pre-recorded they're mutually exclusive, so you pick one per request.&lt;/p&gt;

&lt;p&gt;Build accurate transcription into your product&lt;/p&gt;

&lt;p&gt;Get real-time speech-to-text with keyterm prompting, strong accented-speech accuracy, and dynamic mid-call control. Start free and ship in an afternoon.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;Sign up free&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Headline accuracy numbers don't decide whether a transcription model works for you — accuracy on the handful of words your application depends on does. Keyterm prompting is the lever that targets exactly those words, through real-time biasing plus a phonetic cleanup pass that fixes sound-alike errors on names and jargon. Use it surgically: start empty, add the terms the model actually misses, spell them the way you want them, and update them as the conversation moves. Layer in conversation context and agent_context on top, and the demo that fell apart on a customer's name becomes the system that gets it right on the first try — which, for anything past the prototype stage, is the only accuracy that counts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is keyterm prompting in speech-to-text?&lt;/strong&gt; Keyterm prompting is a feature that improves transcription accuracy for specific words and phrases you supply through the keyterms_prompt parameter — names, brands, jargon, product terms, and other domain vocabulary. The model biases toward recognizing those terms, so the high-value words your application depends on are far less likely to be mis-transcribed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I improve recognition accuracy for names, jargon, and product terms?&lt;/strong&gt; Pass them as keyterms. With AssemblyAI you provide an array of terms via keyterms_prompt (up to 100 per streaming session, up to 1,000 for pre-recorded audio), spelled exactly as you want them to appear. The model applies real-time word-level boosting plus a metaphone-based turn-level pass that corrects phonetically similar mistakes — so "Byrne Donahue" becomes "Byrne-Donoghue." On Universal-3.5 Pro Realtime you can stack conversation context and agent_context on top.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can I transcribe audio with heavy accents, like an Irish accent?&lt;/strong&gt; Use a model with strong accented-speech accuracy (Universal-3.5 Pro Realtime), turn on voice_focus to isolate the primary speaker, then add keyterm prompting for the names and terms accents most often distort — phonetic boosting maps the accented pronunciation back to the spelling you specified. Most importantly, test with audio from your actual user demographics rather than clean benchmark clips, since accent accuracy is highly specific to the speakers you serve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does keyterm prompting work in real time?&lt;/strong&gt; Yes. On streaming, keyterms are applied live through word-level boosting as audio arrives, then refined by a turn-level boosting pass after each turn. You can also update the keyterms list mid-stream with an UpdateConfiguration message — useful for voice agents that move through stages (verification, intake, payment) with different vocabulary at each.&lt;/p&gt;
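
&lt;p&gt;As a rough sketch of stage-based updates: the snippet below only builds the JSON text you would send over an already-open streaming socket. The message shape (a type of UpdateConfiguration carrying a keyterms_prompt array) is an assumption based on the names used in this article, and the stage names and terms are hypothetical, so check the streaming API reference before relying on it.&lt;/p&gt;

```python
import json

# Hypothetical per-stage vocabulary; use the terms your model actually misses.
STAGE_KEYTERMS = {
    "verification": ["Byrne-Donoghue", "Ballymun"],
    "intake": ["FlexiCover Plus"],
    "payment": ["IBAN", "SEPA"],
}


def keyterms_update(stage):
    """Build the JSON text for a mid-stream keyterm update (message shape assumed)."""
    return json.dumps({
        "type": "UpdateConfiguration",
        "keyterms_prompt": STAGE_KEYTERMS[stage],
    })


print(keyterms_update("payment"))
```

&lt;p&gt;In a live session you would send the returned string on the WebSocket when the agent moves to the next stage, replacing the previous list.&lt;/p&gt;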

&lt;p&gt;&lt;strong&gt;Can adding too many keyterms hurt accuracy?&lt;/strong&gt; Yes. Overloading the list or including common words can cause overcorrection and hallucinations, where the model forces a keyterm onto audio that didn't contain it. Start with no keyterms, add only the terms the model consistently misses, keep the list focused, and avoid common words the model already handles well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How many keyterms can I use, and do they cost extra?&lt;/strong&gt; Streaming allows up to 100 keyterms per session (50 characters max each); pre-recorded allows up to 1,000 words or phrases (six words max per phrase). Keyterm prompting is included at no extra cost on Universal-3.5 Pro Realtime, alongside conversation context; on Universal-Streaming English and Multilingual it's an add-on.&lt;/p&gt;
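
&lt;p&gt;If you build keyterm lists dynamically, it can help to check them against the streaming limits quoted above before opening a session. This is a minimal sketch: the helper name is hypothetical, and dropping blanks and case-insensitive duplicates is an assumption about what you would want, not documented API behavior.&lt;/p&gt;

```python
MAX_STREAMING_KEYTERMS = 100  # per-session limit quoted above
MAX_KEYTERM_CHARS = 50        # per-term limit quoted above


def validate_streaming_keyterms(terms):
    """Return a cleaned keyterm list, or raise ValueError if it breaks a limit."""
    cleaned = []
    seen = set()
    for term in terms:
        term = term.strip()
        key = term.casefold()
        if not term or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        if len(term) > MAX_KEYTERM_CHARS:
            raise ValueError(f"keyterm longer than {MAX_KEYTERM_CHARS} characters: {term!r}")
        seen.add(key)
        cleaned.append(term)  # keep the first spelling exactly as supplied
    if len(cleaned) > MAX_STREAMING_KEYTERMS:
        raise ValueError(f"{len(cleaned)} keyterms exceeds the limit of {MAX_STREAMING_KEYTERMS}")
    return cleaned


print(validate_streaming_keyterms(["Byrne-Donoghue", " byrne-donoghue ", "", "Universal-3.5 Pro"]))
```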

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>speechtotext</category>
      <category>python</category>
    </item>
    <item>
      <title>Real-Time STT Latency: What "Fast Enough" Means</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:26:10 +0000</pubDate>
      <link>https://dev.to/martschweiger/real-time-stt-latency-what-fast-enough-means-3d7k</link>
      <guid>https://dev.to/martschweiger/real-time-stt-latency-what-fast-enough-means-3d7k</guid>
      <description>&lt;p&gt;Every voice agent evaluation eventually lands on the same fight: latency. Someone pulls up a benchmark, sees one provider faster than another, and declares a winner. The problem is that the two numbers often aren't measuring the same thing — and even when they are, the faster one isn't always the right call.&lt;/p&gt;

&lt;p&gt;If you're choosing a streaming speech-to-text model for a voice agent, latency is the metric most likely to make or break the decision. So it's worth getting precise about what to measure, what the real numbers are, and what "fast enough" actually means — because chasing the lowest number on a chart is how teams end up with a fast agent that mishears every account number.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Fast enough" starts with how humans talk
&lt;/h2&gt;

&lt;p&gt;Here's the target you're actually aiming for. Research on human conversation finds the average gap between turns is around 200 milliseconds — that's the rhythm people expect, and anything much slower starts to feel like a pause, a lag, a robot thinking.&lt;/p&gt;

&lt;p&gt;But that 200ms is the gap between two &lt;em&gt;humans&lt;/em&gt;. A voice agent has to do more in that window. When someone stops talking, the full response loop is speech-to-text, then the LLM deciding what to say, then text-to-speech producing the audio. STT is one slice of that budget — and usually not the biggest one. The LLM is often the long pole.&lt;/p&gt;

&lt;p&gt;That reframes the whole latency question. The goal isn't to make STT as close to zero as possible. It's to make STT fast enough that it isn't the bottleneck, while leaving accuracy intact. A model that shaves 80ms off transcription but drops a digit in a phone number hasn't made your agent faster — it's made it wrong, faster. (If you'd rather not wire up STT, the LLM, and TTS yourself, the &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;Voice Agent API&lt;/a&gt; bundles all three into one connection built on Universal-3.5 Pro Realtime.)&lt;/p&gt;

&lt;h2&gt;
  
  
  The metrics that actually matter
&lt;/h2&gt;

&lt;p&gt;Most latency arguments fall apart because people use one word — "latency" — for several different measurements. Three matter for voice agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time to first token (TTFT)&lt;/strong&gt; is how quickly you get the first partial transcript after speech starts. It's what powers barge-in detection and speculative LLM inference — letting your agent start reasoning before the user finishes. With Universal-3.5 Pro Realtime, the interruption_delay parameter tunes this directly: lower it toward 0 for the earliest possible signal, raise it for fewer, more confident partials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time to complete turn (TTCT)&lt;/strong&gt; is the one that decides how responsive your agent &lt;em&gt;feels&lt;/em&gt;. It's the interval from when the speaker stops talking to when the final transcript segment arrives — the moment your LLM can actually act. Universal-3.5 Pro Realtime's end-of-turn detection reads &lt;em&gt;how&lt;/em&gt; someone speaks — tonality, pacing, rhythm — not just silence, and lands around 300ms. That's the number to anchor on for turn-based conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P50 vs. P95&lt;/strong&gt; is the distinction that separates demos from production. Median (P50) latency tells you the typical case. P95 tells you what one in twenty turns looks like — and in a real call, that tail is where conversations stall, agents talk over people, and the experience falls apart. A model with a great median and an ugly P95 will demo beautifully and frustrate users in production.&lt;/p&gt;
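
&lt;p&gt;A small sketch shows how far apart the two numbers can sit. It uses the nearest-rank percentile method on made-up per-turn latencies (not benchmark data); in practice you would feed in timings logged from your own calls.&lt;/p&gt;

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-turn latencies in milliseconds; two of twenty turns stalled.
turn_latencies_ms = [280, 290, 295, 300, 300, 305, 310, 310, 315, 320,
                     320, 325, 330, 330, 335, 340, 345, 350, 850, 900]

print("P50:", percentile(turn_latencies_ms, 50), "ms")
print("P95:", percentile(turn_latencies_ms, 95), "ms")
```

&lt;p&gt;With these sample values the median comes out at 320 ms while P95 is 850 ms: the typical turn looks fine, and the tail is where the call goes wrong.&lt;/p&gt;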

&lt;p&gt;Benchmark latency on your own audio&lt;/p&gt;

&lt;p&gt;Test real-time transcription speed and accuracy with Universal-3.5 Pro Realtime on your actual call audio. Free account, no credit card.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;Sign up free&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The real numbers
&lt;/h2&gt;

&lt;p&gt;So where does Universal-3.5 Pro Realtime land? Accuracy claims are easy to make and hard to verify, so the honest test is real agent conversations, not clean read speech. On Pipecat's open STT benchmark — measured on actual voice agent audio — Universal-3.5 Pro Realtime posts a &lt;strong&gt;market-leading pooled word error rate of 6.99%&lt;/strong&gt;, against Deepgram Flux at 15.58%, ElevenLabs Scribe v2 at 9.76%, and Google Chirp3 at 9.04%.&lt;/p&gt;

&lt;p&gt;The gap widens on the tokens voice agents actually act on. Entity error rate tells you whether names, places, and numbers survive:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Universal-3.5 Pro Realtime&lt;/th&gt;
&lt;th&gt;Deepgram Flux&lt;/th&gt;
&lt;th&gt;ElevenLabs Scribe v2&lt;/th&gt;
&lt;th&gt;Google Chirp3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Word error rate&lt;/td&gt;
&lt;td&gt;6.99%&lt;/td&gt;
&lt;td&gt;15.58%&lt;/td&gt;
&lt;td&gt;9.76%&lt;/td&gt;
&lt;td&gt;9.04%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Entity error rate&lt;/td&gt;
&lt;td&gt;15.31%&lt;/td&gt;
&lt;td&gt;50.50%&lt;/td&gt;
&lt;td&gt;39.70%&lt;/td&gt;
&lt;td&gt;21.51%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Names&lt;/td&gt;
&lt;td&gt;16.92%&lt;/td&gt;
&lt;td&gt;39.21%&lt;/td&gt;
&lt;td&gt;38.03%&lt;/td&gt;
&lt;td&gt;22.10%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Places&lt;/td&gt;
&lt;td&gt;6.28%&lt;/td&gt;
&lt;td&gt;14.86%&lt;/td&gt;
&lt;td&gt;34.06%&lt;/td&gt;
&lt;td&gt;10.04%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phone numbers&lt;/td&gt;
&lt;td&gt;3.55%&lt;/td&gt;
&lt;td&gt;10.41%&lt;/td&gt;
&lt;td&gt;4.78%&lt;/td&gt;
&lt;td&gt;4.95%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 3.55% phone-number error rate is the difference between an agent that calls the right line back and one that invents a digit. Lower is better on every row, and the full methodology lives on &lt;a href="https://www.assemblyai.com/benchmarks" rel="noopener noreferrer"&gt;/benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's the part that ties latency and accuracy together. At end-of-turn detection around 300ms, transcription isn't your bottleneck — your LLM and TTS will each cost more than that. Shaving STT to 250ms doesn't make the loop feel meaningfully faster, because you've optimized the smallest slice. But losing the customer's policy number does break the call. Fast enough &lt;em&gt;plus&lt;/em&gt; accurate beats fastest-but-wrong every time the conversation contains something that matters — and with this model you no longer trade one for the other.&lt;/p&gt;

&lt;p&gt;See real-time accuracy and speed live&lt;/p&gt;

&lt;p&gt;Drop in call audio and watch Universal-3.5 Pro Realtime handle names, numbers, and turn-taking in real time — no code required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/playground" rel="noopener noreferrer"&gt;Try playground&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to hit your latency target
&lt;/h2&gt;

&lt;p&gt;"Fast enough" is also tunable, which most benchmark comparisons miss entirely. Instead of hand-tuning a stack of low-level flags, Universal-3.5 Pro Realtime lets you pick a &lt;strong&gt;mode&lt;/strong&gt; when you open the WebSocket:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;min_latency&lt;/strong&gt; for the fastest transcripts, when responsiveness is everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;balanced&lt;/strong&gt; (the default) for strong all-around performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;max_accuracy&lt;/strong&gt; for noisy, far-field audio where getting the words right is worth a little more time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From there, the levers that move the finer numbers (our &lt;a href="https://www.assemblyai.com/blog/real-time-transcription-python" rel="noopener noreferrer"&gt;real-time transcription guide&lt;/a&gt; shows how to set them in code):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;interruption_delay&lt;/strong&gt; controls TTFT — how soon the first partial lands. Drop it when you're running speculative inference or aggressive barge-in; raise it for fewer, more confident partials.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;min_turn_silence&lt;/strong&gt; controls how quickly an end-of-turn check fires. Lower means snappier turn completion; the trade-off is that setting it too low can split entities like phone numbers mid-sequence — exactly the kind of accuracy loss that masquerades as a latency win.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speculative inference on partials&lt;/strong&gt; lets your LLM start working off early partials instead of waiting for the final transcript, hiding STT latency behind reasoning you were going to do anyway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pass agent_context.&lt;/strong&gt; The model takes your agent's question as input, so the reply resolves with more context and fewer re-tries. Across a 20,000-file voice agent benchmark, passing agent context cut word error rate by 10.2% — accuracy you get without spending latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Close your sessions.&lt;/strong&gt; Streaming is billed on connection duration, not audio — a separate point from latency, but the same discipline of not leaving the pipe open.&lt;/li&gt;
&lt;/ul&gt;
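
&lt;p&gt;Speculative inference on partials is the least obvious of these levers, so here is a minimal asyncio sketch of the pattern. Everything in it is illustrative: fake_llm stands in for a real LLM call, and the partial and final strings are passed in directly rather than read from a streaming socket. The idea is to start the LLM on the newest partial, cancel it when a newer partial arrives, and reuse the result only if the final transcript matches what was speculated on.&lt;/p&gt;

```python
import asyncio


async def fake_llm(prompt):
    """Stand-in for a real LLM call (hypothetical); sleeps to simulate latency."""
    await asyncio.sleep(0.05)
    return f"reply to: {prompt}"


async def respond(partials, final):
    """Speculate on the latest partial; return (reply, speculation_hit)."""
    task = None
    task_text = None
    for text in partials:
        if task is not None:
            task.cancel()  # a newer partial supersedes the previous speculation
        task_text = text
        task = asyncio.create_task(fake_llm(text))
        await asyncio.sleep(0)  # yield so the speculative call can start
    if task is not None and task_text == final:
        return await task, True  # hit: the LLM work overlapped the end of the turn
    if task is not None:
        task.cancel()
    return await fake_llm(final), False  # miss: pay the full LLM latency


print(asyncio.run(respond(["what is my", "what is my balance"], "what is my balance")))
```

&lt;p&gt;A production version would also need to decide how different a final transcript can be before the speculative reply is thrown away; exact string equality is the simplest possible rule.&lt;/p&gt;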

&lt;p&gt;The other free lever is infrastructure: Universal-3.5 Pro Realtime runs with unlimited concurrency, so your P95 doesn't degrade under load the way rate-limited services do. The latency you measure at one stream is the latency you get at a thousand. &lt;a href="https://www.assemblyai.com/blog/where-voice-agent-stacks-start-showing-their-limits" rel="noopener noreferrer"&gt;Production teams hit this wall&lt;/a&gt; when a model that benchmarked well on a single connection falls over at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't optimize the wrong number
&lt;/h2&gt;

&lt;p&gt;The most common mistake in voice-agent evals is picking the model with the lowest median latency on a slide and moving on. That decision ignores two things that decide real-world quality: the P95 tail that determines whether turns stall under load, and the entity accuracy that determines whether the transcript is worth acting on. For a deeper framework, our guide on &lt;a href="https://www.assemblyai.com/blog/how-to-evaluate-speech-recognition-models" rel="noopener noreferrer"&gt;how to evaluate speech recognition models&lt;/a&gt; covers measuring latency at the percentile and accuracy level you'll actually deploy at, and the &lt;a href="https://www.assemblyai.com/blog/universal-3-5-pro-realtime" rel="noopener noreferrer"&gt;Universal-3.5 Pro Realtime release notes&lt;/a&gt; detail the turn-detection design behind these numbers.&lt;/p&gt;

&lt;p&gt;Set your latency budget from the full loop, not the STT line item. Decide what total response time feels natural for your agent, subtract realistic LLM and TTS costs, and you'll usually find STT has comfortable headroom around the 300ms range. Within that band, the right question stops being "which model is fastest" and becomes "which model is fast enough &lt;em&gt;and&lt;/em&gt; gets the words right." That's the one that wins the eval that matters — the one your users run every time they talk to your agent.&lt;/p&gt;
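
&lt;p&gt;The subtraction is simple enough to keep in a few lines of code next to your measurements. The stage costs below are hypothetical placeholders, not measured figures; only the roughly 300 ms turn-completion number comes from this article.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class LoopBudget:
    total_ms: int  # response time that feels natural for your agent
    llm_ms: int    # realistic time for the LLM to produce a reply
    tts_ms: int    # realistic time for TTS to start producing audio

    def stt_allowance_ms(self):
        """What is left for speech-to-text after the LLM and TTS take their share."""
        return self.total_ms - self.llm_ms - self.tts_ms


# Hypothetical stage costs; measure your own stack before trusting any of these.
budget = LoopBudget(total_ms=1000, llm_ms=450, tts_ms=200)
allowance = budget.stt_allowance_ms()
print("STT allowance:", allowance, "ms")
print("Headroom at a 300 ms turn completion:", allowance - 300, "ms")
```

&lt;p&gt;With these placeholder numbers the allowance is 350 ms, which leaves 50 ms of headroom. If the result goes negative, the fix is usually in the LLM or TTS line, not the STT one.&lt;/p&gt;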

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Latency is the right thing to obsess over for voice agents, but "lowest number on the chart" is the wrong way to obsess over it. Anchor on time to complete turn, watch P95 not just the median, and measure latency and accuracy together — because a transcript that arrives 80ms sooner with a mangled account number is slower in every way that counts. Universal-3.5 Pro Realtime is built for that reality: end-of-turn detection around 300ms to stay out of your response budget's way, and a market-leading 6.99% pooled word error rate so the words it delivers in that time are the right ones. If you're evaluating, &lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;test it on streaming&lt;/a&gt; with your own audio at your own settings — the only benchmark that ever really mattered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is the typical latency for real-time speech-to-text?&lt;/strong&gt; It depends on what you measure. For voice agents, the metric that matters most is time to complete turn (TTCT) — the gap from when a speaker stops to when the final transcript arrives. Universal-3.5 Pro Realtime's end-of-turn detection reads tonality, pacing, and rhythm rather than silence alone and lands around 300ms. Fast streaming models generally aim for the sub-350ms range on turn completion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do latency requirements differ for real-time transcription versus batch?&lt;/strong&gt; Batch (pre-recorded) transcription has no latency constraint — it optimizes purely for accuracy with the full file as context. Real-time streaming has to commit within a few hundred milliseconds from a short rolling window, so it's evaluated on TTFT and TTCT alongside accuracy. Don't compare a batch model's accuracy to a streaming model's; benchmark streaming at the latency you'll actually deploy at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can a voice agent respond in under 500ms?&lt;/strong&gt; The full response loop is STT + LLM + TTS, so sub-500ms end-to-end is tight. Universal-3.5 Pro Realtime's end-of-turn detection lands around 300ms, which by itself leaves only about 200ms for the LLM and TTS. The practical route is to pull the first partial earlier (via interruption_delay) and start LLM inference before the turn completes. Most of a sub-500ms budget is won or lost in the LLM and TTS stages and through speculative inference, not in shaving milliseconds off transcription.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why isn't the lowest-latency model always the best choice for a voice agent?&lt;/strong&gt; Because STT latency is only one slice of the response loop, and the words still have to be right. A model that's faster on median but misses entities — names, numbers, emails — produces wrong actions faster, not better conversations. Evaluate latency (especially P95) and entity accuracy together; on Pipecat's agent-conversation benchmark, Universal-3.5 Pro Realtime leads on both word error rate (6.99%) and entity error rate (15.31%).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the difference between P50 and P95 latency, and which should I track?&lt;/strong&gt; P50 (median) is the typical turn; P95 is the slow one in twenty. For production voice agents, P95 is often more telling, because the tail is where conversations stall and the agent talks over the user. Universal-3.5 Pro Realtime runs with unlimited concurrency, so its tail latency holds under load instead of degrading the way rate-limited services do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I tune Universal-3.5 Pro Realtime for my latency target?&lt;/strong&gt; Start with a mode — min_latency, balanced (default), or max_accuracy — instead of hand-tuning low-level flags. From there, interruption_delay shapes how soon the first partial lands and min_turn_silence shapes how fast a turn completes. Pass agent_context to lift accuracy without spending latency, and run speculative inference on partials to hide STT latency behind LLM reasoning.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>voiceassistant</category>
      <category>speechtotext</category>
      <category>python</category>
    </item>
    <item>
      <title>Build a Voice Agent Without Pipecat or LiveKit</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:25:44 +0000</pubDate>
      <link>https://dev.to/martschweiger/build-a-voice-agent-without-pipecat-or-livekit-1igc</link>
      <guid>https://dev.to/martschweiger/build-a-voice-agent-without-pipecat-or-livekit-1igc</guid>
      <description>&lt;p&gt;If you've started building a voice agent in the last year, you've hit this question fast: do I need Pipecat or LiveKit?&lt;/p&gt;

&lt;p&gt;The internet says yes. Every tutorial reaches for an orchestration framework before it writes a line of agent logic. And for good reason — those frameworks are genuinely excellent, and AssemblyAI ships drop-in plugins for both. If you're already running one, keep running it.&lt;/p&gt;

&lt;p&gt;But "do I need a framework" is the wrong first question. The real question is: &lt;strong&gt;what moves the audio, and what wires the pipeline together?&lt;/strong&gt; Answer those two and the framework question answers itself. A lot of the time, the honest answer is: you don't need one.&lt;/p&gt;

&lt;p&gt;This post is an architecture overview of the framework-free path. We'll cover the four things people actually mean when they ask about going without Pipecat or LiveKit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What a voice pipeline framework actually does for you (and what it costs).&lt;/li&gt;
&lt;li&gt;The transport question everyone conflates — SIP vs. WebSocket.&lt;/li&gt;
&lt;li&gt;Connecting a phone number with Twilio.&lt;/li&gt;
&lt;li&gt;Whether the framework-free architecture holds up at enterprise scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's start with what you'd be removing.&lt;/p&gt;
&lt;h2&gt;
  
  
  What a framework actually does for you
&lt;/h2&gt;

&lt;p&gt;A voice agent has to do a few things in a tight loop, every few hundred milliseconds: turn speech into text, decide what to say, turn that back into speech, and move audio in and out — while handling interruptions when the caller talks over the bot.&lt;/p&gt;

&lt;p&gt;Historically, each of those was a different vendor. Speech-to-text from one provider, an LLM from another, text-to-speech from a third. Something has to sit in the middle and conduct: route partial transcripts to the LLM, stream the LLM's tokens to the TTS engine, manage turn-taking, and handle barge-in. That conductor is what Pipecat and LiveKit Agents give you, plus a transport layer to get audio to and from the user.&lt;/p&gt;

&lt;p&gt;That's real work, and the frameworks do it well. The thing is, it's only &lt;em&gt;necessary&lt;/em&gt; work if your pipeline is actually a multi-vendor pipeline.&lt;/p&gt;

&lt;p&gt;Here's the cost side, because it's easy to miss when you're following a quickstart. A framework is another dependency in your stack — one more thing to deploy, version, monitor, and reason about when something goes wrong at 2am. You're still wiring three model vendors together, so you still own three sets of API keys, three billing relationships, three failure modes, and three latency budgets that stack. The framework hides that complexity. It doesn't remove it.&lt;/p&gt;

&lt;p&gt;So the question becomes: what if the pipeline weren't multi-vendor?&lt;/p&gt;
&lt;h2&gt;
  
  
  The architecture without a framework
&lt;/h2&gt;

&lt;p&gt;Here's the shift. AssemblyAI's &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;Voice Agent API&lt;/a&gt; collapses speech-to-text, the LLM, and text-to-speech into a single WebSocket connection. You stream audio in, you get audio out. Turn detection, interruption handling, and tool calling happen inside that one connection.&lt;/p&gt;

&lt;p&gt;When the whole pipeline lives behind one API, there's nothing left to orchestrate. Your "conductor" becomes a single open socket.&lt;/p&gt;

&lt;p&gt;Compare the two topologies.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;With a framework &lt;span class="o"&gt;(&lt;/span&gt;cascading, multi-vendor&lt;span class="o"&gt;)&lt;/span&gt;:

         ┌──────────── your orchestration framework ────────────┐
Caller ─▶│  transport ─▶ STT vendor ─▶ LLM vendor ─▶ TTS vendor │─▶ Caller
         └──────────────────────────────────────────────────────┘
          &lt;span class="o"&gt;(&lt;/span&gt;you deploy, scale, and monitor this whole box&lt;span class="o"&gt;)&lt;/span&gt;

Without a framework &lt;span class="o"&gt;(&lt;/span&gt;single API&lt;span class="o"&gt;)&lt;/span&gt;:

Caller ─▶ thin transport relay ─▶ Voice Agent API ─▶ Caller
                                  &lt;span class="o"&gt;(&lt;/span&gt;STT + LLM + TTS, one socket&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Voice Agent API runs on &lt;a href="https://www.assemblyai.com/blog/universal-3-5-pro-realtime" rel="noopener noreferrer"&gt;Universal-3.5 Pro Realtime&lt;/a&gt; for the speech layer — the model that leads Pipecat's agent-conversation benchmark on both word error rate (6.99%) and entity accuracy, and nails short-utterance handling ("yes," "no," "mmhmm"). It takes the agent's question as context, keeps a rolling memory of the conversation, isolates the speaker from the room, and runs across 18 languages with native code-switching. It's framed as invisible infrastructure on purpose: one connection, one bill, one set of logs. It runs around one second end-to-end, and because there's no SDK to adopt, it works with coding agents like Claude Code out of the box.&lt;/p&gt;

&lt;p&gt;What you build instead of an orchestration layer is a thin relay: take audio from wherever your user is, forward it to the WebSocket, and play back what comes off it. That's it. Let's wire it up — first in code, then over the phone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connect from your own server
&lt;/h3&gt;

&lt;p&gt;Server-side and native clients connect directly with your API key in the Authorization header. Create an agent once with a single REST call, then connect to it by agent_id:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://agents.assemblyai.com/v1/agents &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: &lt;/span&gt;&lt;span class="nv"&gt;$ASSEMBLYAI_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "name": "Support Assistant",
    "system_prompt": "You are a friendly support agent. Keep replies short and natural.",
    "greeting": "Hey there, what can I help with?",
    "voice": { "voice_id": "ivy" }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That returns an id. Now open the WebSocket and bind to it. The agent's stored prompt, voice, and tools load automatically, so you don't resend them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;websockets&lt;/span&gt;

&lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wss://agents.assemblyai.com/v1/ws&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;AGENT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;7ad24396-b822-4dca-871a-be9cc4781cf9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# from the create call above
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;websockets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extra_headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Bind to the stored agent; no inline config needed.
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session.update&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AGENT_ID&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;}))&lt;/span&gt;

        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session.ready&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ready:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="c1"&gt;# start streaming input.audio frames here
&lt;/span&gt;            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript.agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reply.audio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;pcm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# play this
&lt;/span&gt;            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session.error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice there's no pipeline graph, no service registry, no turn-detection plugin to configure. The events — session.ready, transcript.agent, reply.audio, input.speech.started for barge-in — are the whole protocol, and it's identical across every transport. For the browser, you'd mint a short-lived token server-side and open wss://agents.assemblyai.com/v1/ws?token= so your key never ships to the client. Same session.update, same events. The full browser walkthrough is in the &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api" rel="noopener noreferrer"&gt;Voice Agent API quickstart&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;See real-time Voice AI in action&lt;/p&gt;

&lt;p&gt;Drop in call audio and watch Universal-3.5 Pro Realtime handle names, numbers, and turn-taking in real time — no code required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/playground" rel="noopener noreferrer"&gt;Try playground&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The transport question: SIP vs. WebSocket
&lt;/h2&gt;

&lt;p&gt;Here's where most of the confusion lives, so let's settle it.&lt;br&gt;
Audio has to physically move between the user and your agent. There are three common ways to move it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WebRTC&lt;/strong&gt; — the browser-and-app real-time standard. It handles NAT traversal, jitter buffering, and echo cancellation. This is LiveKit's and Daily's home turf, and it's the right tool for multi-party rooms, video, and rich client SDKs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SIP&lt;/strong&gt; — the signaling protocol of the telephone network. If you're terminating raw PSTN calls or running your own telephony infrastructure, you're in SIP territory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket&lt;/strong&gt; — a plain, bidirectional socket. No media server, no SFU, no signaling dance. You send audio frames, you receive audio frames.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AssemblyAI takes audio over a &lt;strong&gt;WebSocket&lt;/strong&gt;. That's the entire transport story for the STT layer and the Voice Agent API alike.&lt;/p&gt;

&lt;p&gt;"But my users are on the phone — don't I need SIP?" This is the part people get wrong. You don't speak SIP yourself, and you don't stand up a WebRTC media server either. Your telephony provider does that for you. Twilio terminates the PSTN call, handles the SIP side, and hands you the call's audio over — you guessed it — a WebSocket. So the agent is WebSocket on the user side and WebSocket on the AssemblyAI side. It's sockets all the way through.&lt;/p&gt;

&lt;p&gt;That's why you can skip the framework's transport layer for telephony: the carrier already converted the hard part into a WebSocket before it reaches you. Your job is just to bridge two sockets.&lt;/p&gt;
&lt;h2&gt;
  
  
  Connecting a phone number with Twilio
&lt;/h2&gt;

&lt;p&gt;Let's make the phone case concrete, because it's the one that sounds like it should need the most machinery and actually needs the least.&lt;br&gt;
The end-to-end path is three hops between four parties:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Caller ↔ Twilio Media Streams ↔ Your server ↔ Voice Agent API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Twilio handles the phone network. Your server is a thin bridge. The agent handles speech-to-speech. And here's the detail that removes a whole class of pain: Twilio's native G.711 μ-law audio is byte-compatible with the Voice Agent API's audio/pcmu encoding, so your bridge forwards audio &lt;strong&gt;as-is, with zero transcoding or resampling&lt;/strong&gt;.&lt;br&gt;
When a call comes in, Twilio hits a webhook on your server. You return TwiML that opens a Media Streams WebSocket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/twiml&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;callId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;newCallId&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hostname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;HOSTNAME&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^https&lt;/span&gt;&lt;span class="se"&gt;?&lt;/span&gt;&lt;span class="sr"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\/\/&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;streamUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`wss://&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/media-stream/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;callId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text/xml&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`&amp;lt;Response&amp;gt;
  &amp;lt;Connect&amp;gt;
    &amp;lt;Stream url="&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;streamUrl&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" /&amp;gt;
  &amp;lt;/Connect&amp;gt;
&amp;lt;/Response&amp;gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Twilio opens that Media Streams socket, your server opens a parallel connection to the Voice Agent API and configures the session inline — telling it to speak μ-law in both directions to match Twilio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;aaiWs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wss://agents.assemblyai.com/v1/ws&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ASSEMBLYAI_API_KEY&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// send() throws while the socket is still connecting, so configure the session once it opens.&lt;/span&gt;
&lt;span class="nx"&gt;aaiWs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;open&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;aaiWs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;session.update&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are a helpful voice assistant.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;greeting&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hi, thanks for calling. How can I help?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;audio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;encoding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;audio/pcmu&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;audio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;voice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ivy&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;encoding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;audio/pcmu&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="cm"&gt;/* your tool definitions */&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once session.ready fires, the whole bridge is two forwarding rules plus one for interruptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Caller → Agent: each Twilio media event becomes an input.audio event.&lt;/span&gt;
&lt;span class="nx"&gt;tw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;media&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;media&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;track&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;inbound&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nx"&gt;aaiWs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;input.audio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;media&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;aaiWs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;message&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

  &lt;span class="c1"&gt;// Agent → Caller: each reply.audio event becomes a Twilio media action.&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;reply.audio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;tw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;media&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;streamSid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;streamSid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;media&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// Barge-in: caller talks over the agent → clear Twilio's buffer so it stops instantly.&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;input.speech.started&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;tw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;clear&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;streamSid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;streamSid&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
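
&lt;p&gt;If you want to unit-test the bridge without placing a call, the same three rules can be written as pure functions. This is a minimal Python sketch, not the example repo's code: the function names caller_to_agent and agent_to_caller are hypothetical, and the message shapes are the ones used in the snippet above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json


def caller_to_agent(twilio_msg):
    """Map one Twilio Media Streams message to an input.audio event, or None."""
    if twilio_msg.get("event") != "media":
        return None
    media = twilio_msg.get("media", {})
    if media.get("track") != "inbound":
        return None  # only caller audio is forwarded
    # The payload is already base64 mu-law, so it passes through untouched.
    return {"type": "input.audio", "audio": media["payload"]}


def agent_to_caller(agent_event, stream_sid):
    """Map one Voice Agent API event to a Twilio action, or None."""
    kind = agent_event.get("type")
    if kind == "reply.audio" and agent_event.get("data"):
        return {"event": "media", "streamSid": stream_sid,
                "media": {"payload": agent_event["data"]}}
    if kind == "input.speech.started":
        # Barge-in: ask Twilio to drop any audio it still has buffered.
        return {"event": "clear", "streamSid": stream_sid}
    return None  # transcripts and other events are not forwarded to the call


if __name__ == "__main__":
    inbound = {"event": "media", "media": {"track": "inbound", "payload": "/v8="}}
    print(json.dumps(caller_to_agent(inbound)))
    print(json.dumps(agent_to_caller({"type": "input.speech.started"}, "MZ123")))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the mapping separate from the socket handling means the audio path can be checked with plain dictionaries before any real traffic flows.&lt;/p&gt;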



&lt;p&gt;That's a production phone agent. No SIP stack, no media server, no orchestration framework: just a webhook and a socket bridge. AssemblyAI maintains a complete &lt;a href="https://github.com/AssemblyAI/voice-agent-api-twilio-example" rel="noopener noreferrer"&gt;Twilio example repo&lt;/a&gt; with inbound and outbound calling and tool handling wired up, and the full walkthrough lives in &lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api/connect-to-twilio" rel="noopener noreferrer"&gt;Connect to Twilio&lt;/a&gt;. The same pattern carries over to other carriers such as Vonage, as long as they hand you the call's media over a WebSocket the way Twilio does.&lt;/p&gt;

&lt;h2&gt;
  
  
  What if you only want to swap the STT layer?
&lt;/h2&gt;

&lt;p&gt;Not everyone is building net-new. Maybe you already have an LLM and a TTS voice you love, and you only want better transcription — the rip-and-replace-one-piece path. You can do that without a framework too.&lt;br&gt;
Connect straight to the &lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;streaming speech-to-text&lt;/a&gt; WebSocket and keep the rest of your stack exactly as it is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlencode&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;  &lt;span class="c1"&gt;# pip install websocket-client
&lt;/span&gt;
&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="c1"&gt;# Streaming uses speech_model (singular). universal-3-5-pro is the flagship realtime model.
&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;speech_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;universal-3-5-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;ENDPOINT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wss://streaming.assemblyai.com/v3/ws?&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;urlencode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Turn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_of_turn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Final:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# → send to your own LLM
&lt;/span&gt;        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Partial:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SpeechStarted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Barge-in: user started speaking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# → interrupt your TTS
&lt;/span&gt;
&lt;span class="c1"&gt;# Authenticate with the raw key in the header — no "Bearer" prefix.
&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;WebSocketApp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ENDPOINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                            &lt;span class="n"&gt;on_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;on_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_forever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# run_forever() blocks, so stream 16 kHz PCM16 audio in ~50 ms binary frames from an on_open handler or a separate thread.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You hold the loop: end-of-turn transcripts to your LLM, tokens to your TTS, SpeechStarted to trigger barge-in. It's more wiring than the all-in-one Voice Agent API, but it's still a direct socket, and you're still not deploying a framework. If you want a longer build, our &lt;a href="https://www.assemblyai.com/blog/real-time-transcription-python" rel="noopener noreferrer"&gt;real-time transcription in Python&lt;/a&gt; walkthrough goes step by step.&lt;/p&gt;
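
&lt;p&gt;One way to keep that loop tidy is to route each message to a hook you own. The sketch below is an assumption about how you might structure it, not part of any SDK: TurnRouter and its two callbacks are hypothetical names, and only the Turn, end_of_turn, transcript, and SpeechStarted fields come from the snippet above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json


class TurnRouter:
    """Routes streaming STT messages to caller-supplied hooks (hypothetical helper)."""

    def __init__(self, on_final, on_interrupt):
        self.on_final = on_final          # e.g. hand the finished turn to your LLM
        self.on_interrupt = on_interrupt  # e.g. stop TTS playback
        self.latest_partial = ""

    def handle(self, raw_message):
        data = json.loads(raw_message)
        kind = data.get("type")
        if kind == "SpeechStarted":
            self.on_interrupt()
        elif kind == "Turn":
            text = data.get("transcript", "")
            if data.get("end_of_turn"):
                self.latest_partial = ""
                self.on_final(text)
            else:
                self.latest_partial = text  # keep for live captions or UI


if __name__ == "__main__":
    finals, interrupts = [], []
    router = TurnRouter(finals.append, lambda: interrupts.append("stop"))
    for msg in (
        {"type": "SpeechStarted"},
        {"type": "Turn", "transcript": "necesito", "end_of_turn": False},
        {"type": "Turn", "transcript": "Necesito ayuda con mi factura.", "end_of_turn": True},
    ):
        router.handle(json.dumps(msg))
    print(finals, interrupts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You would pass router.handle the message argument from on_message; the LLM and TTS calls stay entirely in your own hooks.&lt;/p&gt;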

&lt;p&gt;A couple of details that trip people up: streaming uses speech_model (singular) — that's the opposite of the pre-recorded API, which takes a speech_models array for fallback routing. And universal-3-5-pro uses punctuation-based turn detection, so end_of_turn_confidence_threshold is a no-op on it. If your session is monolingual, pass language_code to commit the model to one language; leave it off to keep native code-switching.&lt;/p&gt;
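
&lt;p&gt;To make the language pinning concrete, here is a small sketch of building the connection URL both ways. The helper name build_streaming_url is hypothetical, and "es" is an assumed example value for language_code; check the API reference for the codes your model accepts.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from urllib.parse import urlencode

BASE_URL = "wss://streaming.assemblyai.com/v3/ws"


def build_streaming_url(language_code=None, sample_rate=16000):
    """Build the streaming URL; omit language_code to keep code-switching."""
    params = {"speech_model": "universal-3-5-pro", "sample_rate": sample_rate}
    if language_code is not None:
        # Pinning one language commits the model to it for the whole session.
        params["language_code"] = language_code
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_streaming_url())      # multilingual, code-switching session
    print(build_streaming_url("es"))  # monolingual session (assumed code)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;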

&lt;h2&gt;
  
  
  Does this hold up at enterprise scale?
&lt;/h2&gt;

&lt;p&gt;A fair pushback: "single API is fine for a demo, but does it survive production?" This is where collapsing the pipeline actually helps instead of hurts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fewer moving parts, fewer failure modes.&lt;/strong&gt; A three-vendor cascade has three things that can rate-limit you, three that can have an incident, and three latency budgets that compound. One connection has one of each. When you're the one carrying the pager, that math matters more than any benchmark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrency that scales with you.&lt;/strong&gt; Pay-as-you-go accounts start at 100 new streams per minute with no hard cap on concurrent sessions, and capacity scales up automatically: roughly a 10% bump every 60 seconds once you cross 70% utilization. AssemblyAI runs millions of hours of audio and 600M+ inference calls a month, with unlimited concurrency on the platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing you can forecast.&lt;/strong&gt; The Voice Agent API is a flat $4.50/hr that bundles STT, the LLM, and TTS. There's no separate per-vendor metering to model out, which makes capacity planning a spreadsheet instead of a guessing game. (Prefer to bring your own LLM and TTS? The streaming STT layer is billed separately on &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;its own pricing&lt;/a&gt;: $0.45/hr for Universal-3.5 Pro Realtime, with context and keyterm prompting included.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The compliance and residency boxes.&lt;/strong&gt; SOC 2 Type 2, EU data residency via streaming.eu.assemblyai.com at the same price as the US, and a self-hosted option if the audio can't leave your VPC. For teams handling protected health information (PHI), AssemblyAI is a business associate under HIPAA and offers a Business Associate Addendum (BAA) you can sign in minutes, without a sales call.&lt;/p&gt;

&lt;p&gt;None of this is a reason to &lt;em&gt;avoid&lt;/em&gt; a framework — plenty of large deployments run Pipecat or LiveKit happily. It's a reason to know that the framework-free path isn't a toy. For an honest look at where any voice agent architecture starts to strain, see &lt;a href="https://www.assemblyai.com/blog/where-voice-agent-stacks-start-showing-their-limits" rel="noopener noreferrer"&gt;where voice agent stacks start showing their limits&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When you should use Pipecat or LiveKit
&lt;/h2&gt;

&lt;p&gt;Skipping the framework is the right call for a large share of voice agents — especially phone agents, support bots, and anything where the job is a clean speech-to-speech loop. But not all of them. Reach for Pipecat or LiveKit when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You need real WebRTC features&lt;/strong&gt; — multi-party rooms, video alongside voice, screen share, or polished mobile and web client SDKs. This is LiveKit's core strength, and a WebSocket bridge won't replicate it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You want best-of-breed per stage&lt;/strong&gt; — a specific TTS voice, a particular LLM, or a custom turn-detection model, mixed and matched per use case. Frameworks are built for exactly this kind of swapping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You already run one in production.&lt;/strong&gt; If Pipecat or LiveKit is in your stack, don't rip it out. AssemblyAI ships first-class plugins for both, so you can run &lt;a href="https://www.assemblyai.com/blog/universal-3-5-pro-realtime" rel="noopener noreferrer"&gt;Universal-3.5 Pro Realtime&lt;/a&gt; as your STT inside the framework you already trust and get the same accuracy either way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You need fine-grained control&lt;/strong&gt; over the pipeline graph — custom processors, intricate branching, frame-level manipulation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The decision is less "framework vs. no framework" and more "how much of the pipeline do I actually need to control."&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing your path
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;You want to…&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ship a phone or web voice agent fast&lt;/td&gt;
&lt;td&gt;Voice Agent API (direct WebSocket)&lt;/td&gt;
&lt;td&gt;One socket for STT + LLM + TTS; flat $4.50/hr; no framework to deploy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keep your own LLM/TTS, upgrade only STT&lt;/td&gt;
&lt;td&gt;Streaming STT (direct WebSocket)&lt;/td&gt;
&lt;td&gt;Swap one piece; you keep the loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build multi-party rooms or add video&lt;/td&gt;
&lt;td&gt;LiveKit Agents + AssemblyAI plugin&lt;/td&gt;
&lt;td&gt;WebRTC features a socket bridge can't match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mix-and-match vendors with fine control&lt;/td&gt;
&lt;td&gt;Pipecat + AssemblyAI plugin&lt;/td&gt;
&lt;td&gt;Framework-grade pipeline control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add a phone number to any of the above&lt;/td&gt;
&lt;td&gt;Your telephony provider (e.g. Twilio)&lt;/td&gt;
&lt;td&gt;Carrier terminates SIP/PSTN, hands you a WebSocket&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;The voice agent ecosystem defaulted to orchestration frameworks for a sound reason: for years, building an agent meant gluing three vendors together, and that glue is real engineering. Frameworks made the glue manageable.&lt;/p&gt;

&lt;p&gt;But the default quietly assumes the pipeline has to be multi-vendor. The moment one API owns speech-to-text, the LLM, and text-to-speech behind a single WebSocket — and your carrier hands you telephony audio over a WebSocket too — the orchestration layer doesn't get easier. It disappears. What's left is a socket on each side and a few lines to forward audio between them.&lt;/p&gt;

&lt;p&gt;So before you add Pipecat or LiveKit to your next voice agent, ask the two questions that actually matter: what's moving my audio, and is my pipeline really multi-vendor? If the answers are "a WebSocket" and "no," you've already got your architecture.&lt;/p&gt;

&lt;p&gt;Build your voice agent today&lt;/p&gt;

&lt;p&gt;Get a free API key and $50 in credits to ship your first voice agent — one WebSocket for speech-to-text, the LLM, and text-to-speech, at a flat $4.50/hr, with no framework to deploy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;Sign up free&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Do you need Pipecat or LiveKit to build a voice agent?
&lt;/h3&gt;

&lt;p&gt;No. Pipecat and LiveKit orchestrate a multi-vendor pipeline — they wire together separate speech-to-text, LLM, and text-to-speech services and manage turn-taking and transport between them. If a single API handles all three, like AssemblyAI's Voice Agent API does over one WebSocket, there's nothing left to orchestrate. Reach for a framework when you need WebRTC features like multi-party rooms or video, or fine-grained control over each pipeline stage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do voice agents use SIP or WebSocket for audio transport?
&lt;/h3&gt;

&lt;p&gt;Voice agents can move audio over WebSocket, WebRTC, or SIP, and the three solve different problems. AssemblyAI takes audio over a plain WebSocket — no media server or SFU required. For phone calls you don't implement SIP yourself: your telephony provider (such as Twilio) terminates the PSTN and SIP side and hands you the call's audio over a WebSocket, so the whole agent ends up being WebSocket on both ends.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the difference between the AssemblyAI Voice Agent API and a framework like Pipecat or LiveKit?
&lt;/h3&gt;

&lt;p&gt;An orchestration framework connects separate STT, LLM, and TTS vendors and coordinates turn-taking, interruptions, and transport across them. The Voice Agent API replaces those three providers with one WebSocket connection — one bill, one set of logs, and one latency budget instead of three. Frameworks give you per-stage swappability and deep pipeline control; the single API gives you fewer moving parts. You don't have to choose blindly, either: AssemblyAI ships plugins for both Pipecat and LiveKit, so you can run Universal-3.5 Pro Realtime inside whichever framework you already use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can you build a Twilio voice agent without a framework?
&lt;/h3&gt;

&lt;p&gt;Yes. Twilio terminates the phone call and streams G.711 μ-law audio to your server over a Media Streams WebSocket, and your server bridges that audio to the Voice Agent API, which accepts μ-law (audio/pcmu) natively with zero transcoding. The whole bridge is a webhook that returns TwiML plus a few lines that forward audio in each direction — no SIP stack, no media server, and no orchestration framework. AssemblyAI maintains a complete &lt;a href="https://github.com/AssemblyAI/voice-agent-api-twilio-example" rel="noopener noreferrer"&gt;Twilio example repo&lt;/a&gt; with inbound calling, outbound calling, and tool handling.&lt;/p&gt;
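&lt;p&gt;The forwarding step can be sketched as follows. The Twilio message shape (an event field, a streamSid, and a base64-encoded media.payload) follows Twilio's Media Streams format. The function names are illustrative, and the step that sends the decoded bytes on to the Voice Agent API is left out because that message format isn't shown in this post.&lt;/p&gt;

```python
import base64
import json


def twilio_to_mulaw(message):
    """Return raw mu-law bytes from a Twilio 'media' message, else None."""
    event = json.loads(message)
    if event.get("event") != "media":
        return None  # e.g. 'connected', 'start', 'stop'
    return base64.b64decode(event["media"]["payload"])


def mulaw_to_twilio(stream_sid, audio):
    """Wrap agent audio (mu-law bytes) as a Twilio 'media' message."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })


# Round trip with a fake 20 ms frame (160 bytes of mu-law silence at 8 kHz).
frame = b"\xff" * 160
wire = mulaw_to_twilio("MZ-example", frame)  # "MZ-example" is a placeholder ID
assert twilio_to_mulaw(wire) == frame
```

&lt;p&gt;Because both sides speak μ-law, the bridge only unwraps and rewraps base64. There's no resampling or codec work in the hot path.&lt;/p&gt;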

&lt;h3&gt;
  
  
  Does a voice agent without an orchestration framework scale to production?
&lt;/h3&gt;

&lt;p&gt;Yes. AssemblyAI's streaming platform starts at 100 new streams per minute on pay-as-you-go with no hard cap on concurrent sessions, and capacity scales up automatically under load. Collapsing speech-to-text, the LLM, and text-to-speech into one connection actually reduces operational risk at scale — there are fewer vendors to rate-limit you, fewer services that can have an incident, and fewer latency budgets that stack. Pricing is a flat $4.50/hr, with EU data residency and a self-hosted deployment option for teams with strict data requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I build a voice agent without Pipecat or LiveKit?
&lt;/h3&gt;

&lt;p&gt;Create an agent with a single REST call to define its prompt, voice, and tools, then open a WebSocket to wss://agents.assemblyai.com/v1/ws and bind it by agent_id in your first session.update message. From there you stream PCM audio in, play the agent's audio back out, and handle barge-in when the input.speech.started event fires. The same connection and event protocol work from a browser (using a short-lived token), from your own server, or over the phone through Twilio.&lt;/p&gt;
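&lt;p&gt;Here is a minimal sketch of the two client-side pieces that paragraph describes: the opening message and the barge-in reaction. The WebSocket URL, the session.update message, the agent_id binding, and the input.speech.started event are named above. The exact JSON layout (a "type" key and a nested "session" object) and the Playback class are assumptions for illustration, so check the current docs for the real schema.&lt;/p&gt;

```python
import json
from collections import deque

AGENT_WS_URL = "wss://agents.assemblyai.com/v1/ws"  # from the post


def first_message(agent_id):
    """Assumed shape of the opening session.update; confirm against the docs."""
    return json.dumps({"type": "session.update", "session": {"agent_id": agent_id}})


class Playback:
    """Holds agent audio that has been received but not yet played."""

    def __init__(self):
        self.queue = deque()

    def handle_event(self, raw):
        event = json.loads(raw)
        if event.get("type") == "input.speech.started":
            # Barge-in: the caller started talking, so drop pending agent audio.
            self.queue.clear()
            return "interrupted"
        return "ignored"


playback = Playback()
playback.queue.extend([b"chunk-1", b"chunk-2"])
print(playback.handle_event('{"type": "input.speech.started"}'))
print(len(playback.queue))
```

&lt;p&gt;Keeping unplayed audio in a queue you own is what makes barge-in cheap: interrupting the agent is a single clear() call, whatever transport sits underneath.&lt;/p&gt;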

</description>
      <category>ai</category>
      <category>voiceassistant</category>
      <category>python</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Prompting Claude to Build Voice Agents</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:25:37 +0000</pubDate>
      <link>https://dev.to/martschweiger/prompting-claude-to-build-voice-agents-2256</link>
      <guid>https://dev.to/martschweiger/prompting-claude-to-build-voice-agents-2256</guid>
      <description>&lt;p&gt;Ask Claude to build you a voice agent and you'll get working code back in about thirty seconds. That part is easy now. The gap between that first draft and something you'd actually put in front of customers — that's where the prompting craft lives.&lt;/p&gt;

&lt;p&gt;Most of the difference doesn't come from clever wording. It comes from what you put in front of the model before you ask for anything. A coding agent is only as good as its context, and for voice agents the two things that matter most — current API docs and a real sense of how your calls actually go — are exactly the two things the model doesn't have by default.&lt;/p&gt;

&lt;p&gt;So this isn't a list of magic prompts. It's four techniques that consistently move a voice agent from "demo that compiles" to "agent that ships":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ground the agent in live docs before you prompt anything.&lt;/li&gt;
&lt;li&gt;Pick your coding model deliberately — but know what actually moves the needle.&lt;/li&gt;
&lt;li&gt;Keep business context out of the first prompt.&lt;/li&gt;
&lt;li&gt;Feed the model real call transcripts as context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're brand new to this, start with our walkthrough on &lt;a href="https://www.assemblyai.com/blog/how-to-vibe-code-a-voice-agent" rel="noopener noreferrer"&gt;how to vibe code a voice agent&lt;/a&gt; — it covers the basics of what comes out the other side. This post assumes you've done that once and want the version that holds up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ground the agent in live docs first
&lt;/h2&gt;

&lt;p&gt;Here's the failure mode that wrecks more voice agent builds than any other: the model writes confident, clean, completely outdated code.&lt;br&gt;
Voice APIs move fast. Endpoints get versioned, parameters get renamed, whole capabilities get added. AssemblyAI's streaming endpoint moved to wss://streaming.assemblyai.com/v3/ws a while back — but ask a coding agent working from training data alone, and there's a real chance it reaches for the retired v2/realtime URL, writes it perfectly, and hands you something that simply won't connect. The code looks right. It isn't. And you lose an hour learning that the hard way. The same goes for the newest parameters: features like agent_context and conversation carryover on Universal-3.5 Pro Realtime don't exist in any model's training data, so an ungrounded agent can't use the levers that most improve accuracy.&lt;/p&gt;

&lt;p&gt;The fix is to stop letting the model guess. Point it at the live docs before your first real prompt. AssemblyAI publishes a docs MCP server exactly for this — it lets a coding agent query current documentation instead of recalling stale training data. In &lt;a href="https://www.assemblyai.com/blog/why-assemblyais-voice-agent-api-is-designed-for-coding-agents" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;, add it once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add &lt;span class="nt"&gt;--transport&lt;/span&gt; http &lt;span class="nt"&gt;--scope&lt;/span&gt; user assemblyai-docs https://mcp.assemblyai.com/docs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's also a skill that bundles the correct patterns and gotchas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add AssemblyAI/assemblyai-skill &lt;span class="nt"&gt;--global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when you ask Claude to wire up streaming transcription or the &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;Voice Agent API&lt;/a&gt;, it pulls the current endpoint, the current parameters, and the current auth flow. The hallucinated-endpoint class of bug mostly disappears. This one step does more for code quality than any amount of prompt tuning.&lt;/p&gt;
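&lt;p&gt;For the cases that still slip through, a small check in CI can flag the retired URL before it reaches production. This is a sketch, not an official tool: the v2/realtime fragment is the one mentioned earlier in this post, and the function name and the choice to scan only .py files are assumptions.&lt;/p&gt;

```python
import re
from pathlib import Path

# Retired path fragment named in this post; extend the list as endpoints change.
RETIRED_PATTERNS = [re.compile(r"v2/realtime")]


def find_retired_endpoints(root):
    """Return (path, line_number, line) for each use of a retired endpoint."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if any(p.search(line) for p in RETIRED_PATTERNS):
                hits.append((str(path), number, line.strip()))
    return hits
```

&lt;p&gt;Call find_retired_endpoints("src") from a test and fail the build when the list is non-empty. It's a blunt guard, but it turns an hour of debugging a dead connection into a one-line CI error.&lt;/p&gt;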

&lt;p&gt;It's not Claude-specific, either. Any MCP-aware tool — Cursor, Windsurf, Codex — can connect to the same server, and prompt-based builders like Lovable or v0 can take the docs URL (&lt;a href="https://www.assemblyai.com/docs/voice-agents/voice-agent-api" rel="noopener noreferrer"&gt;https://www.assemblyai.com/docs/voice-agents/voice-agent-api&lt;/a&gt;) right in the prompt. Which leads to the question everyone asks next.&lt;/p&gt;

&lt;p&gt;Point your coding agent at the docs&lt;/p&gt;

&lt;p&gt;Grab a free API key, connect the AssemblyAI docs MCP server, and let your agent write current, correct voice AI code instead of guessing from stale training data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;Get your free API key&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude or Codex? The model matters less than you think
&lt;/h2&gt;

&lt;p&gt;Developers ask this constantly: Claude or Codex for building voice agents? It's a fair question, and the honest answer is the less satisfying one — for this task, the coding agent you pick matters less than whether it can see current docs.&lt;/p&gt;

&lt;p&gt;Both are strong. Codex is a capable coding agent and plenty of people ship good work with it. On the Claude side, if you want the most capable model for the kind of long, multi-step, agentic build a voice agent turns into, that's &lt;a href="https://www.assemblyai.com/blog/llm-use-cases" rel="noopener noreferrer"&gt;Claude Opus 4.8&lt;/a&gt; — run it at xhigh effort, which is the default Claude Code uses for coding and agentic work. If you care more about speed and cost on a tighter loop, Claude Sonnet 4.6 is the better trade. We won't pretend to hand you a head-to-head leaderboard here; the relevant comparison shifts every few weeks, and you should run your own task on both.&lt;/p&gt;

&lt;p&gt;But here's the thing that doesn't shift: a voice agent's correctness depends on getting transport, turn detection, and audio formats exactly right, and none of those live in any model's training data at the version you need. Both Claude and Codex write better voice-agent code with the AssemblyAI docs MCP server attached than either does without it. The grounding is the variable that actually controls your outcome. The model is a smaller one.&lt;/p&gt;

&lt;p&gt;For what it's worth, the Voice Agent API was built to work with Claude Code out of the box — one WebSocket, JSON in and out, no SDK to adopt — so if you're starting fresh and undecided, that's the path of least resistance. If you already live in Codex, point it at the same MCP server and keep moving. Either way, ground it first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keep business context out of the first prompt
&lt;/h2&gt;

&lt;p&gt;This one feels backwards, so stick with me.&lt;/p&gt;

&lt;p&gt;When people sit down to build a customer-facing agent, the instinct is to pour everything into the opening prompt: the product catalog, the refund policy, the escalation rules, the twelve edge cases sales keeps hitting. It feels thorough. It produces a mess.&lt;/p&gt;

&lt;p&gt;The problem is that you've asked the model to solve two different problems at once — get the real-time pipeline working &lt;em&gt;and&lt;/em&gt; encode your entire business — before either of you has confirmed the pipeline even runs. You get a sprawling first draft where the architecture and the domain logic are tangled together, and when something breaks (it will), you can't tell whether it's the WebSocket bridge or your refund rules.&lt;/p&gt;

&lt;p&gt;Split it into two phases. The first prompt is about the &lt;em&gt;engineering&lt;/em&gt;, and you should spec that fully and up front: the transport, the model, the latency target, barge-in handling, the STT→LLM→TTS loop. Be specific and complete here — modern models like Opus 4.8 do their best work when the technical task is well-defined in one shot.&lt;/p&gt;

&lt;p&gt;What you leave out is the &lt;em&gt;domain&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Build a phone voice agent on AssemblyAI's Voice Agent API, bridged through Twilio over a WebSocket. Target
~1 second end-to-end latency, handle barge-in, and give me a clean way to plug in a system prompt and tools
later. Don't add any business logic yet — I want a working loop I can call and talk to first.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That last sentence is the whole technique. You get back a skeleton you can actually dial and have a conversation with. You verify the hard part — the real-time plumbing — in isolation. &lt;em&gt;Then&lt;/em&gt; you layer in the business context, against a foundation you already trust. Which is exactly where the next technique comes in, because the best business context isn't written from memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feed Claude real call transcripts as context
&lt;/h2&gt;

&lt;p&gt;Once you've got a working skeleton, the most valuable thing you can hand the model isn't a document describing how your calls &lt;em&gt;should&lt;/em&gt; go. It's a pile of transcripts showing how they &lt;em&gt;actually&lt;/em&gt; go.&lt;/p&gt;

&lt;p&gt;Think about what a real call contains that a spec never will: the exact way customers say their account numbers, the product names they mangle, the three different ways people ask for a refund, where they interrupt, what they say when they're confused. That's the raw material for a system prompt that sounds right, a keyterms list that catches the entities your agent will really hear, and a tool list that matches what customers actually ask for. You can guess at all of it. Or you can read it off real calls.&lt;/p&gt;

&lt;p&gt;You almost certainly have the recordings sitting in your contact center or call platform already. Transcribe them with AssemblyAI's pre-recorded &lt;a href="https://www.assemblyai.com/blog/speech-to-text" rel="noopener noreferrer"&gt;speech-to-text&lt;/a&gt; API — &lt;a href="https://www.assemblyai.com/universal-3-pro" rel="noopener noreferrer"&gt;Universal-3 Pro&lt;/a&gt; for the highest accuracy, with speaker labels so the agent's lines and the caller's lines stay separate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assemblyai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;aai&lt;/span&gt;

&lt;span class="n"&gt;aai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ASSEMBLYAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Universal-3 Pro for accuracy; speaker_labels splits agent vs. caller.
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TranscriptionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;speech_models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;universal-3-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;speaker_labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;transcriber&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Transcriber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Transcribe a folder of real call recordings into plain-text transcripts.
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;makedirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcripts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recordings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transcriber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recordings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Skipped &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;# Speaker-labeled turns read the way a conversation actually flows.
&lt;/span&gt;    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utterances&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;out_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcripts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rsplit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
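    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;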
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wrote &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now point Claude at the transcripts/ folder and let the real conversations do the work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Read these 50 real support call transcripts. From them: (1) draft a system prompt for a voice agent that
handles these calls, in the tone the reps actually use; (2) build a keyterms list of the product names,
account formats, and domain terms that show up, so streaming transcription gets them right; (3) list the
tools this agent would need to resolve these calls without a human.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is grounded in evidence instead of imagination. The keyterms list contains the SKUs your customers actually say, not the ones you remembered. The system prompt mirrors how your best reps actually talk. And because &lt;a href="https://www.assemblyai.com/blog/universal-3-5-pro-realtime" rel="noopener noreferrer"&gt;Universal-3.5 Pro Realtime&lt;/a&gt; lets you update keyterms mid-session — and carries conversation context from turn to turn on its own — you can keep feeding that list back into the live agent as you learn more.&lt;br&gt;
This is the technique that's genuinely worth stealing. Most teams prompt their agent from a blank page. The teams whose agents sound human are reading from the transcript.&lt;/p&gt;
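&lt;p&gt;If you want a quick sanity check on the keyterms Claude proposes, a simple frequency pass over the same transcripts gives you a baseline to compare against. This is a rough heuristic sketch: the pattern (letter-digit codes and capitalized words), the min_count threshold, and the function names are all assumptions, and it will pick up some noise such as sentence-initial words.&lt;/p&gt;

```python
import re
from collections import Counter
from pathlib import Path

# Heuristic: codes that mix letters and digits (e.g. SKUs) or capitalized words.
TERM_PATTERN = re.compile(r"\b(?:[A-Za-z]+-?\d+[A-Za-z0-9-]*|[A-Z][a-z]{2,})\b")


def load_transcripts(folder):
    """Read every .txt transcript in the folder as UTF-8 text."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("*.txt"))]


def candidate_keyterms(texts, min_count=2):
    """Count term-like tokens across transcripts; keep those seen min_count+ times."""
    counts = Counter()
    for text in texts:
        for line in text.splitlines():
            # Drop the "A: " speaker prefix written by the transcription script.
            _, _, spoken = line.partition(": ")
            counts.update(TERM_PATTERN.findall(spoken or line))
    return [term for term, n in counts.most_common() if n >= min_count]
```

&lt;p&gt;Run candidate_keyterms(load_transcripts("transcripts")) and compare the result with the model's list. Terms that appear in both are safe bets; terms only the model found are worth a second look in the source calls.&lt;/p&gt;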

&lt;p&gt;Transcribe your own calls&lt;/p&gt;

&lt;p&gt;Run a real call recording through Universal-3 Pro in the playground and see the transcript accuracy — entities, speaker labels, and all — that your agent's context will be built on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/playground" rel="noopener noreferrer"&gt;Try playground&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The two-phase workflow, end to end
&lt;/h2&gt;

&lt;p&gt;Put the four techniques in order and you get a repeatable way to build:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;What you prompt&lt;/th&gt;
&lt;th&gt;What you feed the model&lt;/th&gt;
&lt;th&gt;What you get&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Setup&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;AssemblyAI docs MCP server + skill&lt;/td&gt;
&lt;td&gt;An agent that writes current, correct API code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phase 1 — skeleton&lt;/td&gt;
&lt;td&gt;The full engineering spec; &lt;strong&gt;no business logic&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Live docs&lt;/td&gt;
&lt;td&gt;A working STT→LLM→TTS loop you can call and verify&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phase 2 — ground it&lt;/td&gt;
&lt;td&gt;"Build the prompt, keyterms, and tools from these"&lt;/td&gt;
&lt;td&gt;Real call transcripts&lt;/td&gt;
&lt;td&gt;A system prompt, keyterms, and tools grounded in real calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tune&lt;/td&gt;
&lt;td&gt;Targeted fixes as you test&lt;/td&gt;
&lt;td&gt;Fresh transcripts of your agent's own calls&lt;/td&gt;
&lt;td&gt;An agent that keeps getting more accurate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice the model brand never appears in that table. Claude or Codex, Opus 4.8 or Sonnet 4.6 — the workflow is the same, and the workflow is what carries the quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;The interesting shift in building voice agents isn't that AI can write the code. It's that the bottleneck moved. It used to be syntax and SDK archaeology. Now it's context — and context is something you control completely.&lt;br&gt;
Stale docs and remembered business rules give you a demo. Live docs and real transcripts give you an agent. Whichever model you prompt, that's the lever. Ground it, spec the engineering cleanly, and let your own calls write the hard part.&lt;/p&gt;

&lt;p&gt;Build your voice agent today&lt;/p&gt;

&lt;p&gt;Get a free API key and $50 in credits, point your coding agent at the AssemblyAI docs, and ship your first voice agent on a single WebSocket — no SDK required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;Get your free API key&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Can you build a voice agent just by prompting Claude?
&lt;/h3&gt;

&lt;p&gt;Yes. Claude — through Claude Code — can scaffold a complete voice agent on AssemblyAI's Voice Agent API, including the speech-to-text, LLM, and text-to-speech loop and the transport bridge. The quality depends far more on context than on wording: give it current documentation through the AssemblyAI docs MCP server and ground it in real call transcripts, rather than letting it work from training data alone. Spec the engineering task first, get a working loop, then layer in your business logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I stop an AI coding agent from writing outdated AssemblyAI code?
&lt;/h3&gt;

&lt;p&gt;Connect the agent to the AssemblyAI docs MCP server so it queries current documentation instead of recalling stale training data. In Claude Code, run claude mcp add --transport http --scope user assemblyai-docs &lt;a href="https://mcp.assemblyai.com/docs" rel="noopener noreferrer"&gt;https://mcp.assemblyai.com/docs&lt;/a&gt;, and optionally add the skill with npx skills add AssemblyAI/assemblyai-skill --global. This eliminates the most common failure — confidently written code that uses a retired endpoint or a renamed parameter, or that misses newer capabilities like agent_context on Universal-3.5 Pro Realtime. Any MCP-aware tool, including Cursor and Codex, can use the same server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude or Codex — which is better for building voice agents?
&lt;/h3&gt;

&lt;p&gt;Both are capable coding agents, and you should run your own task on each rather than trust a leaderboard that shifts every few weeks. Among Claude models, Opus 4.8 is the most capable for the long, multi-step build a voice agent becomes — run it at xhigh effort, the default Claude Code uses for coding work — while Sonnet 4.6 trades some capability for speed and cost. For voice agents specifically, the bigger lever is whether the agent can see current API docs: both Claude and Codex write better code with the AssemblyAI docs MCP server attached, and the Voice Agent API works with Claude Code out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I use real call recordings to build a better voice agent?
&lt;/h3&gt;

&lt;p&gt;Transcribe your existing call recordings with AssemblyAI's pre-recorded speech-to-text API — Universal-3 Pro with speaker labels turned on — then hand the transcripts to your coding agent as context. Real calls reveal the actual account-number formats, product names, phrasing, and turn-taking your agent will face, so the system prompt, keyterms list, and tool definitions get built from evidence instead of guesswork. Because Universal-3.5 Pro Realtime lets you update keyterms mid-session, you can keep feeding what you learn back into the live agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I include my business logic in the first prompt?
&lt;/h3&gt;

&lt;p&gt;No — separate the engineering prompt from the domain prompt. Spec the full technical task (transport, model, latency target, barge-in handling) up front and get a working speech-to-text → LLM → text-to-speech loop you can call and verify first. Add the business context — system prompt, keyterms, tools — in a second phase, ideally generated from real call transcripts. Dumping everything into the opening prompt tangles the architecture with domain logic and makes it much harder to tell what broke when something does.&lt;/p&gt;

&lt;h3&gt;
  
  
  What do I need to prompt Claude to build a voice agent on AssemblyAI?
&lt;/h3&gt;

&lt;p&gt;You need a free AssemblyAI API key, a coding agent (Claude Code is the smoothest path), and the AssemblyAI docs MCP server connected so the agent writes current code. From there, prompt for the engineering skeleton first, then ground it with transcripts of your real calls. The Voice Agent API is a single WebSocket with one flat rate and no SDK to adopt, which keeps the code the agent generates simple enough to verify quickly.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>voiceassistant</category>
      <category>python</category>
    </item>
    <item>
      <title>One Parameter, 20% Fewer Missed Medical Entities</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:25:11 +0000</pubDate>
      <link>https://dev.to/martschweiger/one-parameter-20-fewer-missed-medical-entities-5fl2</link>
      <guid>https://dev.to/martschweiger/one-parameter-20-fewer-missed-medical-entities-5fl2</guid>
      <description>&lt;p&gt;You already have a pipeline. You're already sending clinical audio to AssemblyAI. So this post isn't going to teach you what a transcript is or why drug names are hard. You know.&lt;/p&gt;

&lt;p&gt;Here's the part that matters: there's a single config parameter that gets you about 20% fewer missed medical entities on the exact same audio you're already sending. No new model. No re-integration. No new API key. One line.&lt;/p&gt;

&lt;p&gt;That parameter is domain: "medical-v1", and it turns on &lt;a href="https://www.assemblyai.com/solutions/medical" rel="noopener noreferrer"&gt;Medical Mode&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let me show you what changes, what doesn't, and what it costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The one parameter&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's your existing call to &lt;a href="https://www.assemblyai.com/universal-3-pro" rel="noopener noreferrer"&gt;Universal-3 Pro&lt;/a&gt;, with Medical Mode added:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assemblyai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;aai&lt;/span&gt;

&lt;span class="n"&gt;transcriber&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Transcriber&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TranscriptionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;speech_models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;universal-3-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medical-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# the one parameter
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transcriber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clinical-audio.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. That's the diff. If you were already on Universal-3 Pro, you add one field to the config object you already build. Everything downstream of the transcript—your diarization handling, your redact_pii step, your storage—stays exactly where it is.&lt;/p&gt;

&lt;p&gt;Notice what you didn't do. You didn't swap to a different model family. Medical Mode runs on top of Universal-3 Pro and Universal-3.5 Pro Realtime, so the decoder, the language coverage, the entity accuracy you already rely on—all still there. You're tuning the model you're already running, not replacing it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What actually changes in the transcript&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Universal-3 Pro is already strong on entities—drug names, proper nouns, rare words—because it uses an LLM-based decoder rather than a classic acoustic-only approach. Medical Mode pushes that further specifically for clinical vocabulary. The result: a 3.2% Missed Entity Rate (MER), roughly 20% fewer missed medical entities than Universal-3 Pro alone, and the lowest MER across the providers we benchmarked against—Deepgram, Speechmatics, AWS, and Google. The full numbers live on &lt;a href="https://www.assemblyai.com/benchmarks" rel="noopener noreferrer"&gt;/benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But abstract percentages don't tell you what to expect in your own output. So here's a concrete picture.&lt;br&gt;
&lt;strong&gt;Illustrative example.&lt;/strong&gt; The transcript pair below is constructed to show the &lt;em&gt;kind&lt;/em&gt; of errors Medical Mode catches—it is not measured data. The measured numbers come from &lt;a href="https://www.assemblyai.com/benchmarks" rel="noopener noreferrer"&gt;/benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A clinician dictates a short medication summary. Here is what was actually said:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="s2"&gt;"Patient continues on metformin 500 milligrams twice daily. Started hydrochlorothiazide for the
hypertension, and we'll reassess after the echocardiogram."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…but the errors a general model tends to make cluster exactly where it hurts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="s2"&gt;"Patient continues on Metro Min 500 milligrams twice daily. Started hydrocortisone for the hypertension, and
we'll reassess after the echo cardiogram."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read that second version as a downstream system would. "Metro Min" isn't a drug. "Hydrocortisone" is a real drug—just the wrong one, and a dangerous swap for a thiazide diuretic. "Echo cardiogram" splits a procedure into two tokens your coding logic won't recognize.&lt;/p&gt;

&lt;p&gt;With Medical Mode on, the same audio resolves to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="s2"&gt;"Patient continues on metformin 500 milligrams twice daily. Started hydrochlorothiazide for the
hypertension, and we'll reassess after the echocardiogram."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dosage was never the hard part. The entities were. That's the whole point of measuring missed entities separately from raw word error: a model can nail the filler words and still hand you the wrong drug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to see it on your own audio? Activate Medical Mode in your dashboard and re-run a file you've already transcribed.&lt;/strong&gt; &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;Open your dashboard →&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What does not change&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the part I want to be loud about, because "improve medical accuracy" usually implies a migration project. Here it doesn't.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No model swap. You stay on Universal-3 Pro or Universal-3.5 Pro Realtime. Medical Mode is a setting on those models, not a separate endpoint.&lt;/li&gt;
&lt;li&gt;No re-integration. Same SDK, same Transcriber(), same response shape. Your parsing code doesn't move.&lt;/li&gt;
&lt;li&gt;No new API key. The key you're using right now works. There's nothing to provision.&lt;/li&gt;
&lt;li&gt;No language regression. Medical Mode supports English, Spanish, German, and French, both pre-recorded and streaming. If your traffic is multilingual, the clinical tuning follows the language.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you've ever scoped a "switch transcription providers for better medical accuracy" ticket, you know it usually runs weeks. This is a config change you can ship in an afternoon and roll back just as fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Async and streaming, same flag&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The example above is pre-recorded. Streaming works the same way—you set domain: "medical-v1" on &lt;a href="https://www.assemblyai.com/blog/universal-3-5-pro-realtime" rel="noopener noreferrer"&gt;Universal-3.5 Pro Realtime&lt;/a&gt; and keep your existing socket handling.&lt;/p&gt;

&lt;p&gt;This matters for ambient clinical use. If you're transcribing a live encounter—an ambient scribe, a telehealth visit, a nurse triage line—you get the same entity tuning at streaming latency. Universal-3.5 Pro Realtime's end-of-turn detection reads tonality, pacing, and rhythm rather than silence alone and lands around 300ms, and turning on Medical Mode doesn't change how you consume partial and final transcripts. You're not trading speed for accuracy here; you're getting the clinical vocabulary handling inline.&lt;/p&gt;

&lt;p&gt;One nuance worth knowing if you use &lt;a href="https://www.assemblyai.com/docs" rel="noopener noreferrer"&gt;keyterms prompting&lt;/a&gt;: streaming supports up to 100 keyterms for free, mid-stream. Medical Mode and keyterms aren't mutually exclusive—Medical Mode handles the broad clinical vocabulary, and you can still prompt the handful of facility-specific terms (a local formulary name, a clinic's procedure shorthand) that no general medical model would know.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What it costs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Medical Mode is a $0.15/hr add-on. It stacks on top of your base model price:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Universal-3 Pro + Medical Mode = $0.36/hr (the $0.21/hr async base plus the add-on)&lt;/li&gt;
&lt;li&gt;Universal-3.5 Pro Realtime + Medical Mode = $0.60/hr (the $0.45/hr streaming base plus the add-on)&lt;/li&gt;
&lt;/ul&gt;
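&lt;p&gt;If you want to sanity-check that arithmetic against your own volume, here's a minimal sketch. The rates are the ones quoted above; the 1,000-hour monthly volume and the function names are placeholders I made up for illustration, so swap in your own numbers.&lt;/p&gt;

```python
from decimal import Decimal

# Hourly prices quoted in this post (USD per audio hour).
MEDICAL_MODE_ADDON = Decimal("0.15")
BASE_RATES = {
    "Universal-3 Pro (async)": Decimal("0.21"),
    "Universal-3.5 Pro Realtime (streaming)": Decimal("0.45"),
}


def hourly_rate(base, addon=MEDICAL_MODE_ADDON):
    """Base model price plus the Medical Mode add-on, per audio hour."""
    return base + addon


def total_cost(base, audio_hours):
    """Cost of a given number of audio hours with Medical Mode turned on."""
    return hourly_rate(base) * audio_hours


if __name__ == "__main__":
    hours = 1000  # hypothetical monthly volume, not a figure from the pricing page
    for name, base in BASE_RATES.items():
        print(name, hourly_rate(base), total_cost(base, hours))
```

&lt;p&gt;Decimal keeps the cents exact, which plain floats won't always do when you add prices like 0.21 and 0.15.&lt;/p&gt;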

&lt;p&gt;Full breakdown on the &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;pricing page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's how I'd think about that fifteen cents. The cost of a missed entity isn't the audio minute—it's the downstream correction. A transcript that turns hydrochlorothiazide into hydrocortisone doesn't just need a re-listen; in a clinical workflow it can trigger a manual review, a clinician callback, or worse. Cutting missed entities by ~20% changes how much human QA your pipeline needs. For most teams running clinical audio at volume, that math closes fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A note on PHI handling&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Since you're processing clinical audio, the compliance question comes up. AssemblyAI enables covered entities and their business associates subject to HIPAA to use the AssemblyAI services to process protected health information (PHI). AssemblyAI is considered a business associate under HIPAA, and we offer a standard Business Associate Addendum (BAA). Medical Mode runs on that same BAA-eligible infrastructure.&lt;/p&gt;

&lt;p&gt;Practically, that means the tools you'd reach for—diarization to separate clinician from patient, and redact_pii to strip PHI from stored transcripts—are available alongside Medical Mode, on the same models. You're not assembling a separate compliance stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Do I have to migrate off my current model to use Medical Mode?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;No. Medical Mode is a domain: "medical-v1" setting on Universal-3 Pro and Universal-3.5 Pro Realtime. If you're already on either, you add one parameter—no model swap, no re-integration, no new key.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Does Medical Mode work for live transcription?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Yes. It works on Universal-3.5 Pro Realtime with the same flag, so ambient scribes and telehealth pipelines get clinical entity tuning at streaming latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Which languages does Medical Mode support?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;English, Spanish, German, and French, for both pre-recorded and streaming audio.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How much does it add to my bill?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;$0.15/hr on top of the base model. That's $0.36/hr for Universal-3 Pro async and $0.60/hr for Universal-3.5 Pro Realtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Can I use keyterms prompting and Medical Mode together?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Yes. Medical Mode covers broad clinical vocabulary; keyterms cover your facility-specific terms—up to 100 free mid-stream on streaming, up to 1,000 on async for an additional $0.05/hr.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>healthcare</category>
      <category>speechtotext</category>
    </item>
    <item>
      <title>How We Measure Medical Transcription: MER vs. WER</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:25:05 +0000</pubDate>
      <link>https://dev.to/martschweiger/how-we-measure-medical-transcription-mer-vs-wer-58n1</link>
      <guid>https://dev.to/martschweiger/how-we-measure-medical-transcription-mer-vs-wer-58n1</guid>
      <description>&lt;p&gt;Word error rate has a flattering quality. It rolls every mistake into one clean percentage, and a clean percentage is easy to put on a slide.&lt;/p&gt;

&lt;p&gt;It's also lying to you about medical audio.&lt;/p&gt;

&lt;p&gt;Here's the problem in one sentence: WER treats every word as equal. Missing an "um" costs exactly the same as turning "hydrochlorothiazide" into "hydrocortisone." A filler word and a diuretic, weighted identically.&lt;/p&gt;

&lt;p&gt;That's fine for a podcast. It's indefensible for a clinical transcript. So let me walk through why WER misleads evaluators, what we measure instead, and how the numbers actually shake out.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What WER counts, and what it ignores&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;WER is substitutions plus insertions plus deletions, divided by the number of reference words. Every error in that numerator carries weight one. The metric has no concept of which words matter.&lt;/p&gt;

&lt;p&gt;Think about what a medical conversation actually contains. The vast majority of words are connective tissue—"the," "and," "patient," "we'll," "okay." A handful of words carry the entire clinical meaning: the drug, the dose, the diagnosis, the procedure. Maybe 5% of the tokens hold 95% of the risk.&lt;/p&gt;

&lt;p&gt;WER averages across all of them. So a model can flub the 5% that matter and still post a great-looking score, because it nailed the 95% that don't. We've written before about how &lt;a href="https://www.assemblyai.com/blog/word-error-rate-is-broken" rel="noopener noreferrer"&gt;word error rate is broken&lt;/a&gt; and how &lt;a href="https://www.assemblyai.com/blog/new-word-error-rate-wer-benchmark" rel="noopener noreferrer"&gt;your WER benchmark might be lying to you&lt;/a&gt;—this is the same disease in its most dangerous form.&lt;br&gt;
Consider two transcription errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"uh, the patient" → "the patient" (dropped a filler)&lt;/li&gt;
&lt;li&gt;"metoprolol" → "metformin" (a cardiac drug became a diabetes drug)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To WER, both are one error. One of them is noise. The other could change a treatment decision. A metric that can't tell them apart is the wrong metric for healthcare.&lt;/p&gt;
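&lt;p&gt;To make that concrete, here's a small sketch of WER as a word-level edit distance divided by the reference length. It's a textbook Levenshtein implementation, not the scoring code behind our benchmark, and the sample sentence is invented for illustration.&lt;/p&gt;

```python
def word_edit_distance(reference, hypothesis):
    """Minimum substitutions + insertions + deletions between two word lists."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dist[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete every remaining reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(hyp)], len(ref)


def wer(reference, hypothesis):
    """Word error rate: total word edits divided by reference word count."""
    errors, ref_len = word_edit_distance(reference, hypothesis)
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return errors / ref_len


if __name__ == "__main__":
    said = "uh the patient takes metoprolol daily"
    dropped_filler = "the patient takes metoprolol daily"
    wrong_drug = "uh the patient takes metformin daily"
    # Both hypotheses contain exactly one word-level error.
    print(wer(said, dropped_filler), wer(said, wrong_drug))
```

&lt;p&gt;Both hypotheses score the same, because each one is a single edit against a six-word reference. Nothing in the formula knows that one of those edits changed the drug.&lt;/p&gt;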

&lt;h2&gt;
  
  
  &lt;strong&gt;Missed Entity Rate, defined&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So we measure the thing that actually matters: how often the clinically meaningful words come out wrong.&lt;br&gt;
Missed Entity Rate (MER) is the percentage of medical entities not correctly transcribed. By "entities" we mean the words a clinician or a downstream coding system depends on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drug names—generic and brand&lt;/li&gt;
&lt;li&gt;Diagnoses and conditions&lt;/li&gt;
&lt;li&gt;Procedures&lt;/li&gt;
&lt;li&gt;Dosages and units&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MER ignores whether the model dropped an "um." It asks a narrower, harder question: when a drug name appeared in the audio, did the transcript get it right? When a procedure was named, did it survive intact, or did "echocardiogram" become "echo cardiogram" and fall out of your entity extraction?&lt;/p&gt;

&lt;p&gt;This is the metric that maps to clinical risk. A model can sit at a respectable WER and still have a terrible MER, and if you're building anything that touches patient care, MER is the number you should be staring at.&lt;/p&gt;

&lt;p&gt;Want to see this on your own audio instead of ours?&lt;/p&gt;

&lt;p&gt;Run the benchmark on representative files from your domain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/playground" rel="noopener noreferrer"&gt;Try playground&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The benchmark&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's where the two metrics part ways. We benchmarked &lt;a href="https://www.assemblyai.com/universal-3-pro" rel="noopener noreferrer"&gt;Universal-3 Pro&lt;/a&gt; with &lt;a href="https://www.assemblyai.com/solutions/medical" rel="noopener noreferrer"&gt;Medical Mode&lt;/a&gt; against the providers evaluators actually shortlist, measuring both MER and WER. Lower is better for each; MER is the share of entities not correctly transcribed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider / configuration&lt;/th&gt;
&lt;th&gt;MER&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AssemblyAI Universal-3 Pro w/ Medical Mode&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;3.6%&lt;/td&gt;
&lt;td&gt;5.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speechmatics Enhanced Medical&lt;/td&gt;
&lt;td&gt;4.7%&lt;/td&gt;
&lt;td&gt;6.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepgram Nova-3 Medical&lt;/td&gt;
&lt;td&gt;8.7%&lt;/td&gt;
&lt;td&gt;5.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Transcribe Medical&lt;/td&gt;
&lt;td&gt;24.4%&lt;/td&gt;
&lt;td&gt;12.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google Medical Conversation&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Read the table for a second before the conclusion, because the rows tell the story better than I can.&lt;/p&gt;

&lt;p&gt;Look at Deepgram Nova-3 Medical: 5.9% WER, 8.7% MER. If you'd evaluated on WER alone, you'd see a 0.6-point difference from the top of the table and shrug. But its MER is more than double ours: it's missing entities at almost three times the rate while looking nearly identical on the headline metric. That's WER's flattery in action.&lt;br&gt;
AWS Transcribe Medical makes the gap impossible to miss: 12.9% WER, but a 24.4% MER. Nearly a quarter of medical entities not transcribed correctly. The WER alone wouldn't scream that loudly.&lt;/p&gt;
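&lt;p&gt;If you'd rather check those gaps yourself than take my word for it, here's a quick sketch that derives them from the table above. The figures are copied from the table; the helper name is mine, and the Google row is left out because it has no numbers to compare.&lt;/p&gt;

```python
# (MER %, WER %) copied from the benchmark table above.
RESULTS = {
    "AssemblyAI Universal-3 Pro w/ Medical Mode": (3.2, 5.3),
    "Deepgram": (3.6, 5.5),
    "Speechmatics Enhanced Medical": (4.7, 6.1),
    "Deepgram Nova-3 Medical": (8.7, 5.9),
    "AWS Transcribe Medical": (24.4, 12.9),
}
BASELINE = "AssemblyAI Universal-3 Pro w/ Medical Mode"


def gap_vs_baseline(name):
    """Return (WER gap in points, MER as a multiple of the baseline MER)."""
    mer, wer = RESULTS[name]
    base_mer, base_wer = RESULTS[BASELINE]
    return round(wer - base_wer, 1), round(mer / base_mer, 1)


if __name__ == "__main__":
    for name in RESULTS:
        wer_gap, mer_multiple = gap_vs_baseline(name)
        print(name, "WER gap:", wer_gap, "MER multiple:", mer_multiple)
```

&lt;p&gt;The two columns it prints move very differently from row to row, which is the whole argument for reporting MER next to WER instead of WER alone.&lt;/p&gt;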

&lt;p&gt;Universal-3 Pro with Medical Mode posts the lowest MER in the set at 3.2%. That's the claim that matters here, and it's an entity claim, not a word-count claim. Full methodology and the rest of the numbers are on &lt;a href="https://www.assemblyai.com/benchmarks" rel="noopener noreferrer"&gt;/benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How we measure it, at a high level&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You shouldn't take a benchmark on faith, so here's the shape of how MER is computed.&lt;/p&gt;

&lt;p&gt;Start with reference transcripts—human-verified ground truth for real clinical-style audio. Identify the medical entities in each reference: the drugs, diagnoses, procedures, and dosages. Run each model over the same audio. Then, for every reference entity, check whether the model's output contains a correct match. The share that don't match is the MER.&lt;/p&gt;

&lt;p&gt;The detail that does the work is entity alignment. "Echocardiogram" rendered as "echo cardiogram" is two tokens where the reference has one, so a naive comparison can misfire. Robust entity matching has to handle tokenization differences, casing, and generic-versus-brand naming, so that you're scoring clinical correctness rather than punishing formatting. This is the same care we describe in &lt;a href="https://www.assemblyai.com/blog/how-to-evaluate-speech-recognition-models" rel="noopener noreferrer"&gt;how to evaluate speech recognition models&lt;/a&gt;.&lt;/p&gt;
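&lt;p&gt;Here's a rough sketch of that matching step. To be clear about what's assumed: this is not our benchmark code, the brand-to-generic map is a one-entry placeholder, and matching on a space-stripped string is a deliberately crude way to forgive tokenization differences. It can produce false matches across word boundaries and it ignores how many times an entity was mentioned, so treat it as a starting point.&lt;/p&gt;

```python
import re

# Hypothetical brand-to-generic map, for illustration only. A real evaluation
# would use a curated drug vocabulary.
BRAND_TO_GENERIC = {"glucophage": "metformin"}


def normalize(text):
    """Lowercase, strip punctuation, map brands to generics, drop spaces."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    words = [BRAND_TO_GENERIC.get(word, word) for word in words]
    return "".join(words)


def missed_entity_rate(reference_entities, hypothesis):
    """Share of reference entities with no match in the hypothesis transcript."""
    if not reference_entities:
        raise ValueError("need at least one reference entity")
    haystack = normalize(hypothesis)
    missed = [e for e in reference_entities if normalize(e) not in haystack]
    return len(missed) / len(reference_entities)


if __name__ == "__main__":
    entities = ["metformin", "hydrochlorothiazide", "echocardiogram"]
    hypothesis = (
        "Patient continues on Metro Min 500 milligrams twice daily. "
        "Started hydrocortisone for the hypertension, and we'll reassess "
        "after the echo cardiogram."
    )
    print(missed_entity_rate(entities, hypothesis))
```

&lt;p&gt;In this example "echo cardiogram" still counts as a hit because the spacing is normalized away, while "Metro Min" and "hydrocortisone" count as misses. That's the behavior you want: formatting forgiven, wrong drugs not.&lt;/p&gt;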

&lt;p&gt;It's worth saying plainly: no public benchmark is your benchmark. Our test set reflects our distribution of accents, recording conditions, specialties, and drug frequencies. Yours will differ. The point of publishing MER isn't "trust our number"—it's "measure the right thing, then measure it on your audio."&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why this isn't just an AssemblyAI talking point&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I'd make the MER argument even if we lost the benchmark, because the alternative is worse. An industry that evaluates medical transcription on WER alone is optimizing models to be confident and wrong about exactly the words that carry risk. A model trained and tuned to minimize WER has every incentive to get common words perfect and treat a rare drug name as a rounding error.&lt;/p&gt;

&lt;p&gt;That's backwards. The rare words are the whole job. If you want to understand how far speech-to-text has come on the words that aren't rare, we cover that in &lt;a href="https://www.assemblyai.com/blog/how-accurate-speech-to-text" rel="noopener noreferrer"&gt;how accurate is speech-to-text in 2026&lt;/a&gt;—but general accuracy and medical entity accuracy are different problems, and conflating them is how teams ship clinical tools that look fine in a demo and fail on the third encounter.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What's the difference between MER and WER?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;WER measures the share of all words transcribed incorrectly, weighting every word equally. MER measures only the share of medical entities—drug names, diagnoses, procedures, dosages—transcribed incorrectly. A model can have a low WER and a high MER if it gets common words right and clinical terms wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why is WER a poor metric for medical transcription?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Because it treats a dropped "um" the same as a wrong drug name. In clinical audio, a small fraction of words carry almost all the meaning and risk, and WER averages them away. See &lt;a href="https://www.assemblyai.com/blog/word-error-rate-is-broken" rel="noopener noreferrer"&gt;word error rate is broken&lt;/a&gt; for the longer argument.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Which model has the lowest MER?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In our benchmark, Universal-3 Pro with Medical Mode posts the lowest MER at 3.2%, ahead of Deepgram, Speechmatics, AWS, and Google. The full table is on &lt;a href="https://www.assemblyai.com/benchmarks" rel="noopener noreferrer"&gt;/benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Can I reproduce this on my own audio?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Yes—and you should. Public benchmarks reflect the publisher's data distribution, not yours. Run your own representative clinical files through the &lt;a href="https://www.assemblyai.com/playground" rel="noopener noreferrer"&gt;playground&lt;/a&gt; and compare entities against your ground truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Does a low MER mean I can skip human review?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;No. A lower MER means fewer entities to catch and less QA burden, but clinical workflows still warrant human verification. The value of MER is telling you &lt;em&gt;where&lt;/em&gt; the residual risk lives—in the entities—so you can review the right things.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>healthcare</category>
      <category>speechtotext</category>
    </item>
    <item>
      <title>Wrong Drug Name In, Wrong SOAP Note Out: Error Propagation</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:24:40 +0000</pubDate>
      <link>https://dev.to/martschweiger/wrong-drug-name-in-wrong-soap-note-out-error-propagation-2mcl</link>
      <guid>https://dev.to/martschweiger/wrong-drug-name-in-wrong-soap-note-out-error-propagation-2mcl</guid>
      <description>&lt;p&gt;A clinical AI pipeline looks deceptively simple on a whiteboard. Audio goes in. A speech-to-text model turns it into text. An LLM reads that text and writes the SOAP note, pulls the medication list, maybe suggests ICD-10 codes for billing. Three boxes, two arrows, done.&lt;/p&gt;

&lt;p&gt;The problem lives in the arrows.&lt;/p&gt;

&lt;p&gt;Every stage in that pipeline inherits whatever the stage before it produced. Not "mostly inherits." Inherits completely. The LLM that writes your assessment never hears the audio. It never sees the waveform. It sees text—and it treats that text as ground truth, because it has no other choice. If the transcription layer hands it the word "hydrocortisone" when the clinician said "hydrochlorothiazide," the model doesn't flag the discrepancy. It can't. There's nothing to compare against. It writes a fluent, confident, clinically coherent note about the wrong drug.&lt;/p&gt;

&lt;p&gt;That's the failure mode nobody benchmarks for. Garbage in, fluent garbage out.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The error doesn't stay where it started&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's what makes entity errors in clinical pipelines worse than they look at first glance: they don't fail loudly. A misrecognized drug name doesn't produce a typo or an [INAUDIBLE] tag. It produces a different real word that the downstream model accepts without suspicion.&lt;/p&gt;

&lt;p&gt;Walk through one swap. A physician dictates a follow-up for a patient on &lt;strong&gt;hydrochlorothiazide&lt;/strong&gt; for hypertension. The STT layer transcribes it as &lt;strong&gt;hydrocortisone&lt;/strong&gt;—a real drug, spelled correctly, phonetically adjacent, and completely wrong therapeutically. One's a thiazide diuretic for blood pressure. The other's a corticosteroid.&lt;/p&gt;

&lt;p&gt;Now watch the blast radius.&lt;/p&gt;

&lt;p&gt;The medication list updates to hydrocortisone. The LLM writing the assessment reads "patient on hydrocortisone" and reasons accordingly—maybe it notes steroid considerations, maybe it adjusts the plan around a drug the patient never took. The coding step downstream maps to the wrong therapeutic class. The billing record reflects a medication that isn't in the chart for any clinical reason. And every one of those artifacts reads cleanly. A reviewer skimming the SOAP note sees fluent, plausible prose. Nothing screams "error" because the error has been laundered into grammatical, confident English at every stage.&lt;/p&gt;

&lt;p&gt;Try another. The clinician says &lt;strong&gt;metformin&lt;/strong&gt;—a first-line drug for type 2 diabetes. The transcript says &lt;strong&gt;metronidazole&lt;/strong&gt;, an antibiotic. Suddenly the note implies a diabetic patient is being managed with an antimicrobial, the problem list drifts, and the coding logic follows the transcript off a cliff. The LLM can't recover metformin. It was never given metformin to recover.&lt;/p&gt;

&lt;p&gt;This is the part that gets underappreciated: the LLM isn't the weak link here. A frontier model writing the note might be doing flawless work—faithfully, accurately summarizing the text it received. The accuracy ceiling for the entire pipeline was set upstream, at the transcription layer, before the LLM ever woke up.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why fixing it downstream doesn't work&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The natural instinct is to patch this at the LLM stage. Add a verification prompt. Ask the model to double-check medications. Build a reconciliation step.&lt;/p&gt;

&lt;p&gt;It doesn't hold up, and the reason is information-theoretic, not a matter of prompt engineering. You cannot reconstruct information that was destroyed before it reached you. When the STT layer collapsed "hydrochlorothiazide" into "hydrocortisone," the signal that distinguished those two words—the acoustic detail in the audio—is gone by the time the text exists. The LLM has access to the wrong word and a high-quality language model's prior that the wrong word is perfectly reasonable in context. Asking it to catch the error is asking it to detect a problem it has no evidence for.&lt;/p&gt;

&lt;p&gt;You could feed the audio to the LLM directly and skip the transcript. Some teams will. But for the vast majority of production clinical pipelines—ambient scribes, dictation workflows, coding automation—text is the interface, the audit trail, and the thing humans actually review. The transcript is load-bearing. Which means transcript accuracy, specifically &lt;em&gt;entity&lt;/em&gt; accuracy, is the highest-leverage place in the entire system to prevent propagation.&lt;/p&gt;

&lt;p&gt;Fix it at the source and everything downstream gets cheaper, safer, and more trustworthy. Fix it downstream and you're building increasingly elaborate machinery to compensate for a problem that should never have entered the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to see how entity accuracy holds up on your own clinical audio?&lt;/strong&gt; &lt;a href="https://www.assemblyai.com/solutions/medical" rel="noopener noreferrer"&gt;Talk to our team&lt;/a&gt; or test a file in the &lt;a href="https://www.assemblyai.com/playground" rel="noopener noreferrer"&gt;playground&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why WER isn't the metric that matters here&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For years the industry has graded speech-to-text on word error rate. It's a useful number. It's also the wrong number for clinical work, and the reason traces directly back to error propagation.&lt;/p&gt;

&lt;p&gt;WER treats every word as equally important. Miss "the," miss "hydrochlorothiazide": both count as one error against the same denominator. A model can post an impressive WER while quietly fumbling the exact words that propagate downstream. We've written before about &lt;a href="https://www.assemblyai.com/blog/word-error-rate-is-broken" rel="noopener noreferrer"&gt;why WER is broken&lt;/a&gt; and &lt;a href="https://www.assemblyai.com/blog/new-word-error-rate-wer-benchmark" rel="noopener noreferrer"&gt;why your WER benchmark might be lying to you&lt;/a&gt;. The short version is that a single aggregate number hides the distribution of &lt;em&gt;which&lt;/em&gt; words get missed.&lt;/p&gt;

&lt;p&gt;In a clinical pipeline, the distribution is everything. The words that matter are the entities: drug names, dosages, proper nouns, lab values, anatomical terms. A model that nails common words and drops entities will look fine on WER and fail catastrophically on the thing you actually care about.&lt;/p&gt;

&lt;p&gt;That's why we track missed entity rate (MER) as a first-class metric, not an afterthought. MER measures how often the model drops or mangles the clinical entities that drive every downstream decision. If you're only watching WER, you're watching the wrong gauge. Our full methodology lives in &lt;a href="https://www.assemblyai.com/blog/how-to-evaluate-speech-recognition-models" rel="noopener noreferrer"&gt;how to evaluate speech recognition models&lt;/a&gt;.&lt;/p&gt;
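
&lt;p&gt;To make the gap between the two gauges concrete, here is a minimal sketch in plain Python. It is a toy illustration, not the benchmark methodology: the &lt;code&gt;word_error_rate&lt;/code&gt; and &lt;code&gt;missed_entity_rate&lt;/code&gt; helpers are hypothetical names, the sentences are made up, and the entity check assumes single-word entities matched verbatim.&lt;/p&gt;

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def missed_entity_rate(entities, hypothesis):
    """Share of reference entities that do not appear verbatim in the hypothesis.

    Toy assumption: every entity is a single word.
    """
    hyp_words = set(hypothesis.lower().split())
    missed = [e for e in entities if e.lower() not in hyp_words]
    return len(missed) / len(entities)


reference = "the patient is on hydrochlorothiazide and metformin for now"
entities = ["hydrochlorothiazide", "metformin"]

# Hypothesis A fumbles a filler word; hypothesis B swaps a drug name.
hyp_a = "a patient is on hydrochlorothiazide and metformin for now"
hyp_b = "the patient is on hydrocortisone and metformin for now"

wer_a, wer_b = word_error_rate(reference, hyp_a), word_error_rate(reference, hyp_b)
mer_a, mer_b = missed_entity_rate(entities, hyp_a), missed_entity_rate(entities, hyp_b)

print(f"A: WER={wer_a:.3f} MER={mer_a:.2f}")
print(f"B: WER={wer_b:.3f} MER={mer_b:.2f}")
```

&lt;p&gt;Both hypotheses contain exactly one wrong word, so their WER is identical. Only the entity-level score separates the harmless slip from the wrong drug.&lt;/p&gt;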
&lt;h2&gt;
  
  
  &lt;strong&gt;What Medical Mode actually does about it&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/universal-3-pro" rel="noopener noreferrer"&gt;Universal-3 Pro&lt;/a&gt; already delivers best-in-market entity accuracy on drug names, proper nouns, and rare words. Medical Mode tightens that further for clinical audio specifically.&lt;/p&gt;

&lt;p&gt;Turning it on is one parameter:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;import assemblyai as aai

transcriber &lt;span class="o"&gt;=&lt;/span&gt; aai.Transcriber&lt;span class="o"&gt;()&lt;/span&gt;
config &lt;span class="o"&gt;=&lt;/span&gt; aai.TranscriptionConfig&lt;span class="o"&gt;(&lt;/span&gt;
&lt;span class="nv"&gt;speech_models&lt;/span&gt;&lt;span class="o"&gt;=[&lt;/span&gt;&lt;span class="s2"&gt;"universal-3-pro"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;,
&lt;span class="nv"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"medical-v1"&lt;/span&gt;,
&lt;span class="o"&gt;)&lt;/span&gt;
transcript &lt;span class="o"&gt;=&lt;/span&gt; transcriber.transcribe&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"audio.wav"&lt;/span&gt;, config&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That domain="medical-v1" flag works on both Universal-3 Pro and &lt;a href="https://www.assemblyai.com/blog/universal-3-5-pro-realtime" rel="noopener noreferrer"&gt;Universal-3.5 Pro Realtime&lt;/a&gt;, across English, Spanish, German, and French.&lt;/p&gt;

&lt;p&gt;The numbers, measured on our &lt;a href="https://www.assemblyai.com/benchmarks" rel="noopener noreferrer"&gt;benchmarks&lt;/a&gt;: Universal-3 Pro with Medical Mode posts a 3.2% MER—the lowest across every provider we tested, and roughly 20% fewer missed medical entities than Universal-3 Pro alone. Here's how that stacks up:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;MER&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AssemblyAI Universal-3 Pro w/ Medical Mode&lt;/td&gt;
&lt;td&gt;3.2%&lt;/td&gt;
&lt;td&gt;5.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;3.6%&lt;/td&gt;
&lt;td&gt;5.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speechmatics Enhanced Medical&lt;/td&gt;
&lt;td&gt;4.7%&lt;/td&gt;
&lt;td&gt;6.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepgram Nova-3 Medical&lt;/td&gt;
&lt;td&gt;8.7%&lt;/td&gt;
&lt;td&gt;5.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Transcribe Medical&lt;/td&gt;
&lt;td&gt;24.4%&lt;/td&gt;
&lt;td&gt;12.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Look at the AWS row. A 24.4% MER means roughly one in four clinical entities gets missed or mangled. In a pipeline that propagates every error downstream, that's not a transcription model. It's an error generator with a confident downstream LLM amplifying its mistakes into polished, wrong documentation.&lt;/p&gt;

&lt;p&gt;Medical Mode is a $0.15/hr add-on on top of Universal-3 Pro's $0.21/hr, so $0.36/hr all in. Full details are on the &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;pricing page&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The rest of a pipeline you can actually trust&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Entity accuracy is the highest-leverage lever, but it's not the only one. Two more pieces matter for a clinical pipeline you'd put your name on.&lt;/p&gt;

&lt;p&gt;First, &lt;a href="https://www.assemblyai.com/blog/what-is-speaker-diarization-and-how-does-it-work" rel="noopener noreferrer"&gt;speaker diarization&lt;/a&gt;. In an exam-room recording, who said what changes the meaning of the note. A symptom the patient reports and an instruction the clinician gives are different clinical objects, and diarization keeps them separated before the LLM ever tries to structure them. Get the speaker attribution wrong and you've introduced a different flavor of the same propagation problem.&lt;/p&gt;

&lt;p&gt;Second, PII redaction. A trustworthy pipeline doesn't just transcribe accurately—it controls what flows downstream. Redacting protected identifiers at the transcript layer means the LLM and every system after it operate on the minimum data they need.&lt;/p&gt;

&lt;p&gt;On the infrastructure side: AssemblyAI enables covered entities and their business associates subject to HIPAA to use the AssemblyAI services to process protected health information (PHI). AssemblyAI is considered a business associate under HIPAA, and we offer a standard Business Associate Addendum (BAA). The services run on BAA-eligible infrastructure, with the BAA included.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Why can't a good LLM just catch a wrong drug name?
&lt;/h2&gt;

&lt;p&gt;Because it never receives the audio. The LLM only sees the transcript. Once the speech-to-text layer has substituted one real drug name for another, the acoustic information that distinguished them is gone. The model has no evidence anything is wrong, so it writes a fluent note around the incorrect entity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Isn't a low word error rate enough to trust a clinical transcript?
&lt;/h2&gt;

&lt;p&gt;No. WER weights every word equally, so a model can score well overall while consistently missing the drug names, dosages, and proper nouns that drive downstream decisions. Missed entity rate measures the words that actually propagate. That's why we report MER alongside WER.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do I turn on Medical Mode?
&lt;/h2&gt;

&lt;p&gt;Set one parameter: domain="medical-v1". It works on Universal-3 Pro and Universal-3.5 Pro Realtime, in English, Spanish, German, and French.&lt;/p&gt;

&lt;h2&gt;
  
  
  What does Medical Mode cost?
&lt;/h2&gt;

&lt;p&gt;It's a $0.15/hr add-on. Combined with Universal-3 Pro's $0.21/hr, the total is $0.36/hr.&lt;/p&gt;

&lt;h2&gt;
  
  
  What if my audio has rare or facility-specific drug names?
&lt;/h2&gt;

&lt;p&gt;Use keyterms prompting to load them: up to 1,000 terms on async, up to 100 on streaming, updatable mid-stream. It's the mechanism for specialized vocabulary that even a strong general model might not weight heavily enough.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>healthcare</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Veterinary Transcription API: Species, Breeds &amp; Vet Drugs</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:24:33 +0000</pubDate>
      <link>https://dev.to/martschweiger/veterinary-transcription-api-species-breeds-vet-drugs-3d1e</link>
      <guid>https://dev.to/martschweiger/veterinary-transcription-api-species-breeds-vet-drugs-3d1e</guid>
      <description>&lt;p&gt;A vet's exam room is one of the hardest audio environments in medicine, and almost nobody builds for it.&lt;/p&gt;

&lt;p&gt;Think about what a general-purpose speech-to-text model has to handle. A clinician dictating over a barking dog. A tech reading back a weight-based dose. An owner answering questions from across the room. Drug names that overlap with human pharmacology but get used at different doses for different reasons. Breed names that sound like nothing in any training corpus built on human conversation. "Brachycephalic." "Maine Coon." "Enrofloxacin."&lt;/p&gt;

&lt;p&gt;Most transcription APIs were trained on podcasts, call-center audio, and meetings. Drop a veterinary exam on them and they fall apart exactly where it matters—on the entities. The good news is that the same vocabulary-accuracy engine we built for human clinical work handles veterinary transcription remarkably well. Works for doctors, works for you. Let me show you why, and where the honest gaps are.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The three things that break veterinary transcription&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Veterinary audio fails general models in three predictable places. Knowing them tells you exactly what to configure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Species and breeds.&lt;/strong&gt; This is the vocabulary human-only training data never sees. A model that's never encountered "brachycephalic" won't render it correctly; it'll guess something phonetically close and wrong. The same goes for breed names that carry real clinical weight. A French Bulldog's breathing complaint and a Maine Coon's cardiac screening aren't trivia; they're context that shapes the whole note. If the breed is mangled, the record loses meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Drug names, including the dual-use ones.&lt;/strong&gt; Here's where it gets interesting. A lot of veterinary pharmacology overlaps directly with human medicine. Gabapentin, meloxicam, and metronidazole show up in both. But veterinary practice also leans on drugs and brand names that rarely appear in human dictation: carprofen (Rimadyl), maropitant (Cerenia), enrofloxacin (Baytril), firocoxib (Previcox). A model strong on human clinical terms gets you most of the way. The vet-specific names are the gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clinic-floor audio.&lt;/strong&gt; Barking. Multiple handlers. An owner, a tech, and a clinician all in one room. This is overlapping, noisy, multi-speaker audio—the kind that needs real diarization to untangle who said what.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The engine: entity accuracy first&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The foundation is &lt;a href="https://www.assemblyai.com/universal-3-pro" rel="noopener noreferrer"&gt;Universal-3 Pro&lt;/a&gt;, which delivers best-in-market accuracy on the words that matter most—drug names, proper nouns, and rare words. That's the category veterinary transcription lives and dies on. A model that handles common conversational English beautifully but drops the drug name is useless in a clinical record, human or animal. We've argued at length that &lt;a href="https://www.assemblyai.com/blog/word-error-rate-is-broken" rel="noopener noreferrer"&gt;word error rate is the wrong way to judge this&lt;/a&gt;—what you actually care about is whether the entities survive.&lt;/p&gt;

&lt;p&gt;Universal-3 Pro supports English, Spanish, French, German, Italian, and Portuguese, with native code-switching, at $0.21/hr async.&lt;/p&gt;

&lt;p&gt;Then there's Medical Mode. And here I want to be straight with you rather than oversell it.&lt;/p&gt;

&lt;p&gt;Medical Mode is built on human clinical terminology. It's tuned for the pharmacology, anatomy, and clinical language of human medicine—and that terminology overlaps heavily with veterinary practice. The drug classes, the dosing language, the anatomical vocabulary, the structure of a clinical note: a large share of that is shared. So Medical Mode gives you a real lift on the overlapping clinical pharmacology that veterinary work shares with human medicine. It is not a veterinary-specific model, and I won't pretend it is. It's a clinical-language engine that happens to cover most of what a vet says, because most of what a vet says is clinical language.&lt;/p&gt;

&lt;p&gt;Turning it on is one parameter—domain="medical-v1"—and it works on both Universal-3 Pro and &lt;a href="https://www.assemblyai.com/blog/universal-3-5-pro-realtime" rel="noopener noreferrer"&gt;Universal-3.5 Pro Realtime&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Want to try it on a real exam recording?&lt;/strong&gt; &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;Get your free API key&lt;/a&gt; and run a file in minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Keyterms prompting: where you close the gap&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So Medical Mode covers the overlap. What covers carprofen, brachycephalic, and the name of the specialty drug your practice stocks that nobody else does?&lt;/p&gt;

&lt;p&gt;Keyterms prompting. This is the mechanism for specialized and custom vocabulary, and it's the single most important feature for veterinary work specifically. You hand the model a list of the exact terms you care about—breed names, species, vet-specific and brand drug names, your practice's house vocabulary—and it weights them during recognition.&lt;/p&gt;

&lt;p&gt;The limits: up to 1,000 terms on async (a $0.05/hr add-on) and up to 100 on streaming, where keyterms are free and updatable mid-stream. That last detail matters more than it sounds. Mid-stream updates mean a live transcription session can load a new patient's breed and medication context as the clinician moves from one exam room to the next.&lt;/p&gt;

&lt;p&gt;Here's the pattern:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;assemblyai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;aai&lt;/span&gt;

&lt;span class="nx"&gt;transcriber&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Transcriber&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TranscriptionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="nx"&gt;speech_models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;universal-3-pro&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;medical-v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nx"&gt;keyterms_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;carprofen&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;maropitant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;brachycephalic&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;transcriber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;audio.wav&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three terms there for illustration. In practice you'd load your full formulary, the breeds your clinic sees most, and any species-specific terms your dictation uses. Load enrofloxacin, firocoxib, Cerenia, Baytril, Previcox, French Bulldog, Maine Coon—whatever your records actually contain.&lt;/p&gt;
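
&lt;p&gt;One way to assemble that list is sketched below, assuming your formulary and breed list already live in plain Python lists. The &lt;code&gt;build_keyterms&lt;/code&gt; helper is a hypothetical name, not part of any SDK; it merges the lists in priority order, drops blanks and case-insensitive duplicates, and trims to the async or streaming limit quoted above.&lt;/p&gt;

```python
ASYNC_KEYTERM_LIMIT = 1000      # async limit quoted in this post
STREAMING_KEYTERM_LIMIT = 100   # streaming limit quoted in this post


def build_keyterms(*term_groups, limit=ASYNC_KEYTERM_LIMIT):
    """Merge term lists in priority order, dropping blanks and duplicates."""
    seen = set()
    merged = []
    for group in term_groups:
        for term in group:
            cleaned = " ".join(term.split())  # collapse stray whitespace
            key = cleaned.lower()             # compare case-insensitively
            if cleaned and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    # Earlier groups win when the list has to be cut down to the limit.
    return merged[:limit]


# Illustrative inputs; a real clinic would load these from its own records.
formulary = ["carprofen", "Rimadyl", "maropitant", "Cerenia", "enrofloxacin", "Baytril"]
breeds = ["French Bulldog", "Maine Coon", "french bulldog "]
clinical_terms = ["brachycephalic", ""]

async_terms = build_keyterms(formulary, breeds, clinical_terms)
streaming_terms = build_keyterms(
    formulary, breeds, clinical_terms, limit=STREAMING_KEYTERM_LIMIT
)
print(async_terms)
```

&lt;p&gt;The resulting list can be passed as &lt;code&gt;keyterms_prompt&lt;/code&gt; in the config shown earlier. Putting the formulary first means drug names survive if the list ever has to be truncated.&lt;/p&gt;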

&lt;h2&gt;
  
  
  &lt;strong&gt;What the gap actually looks like&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's an illustrative example—not a measured benchmark, just a concrete picture of the difference keyterms prompting makes on a vet-specific drug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Illustrative example&lt;/strong&gt;&lt;br&gt;
Clinician says: "Started the Frenchie on carprofen, 2.2 milligrams per kilogram, twice daily."&lt;/p&gt;

&lt;p&gt;Without keyterms prompting: "Started the Frenchie on &lt;em&gt;car profen&lt;/em&gt;, 2.2 milligrams per kilogram, twice daily."&lt;/p&gt;

&lt;p&gt;With keyterms_prompt=["carprofen"]: "Started the Frenchie on carprofen, 2.2 milligrams per kilogram, twice daily."&lt;/p&gt;

&lt;p&gt;The drug name is the whole point of the sentence. Get it wrong and the medication record is wrong, the dose is attached to a non-word, and anything downstream that reads the note inherits the error. This is exactly the entity-accuracy problem at the center of every clinical pipeline—the difference is that in vet work, you've got a clean, simple lever to fix it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Multi-speaker exam rooms: diarization&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A vet exam usually isn't one voice. It's a clinician, a tech, and often the pet owner, all talking in the same room. To turn that into a usable record you need to know who said what.&lt;/p&gt;

&lt;p&gt;That's &lt;a href="https://www.assemblyai.com/blog/what-is-speaker-diarization-and-how-does-it-work" rel="noopener noreferrer"&gt;speaker diarization&lt;/a&gt;. It separates the speakers so the owner's description of symptoms, the tech's readback, and the clinician's assessment land as distinct contributions instead of one undifferentiated wall of text. Combined with PII redaction for owner details, it gets you a transcript that's structured enough to actually build on.&lt;/p&gt;
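
&lt;p&gt;As a rough sketch of what that looks like downstream, the snippet below collapses diarized utterances into attributed turns. The record shape (a dict with &lt;code&gt;speaker&lt;/code&gt; and &lt;code&gt;text&lt;/code&gt;), the &lt;code&gt;to_turns&lt;/code&gt; helper, and the label-to-role mapping are all assumptions for illustration; diarization yields speaker labels, and deciding which label is the owner or the clinician is left to the application.&lt;/p&gt;

```python
from itertools import groupby

# Hypothetical diarized output: one record per utterance with an opaque label.
utterances = [
    {"speaker": "A", "text": "He's been limping since Sunday."},
    {"speaker": "A", "text": "Mostly the back left leg."},
    {"speaker": "B", "text": "Weight is 11.4 kilograms."},
    {"speaker": "C", "text": "Let's start carprofen twice daily."},
]

# Assumed mapping from labels to roles, supplied by the application.
roles = {"A": "Owner", "B": "Tech", "C": "Clinician"}


def to_turns(utterances, roles):
    """Collapse consecutive utterances from one speaker into a single turn."""
    turns = []
    for speaker, group in groupby(utterances, key=lambda u: u["speaker"]):
        text = " ".join(u["text"] for u in group)
        # Fall back to the raw label when no role has been assigned.
        turns.append((roles.get(speaker, f"Speaker {speaker}"), text))
    return turns


for role, text in to_turns(utterances, roles):
    print(f"{role}: {text}")
```

&lt;p&gt;Keeping the owner's report and the clinician's instruction as separate turns lets whatever structures the note treat them as different kinds of statement.&lt;/p&gt;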

&lt;h2&gt;
  
  
  &lt;strong&gt;The part most teams miss&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The interesting thing about veterinary transcription is that it's not actually a harder problem than human clinical transcription—it's a &lt;em&gt;different distribution&lt;/em&gt; of the same problem. The clinical-language backbone is shared. What changes is the long tail: the breeds, the species, the vet-specific drug names that no general training corpus weights heavily.&lt;/p&gt;

&lt;p&gt;And the long tail is the one part of speech recognition you can configure directly. You can't hand-tune how a model handles general English, but you can hand it the exact 200 terms your practice uses every day and have it prioritize them. That makes veterinary transcription one of the more &lt;em&gt;solvable&lt;/em&gt; specialized domains out there—not because the audio is easy, but because the hard part is a list you already know.&lt;/p&gt;

&lt;p&gt;Write down your formulary and your breed list. That's most of your accuracy problem solved before you transcribe a single second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get your free API key&lt;/strong&gt; and try it on your own exam audio: &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;sign up here&lt;/a&gt;, or browse the &lt;a href="https://www.assemblyai.com/docs" rel="noopener noreferrer"&gt;docs&lt;/a&gt; to see the full keyterms prompting reference.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Is there a veterinary-specific transcription model?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Not as a separate model—and you don't need one. Universal-3 Pro's entity accuracy plus Medical Mode covers the large overlap between human and veterinary clinical language, and keyterms prompting covers the species, breed, and vet-specific drug gaps. That combination handles veterinary audio without a dedicated model.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Does Medical Mode actually help with animal patients?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Yes, because veterinary pharmacology and clinical language overlap heavily with human medicine—shared drug classes, dosing language, and anatomical terms. Medical Mode is built on human clinical terminology, so it lifts accuracy on everything shared. Keyterms prompting handles the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How do I make sure carprofen, Cerenia, and breed names get transcribed correctly?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Load them into keyterms prompting—up to 1,000 terms on async, up to 100 on streaming, updatable mid-stream. It's the mechanism for custom and specialized vocabulary, and it's where you close the veterinary-specific gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Can it handle a noisy exam room with multiple people talking?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Yes. Speaker diarization separates the clinician, tech, and owner so the transcript attributes each statement to the right speaker, even in overlapping, noisy clinic-floor audio.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What does this cost to run?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Universal-3 Pro is $0.21/hr async. Medical Mode adds $0.15/hr ($0.36/hr combined). Keyterms prompting is a $0.05/hr add-on on async and free on streaming. Full details are on the &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;pricing page&lt;/a&gt;.&lt;/p&gt;
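
&lt;p&gt;For budgeting, the arithmetic can be sketched as below. The rates are the ones quoted in this post and may change, so treat the pricing page as the source of truth; the helper names and the 500-hour monthly volume are assumptions for illustration.&lt;/p&gt;

```python
from decimal import Decimal

# Hourly async rates in USD, as quoted in this post.
RATES = {
    "universal_3_pro": Decimal("0.21"),
    "medical_mode": Decimal("0.15"),
    "keyterms_async": Decimal("0.05"),
}


def hourly_rate(medical_mode=True, keyterms=True):
    """Base rate plus whichever add-ons are switched on."""
    rate = RATES["universal_3_pro"]
    if medical_mode:
        rate += RATES["medical_mode"]
    if keyterms:
        rate += RATES["keyterms_async"]
    return rate


def monthly_cost(audio_hours, **options):
    """Cost for a number of audio hours; pass hours as an int or a string."""
    return hourly_rate(**options) * Decimal(audio_hours)


full_rate = hourly_rate()       # base + Medical Mode + async keyterms
estimate = monthly_cost(500)    # assumed volume: 500 hours of exam audio
print(f"${full_rate}/hr, ${estimate} for 500 hours")
```

&lt;p&gt;Using &lt;code&gt;Decimal&lt;/code&gt; rather than floats keeps the cents exact when the rates are summed and multiplied.&lt;/p&gt;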

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>speechtotext</category>
      <category>api</category>
    </item>
    <item>
      <title>Building Behavioral Health Documentation Clinicians Trust</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:24:08 +0000</pubDate>
      <link>https://dev.to/martschweiger/building-behavioral-health-documentation-clinicians-trust-3j0f</link>
      <guid>https://dev.to/martschweiger/building-behavioral-health-documentation-clinicians-trust-3j0f</guid>
      <description>&lt;p&gt;Trust is a strange thing to engineer into a transcript.&lt;/p&gt;

&lt;p&gt;A clinician doesn't sit there grading your word error rate. They glance at the note, see that "sertraline" came through as "sertraline" and not "sir Tralee," and decide—usually in the first session—whether your product is something they can rely on or something they have to babysit. That decision is mostly subconscious, and it's almost always made on the words that carry clinical weight.&lt;/p&gt;

&lt;p&gt;So if you're a product manager or founder building a behavioral health scribe, the real question isn't "how accurate is the transcription?" It's "is it accurate on the things a clinician would notice being wrong?" Those are very different bars.&lt;/p&gt;

&lt;p&gt;We've written a &lt;a href="https://www.assemblyai.com/blog/how-to-build-ai-scribe-therapy" rel="noopener noreferrer"&gt;step-by-step tutorial on building an AI scribe for therapy sessions&lt;/a&gt;—the actual code, the upload-and-transcribe loop, the configuration. This post is the companion to that. Less about how to wire it up, more about what earns the clinician's trust once it's wired up. Three things, in my experience, do most of the work.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Accuracy on the words that actually matter&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Behavioral health vocabulary is unforgiving in a specific way. Psychiatric medication names are dense, similar-sounding, and dose-dependent. Sertraline, lamotrigine, quetiapine, bupropion—these aren't words a general speech-to-text model has heard a million times, and they sit next to numbers ("up to 200 milligrams," "split the 300 into two doses") where a single transposed entity changes the clinical meaning of the note.&lt;/p&gt;

&lt;p&gt;Here's the trap a lot of teams fall into. They benchmark on overall word error rate, see a number that looks great, and ship. But overall WER averages across "the," "and," "um," and "lamotrigine." A model can post a lovely aggregate number while quietly fumbling the one word per minute that a prescriber will actually catch.&lt;/p&gt;

&lt;p&gt;That's why we measure Medical Mode on Missed Entity Rate instead—how often the model drops or mangles a clinically meaningful entity like a drug, a dose, or a diagnosis. Medical Mode delivers a 3.2% MER, the lowest across the providers we benchmarked, and roughly 20% fewer missed medical entities than &lt;a href="https://www.assemblyai.com/universal-3-pro" rel="noopener noreferrer"&gt;Universal-3 Pro&lt;/a&gt; running on its own. You can see the full breakdown on our &lt;a href="https://www.assemblyai.com/benchmarks" rel="noopener noreferrer"&gt;benchmarks page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Turning it on is one parameter.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;assemblyai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;aai&lt;/span&gt;

&lt;span class="nx"&gt;transcriber&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Transcriber&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TranscriptionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="nx"&gt;speech_models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;universal-3-pro&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;medical-v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;transcriber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;clinical-audio.wav&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That domain="medical-v1" flag is the whole activation. It works on Universal-3 Pro for pre-recorded sessions and on &lt;a href="https://www.assemblyai.com/blog/universal-3-5-pro-realtime" rel="noopener noreferrer"&gt;Universal-3.5 Pro Realtime&lt;/a&gt; for live ones, so your documentation pipeline behaves the same whether you're processing a recorded intake overnight or transcribing a session as it happens.&lt;/p&gt;

&lt;p&gt;And when a practice uses regional brand names or a niche formulary that even a medical model wouldn't expect, &lt;a href="https://www.assemblyai.com/blog/medical-transcription-api" rel="noopener noreferrer"&gt;keyterms prompting&lt;/a&gt; lets you hand the model that vocabulary up front. Think of it as telling the model what to listen for before it ever hears the audio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to test accuracy on your own clinical audio?&lt;/strong&gt; &lt;a href="https://www.assemblyai.com/solutions/medical" rel="noopener noreferrer"&gt;&lt;strong&gt;Explore Voice AI for healthcare&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;or&lt;/strong&gt; &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;&lt;strong&gt;grab an API key&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Knowing who said what&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now here's where behavioral health diverges from almost every other medical transcription use case.&lt;/p&gt;

&lt;p&gt;A radiology dictation is one voice. A therapy session is at least two—and group, couples, and family sessions can be four, five, six people talking over each other, finishing each other's sentences, going quiet, coming back. If your transcript renders all of that as one undifferentiated wall of text, the clinical value collapses. A note that says "patient reported increased anxiety" is useless when there were three patients in the room and you can't tell which one said it.&lt;/p&gt;

&lt;p&gt;This is exactly what &lt;a href="https://www.assemblyai.com/blog/what-is-speaker-diarization-and-how-does-it-work" rel="noopener noreferrer"&gt;speaker diarization&lt;/a&gt; solves: segmenting the audio by who's speaking so each utterance is attributed to the right person. In behavioral health it's not a nice-to-have feature sitting next to accuracy. It's part of the accuracy. And on Universal-3.5 Pro Realtime, live diarization gets a second pass: the model labels speakers during the session, then re-clusters every voice when the stream ends and sends a single revision correcting any labels it now knows were wrong. The result is async-grade speaker accuracy within about half a second of the session ending, for up to 10 speakers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/customers/jotpsych" rel="noopener noreferrer"&gt;JotPsych&lt;/a&gt;, which builds documentation tooling for behavioral health, put it plainly. Jackson Bierfeldt, their Cofounder and CTO, told us: "In the medical context, accuracy is highly important…[and] there can be multiple people present. Separating them is key to accuracy. The biggest impact AssemblyAI has had has been in enabling our technical team to focus on workflow-specific features rather than a general speech-to-text pipeline."&lt;/p&gt;

&lt;p&gt;Read that last sentence again, because it's the part product leaders should care most about. JotPsych didn't just want accurate diarization; they wanted to stop building and maintaining a speech pipeline at all, so their engineers could spend their time on the things that actually differentiate a behavioral health product. The transcription layer should be something you configure, not something you staff a team around.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/solutions/medical" rel="noopener noreferrer"&gt;NovoPsych&lt;/a&gt;, another team building in behavioral health, is solving the same shape of problem—turning sensitive, multi-speaker clinical conversation into structured documentation that a clinician will sign their name to. When the words and the speakers are both right, the clinician's review time drops, and that's where the trust compounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Handling sensitive sessions with care&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's talk about privacy, because behavioral health is about the most sensitive category of data there is, and a clinician's trust evaporates fast if they suspect a session is being handled carelessly.&lt;/p&gt;

&lt;p&gt;I'll be direct about what this is and isn't. This isn't the headline of your product—no clinician chose your scribe because of an infrastructure diagram. But it's the floor underneath everything else, and if the floor isn't there, none of the accuracy matters.&lt;/p&gt;

&lt;p&gt;AssemblyAI enables covered entities and their business associates subject to HIPAA to use the AssemblyAI services to process protected health information (PHI). AssemblyAI is considered a business associate under HIPAA, and we offer a standard Business Associate Addendum (BAA) that is required under HIPAA to ensure that AssemblyAI appropriately safeguards PHI. The infrastructure is BAA-eligible, and the BAA is included.&lt;/p&gt;

&lt;p&gt;On the engineering side, you've got PII redaction built in. With redact_pii, you can strip identifying information out of transcripts—names, contact details, the kinds of identifiers that turn an ordinary transcript into something you have to lock down. For behavioral health, where a single transcript might name a patient, their family members, and their employer in the first two minutes, redaction is a practical tool, not a checkbox.&lt;/p&gt;

&lt;p&gt;The point is that data handling should be designed in from the first session, not retrofitted after your first enterprise customer asks the question in a security review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building for sensitive clinical settings?&lt;/strong&gt; &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;&lt;strong&gt;Start with a free API key&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;and review the&lt;/strong&gt; &lt;a href="https://www.assemblyai.com/docs" rel="noopener noreferrer"&gt;&lt;strong&gt;docs&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The trust isn't in the transcript&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the thing I'd leave you with, and it's the part that's easy to miss when you're heads-down on accuracy metrics.&lt;/p&gt;

&lt;p&gt;Clinician trust isn't earned in the moment the transcript is correct. It's earned in the moment the clinician stops checking. The first few sessions, they read every line against their memory of what happened. Then one day they skim, sign, and move on—because the medication names have been right, the speakers have been attributed correctly, and nothing has surprised them. That shift from auditing to trusting is the entire ballgame, and it only happens if the foundation is boring and reliable across hundreds of sessions, not impressive in a demo.&lt;/p&gt;

&lt;p&gt;Build for the hundredth session, not the first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See what Medical Mode does on your audio—&lt;/strong&gt;&lt;a href="https://www.assemblyai.com/solutions/medical" rel="noopener noreferrer"&gt;&lt;strong&gt;explore Voice AI for healthcare&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;or&lt;/strong&gt; &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;&lt;strong&gt;get your API key&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How accurate is Medical Mode on psychiatric medication names?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Medical Mode posts a 3.2% Missed Entity Rate—the lowest across the providers we benchmarked—and around 20% fewer missed medical entities than Universal-3 Pro on its own. Because MER measures clinically meaningful entities specifically, including drug names and dosages, it's a better proxy for behavioral health accuracy than overall word error rate. The full numbers live on our &lt;a href="https://www.assemblyai.com/benchmarks" rel="noopener noreferrer"&gt;benchmarks page&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can it separate speakers in group or family therapy sessions?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes. Speaker diarization segments the audio by speaker and attributes each utterance to the right person, which is what makes a multi-party session usable as documentation rather than a single block of text. On Universal-3.5 Pro Realtime, live labels are refined by an end-of-stream revision for async-grade speaker accuracy, up to 10 speakers. It's the capability JotPsych called out as key to accuracy in the medical context.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Does it work for live sessions, or only recorded ones?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Both. Medical Mode runs on Universal-3 Pro for pre-recorded audio and on Universal-3.5 Pro Realtime for live transcription, so you can build the same accuracy into a real-time scribe and an after-hours batch pipeline with the same one-parameter activation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do you handle PHI and HIPAA?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;AssemblyAI is considered a business associate under HIPAA and offers a standard Business Associate Addendum, which is required under HIPAA, on BAA-eligible infrastructure. You can also use redact_pii to strip identifying information from transcripts as part of your data handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What does Medical Mode cost?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Medical Mode adds $0.15/hr on top of the base model. Universal-3 Pro with Medical Mode comes to $0.36/hr, and Universal-3.5 Pro Realtime with Medical Mode is $0.60/hr. Full details are on the &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;pricing page&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>healthcare</category>
      <category>machinelearning</category>
      <category>speechtotext</category>
    </item>
    <item>
      <title>Medical Transcription in Spanish, German, and French</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Tue, 23 Jun 2026 21:24:01 +0000</pubDate>
      <link>https://dev.to/martschweiger/medical-transcription-in-spanish-german-and-french-18no</link>
      <guid>https://dev.to/martschweiger/medical-transcription-in-spanish-german-and-french-18no</guid>
      <description>&lt;p&gt;Most multilingual transcription stories are about coverage. How many languages does the model support? Forty? A hundred?&lt;/p&gt;

&lt;p&gt;That's the wrong question for medicine.&lt;/p&gt;

&lt;p&gt;In a real clinic, the hard problem isn't supporting French. It's the moment a Spanish-speaking patient in a US clinic says "me duele el pecho, like a pressure" in one breath, and your transcript has to get both halves right—the Spanish symptom and the English qualifier—without anyone touching a language setting. Coverage counts languages. Clinical accuracy counts the words inside a single messy sentence. Those are not the same achievement.&lt;/p&gt;

&lt;p&gt;Medical Mode launches benchmarked across four languages—English, Spanish, German, and French—for both pre-recorded and streaming audio. But the part I want to spend this post on is the part that actually breaks competing systems: what happens at the seam between languages.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why multilingual medical transcription is genuinely hard&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's be honest about the difficulty, because it's easy to wave at "multilingual support" as if it were a switch you flip.&lt;/p&gt;

&lt;p&gt;First, medical vocabulary is treacherous across languages precisely because so much of it shares Latin and Greek roots. "Hypertension," "hipertensión," "Hypertonie," and "hypertension" look and sound like cousins, and that similarity is a trap, not a help. A model has to map each one to the right clinical entity in the right language, not blur them into an average. Near-identical isn't identical, and in a medication or diagnosis field, close is wrong.&lt;/p&gt;

&lt;p&gt;Second, accents. A francophone clinician in Montreal, a Swabian physician, and a Madrid pharmacist don't pronounce their own languages the way a textbook does, let alone the way a model's training distribution assumes.&lt;/p&gt;

&lt;p&gt;Third—and this is the one that quietly wrecks deployments—real clinical encounters are mixed-language. US Hispanic care, the DACH region with its English-heavy medical training, francophone clinics with English drug brand names. People code-switch. They start a sentence in one language and finish it in another, drop an English drug name into a Spanish sentence, or answer a German question in English because that's how they learned the term.&lt;/p&gt;

&lt;p&gt;A pipeline built on "detect the language, then transcribe in that language" has no good answer here. By the time it's committed to Spanish, the English half of the sentence is already a casualty.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Code-switching is the whole differentiator&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So here's where it gets interesting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/universal-3-pro" rel="noopener noreferrer"&gt;Universal-3 Pro&lt;/a&gt; handles native intra-utterance code-switching across all six of its languages—English, Spanish, French, German, Italian, and Portuguese. Intra-utterance is the operative phrase. Not "we detect a language per file," and not even "per sentence." Within a single utterance, the model can follow a speaker as they move between languages and keep transcribing accurately the entire way through.&lt;/p&gt;

&lt;p&gt;For live audio, &lt;a href="https://www.assemblyai.com/blog/universal-3-5-pro-realtime" rel="noopener noreferrer"&gt;Universal-3.5 Pro Realtime&lt;/a&gt; extends that same native code-switching to 18 languages—so a streaming clinical encounter that mixes languages mid-sentence holds the thread across a much wider range. Medical Mode itself is benchmarked across the four launch languages (English, Spanish, German, and French), but the underlying code-switching the model relies on is built in across all of them.&lt;/p&gt;

&lt;p&gt;No language toggle. No separate pipeline per language. No detect-then-route logic you have to build and maintain.&lt;/p&gt;

&lt;p&gt;That means the Spanglish sentence, "me duele el pecho, like a pressure," comes through intact, both halves, because the model was never forced to pick a lane. The clinician who slips into English for the drug name and back into French for the symptom doesn't break the transcript. The patient who answers in their first language mid-question doesn't either.&lt;/p&gt;

&lt;p&gt;We went deeper on how this works in our &lt;a href="https://www.assemblyai.com/blog/multilingual-speech-to-text-api-universal-3-pro" rel="noopener noreferrer"&gt;piece on multilingual streaming and native code-switching&lt;/a&gt;, and it's worth reading if you're serving any population where two languages share a room.&lt;/p&gt;

&lt;p&gt;The practical upshot for product teams: you build one integration. Not one per market.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Where the accuracy actually shows up&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Code-switching is the headline, but it only matters if the underlying transcription is clinically accurate, so let's tie it back to numbers.&lt;/p&gt;

&lt;p&gt;Medical Mode delivers a 3.2% Missed Entity Rate—the lowest across the providers we benchmarked—and roughly 20% fewer missed medical entities than Universal-3 Pro running on its own. MER is the right yardstick here because it measures what a clinician actually cares about: how often a clinically meaningful entity, a drug or a dose or a diagnosis, gets dropped or mangled. And critically, Medical Mode is benchmarked across all four launch languages, so that 3.2% isn't an English-only figure dressed up as multilingual. See the full breakdown on the &lt;a href="https://www.assemblyai.com/benchmarks" rel="noopener noreferrer"&gt;benchmarks page&lt;/a&gt;.&lt;/p&gt;
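
&lt;p&gt;To make the metric concrete, here is a minimal Python sketch of a missed-entity calculation. It assumes MER is simply the share of reference entities that do not appear verbatim in the transcript. The function name, the case-insensitive substring match, and the sample sentence are illustrative assumptions, not the methodology behind the published benchmark.&lt;/p&gt;

```python
def missed_entity_rate(reference_entities, transcript):
    """Fraction of reference entities that never appear in the transcript.

    Simplified assumption: an entity counts as captured only if its exact
    text appears (case-insensitively) somewhere in the transcript.
    """
    if not reference_entities:
        return 0.0
    text = transcript.lower()
    missed = [e for e in reference_entities if e.lower() not in text]
    return len(missed) / len(reference_entities)


# Hypothetical example: four clinically meaningful entities, one mangled.
entities = ["sertraline", "50 mg", "lamotrigine", "generalized anxiety disorder"]
hypothesis = (
    "Patient continues sertraline 50 mg daily, started lamotrigene, "
    "and reports symptoms of generalized anxiety disorder."
)
rate = missed_entity_rate(entities, hypothesis)
print(f"Missed entity rate: {rate:.0%}")
```

&lt;p&gt;In this made-up sample the misspelled drug name is the only miss, so one of four entities is lost even though nearly every other word is correct. That gap between word-level and entity-level accuracy is the reason to track the two separately.&lt;/p&gt;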

&lt;p&gt;Activation is the same single parameter in every language.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;assemblyai&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;aai&lt;/span&gt;

&lt;span class="nx"&gt;transcriber&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Transcriber&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TranscriptionConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;speech_models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;universal-3-pro&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;medical-v1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;transcriber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;clinical-audio.wav&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can let the model auto-detect language, but you don't need a language toggle to get code-switching—it's native. The domain="medical-v1" flag is identical whether the audio is Spanish, German, French, English, or a mix of them in one recording.&lt;/p&gt;

&lt;p&gt;This runs the same way for both modes. Pre-recorded audio goes through Universal-3 Pro at $0.21/hr, and live audio runs on &lt;a href="https://www.assemblyai.com/blog/universal-3-5-pro-realtime" rel="noopener noreferrer"&gt;Universal-3.5 Pro Realtime&lt;/a&gt; at $0.45/hr base, with end-of-turn detection landing around 300ms. Medical Mode adds $0.15/hr in either case, so Universal-3 Pro plus Medical Mode is $0.36/hr, and the streaming variant plus Medical Mode is $0.60/hr. The full table is on the &lt;a href="https://www.assemblyai.com/pricing" rel="noopener noreferrer"&gt;pricing page&lt;/a&gt;.&lt;/p&gt;
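
&lt;p&gt;If you want to sanity-check a budget, the arithmetic is easy to script. The sketch below uses only the hourly rates quoted in this post. The dictionary keys and helper names are hypothetical labels for illustration, not API identifiers, and the pricing page remains the source of truth.&lt;/p&gt;

```python
from decimal import Decimal

# Hourly rates quoted in this post (USD per audio hour).
# The keys are illustrative labels, not guaranteed API identifiers.
BASE_RATES = {
    "universal-3-pro": Decimal("0.21"),             # pre-recorded
    "universal-3.5-pro-realtime": Decimal("0.45"),  # streaming
}
MEDICAL_MODE_ADDON = Decimal("0.15")


def hourly_rate(model, medical_mode=True):
    """Return the per-hour price for a model, with or without Medical Mode."""
    rate = BASE_RATES[model]
    return rate + MEDICAL_MODE_ADDON if medical_mode else rate


def estimate_cost(model, audio_hours, medical_mode=True):
    """Estimate spend for a given number of audio hours (hypothetical helper)."""
    return hourly_rate(model, medical_mode) * Decimal(str(audio_hours))


print(hourly_rate("universal-3-pro"))             # 0.36
print(hourly_rate("universal-3.5-pro-realtime"))  # 0.60
print(estimate_cost("universal-3-pro", 1000))     # 360.00
```

&lt;p&gt;Decimal is used instead of float so the totals match the quoted prices exactly rather than picking up binary rounding noise.&lt;/p&gt;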

&lt;p&gt;&lt;strong&gt;Want to hear it handle your languages?&lt;/strong&gt; &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;&lt;strong&gt;Get a free API key&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;or try it in the&lt;/strong&gt; &lt;a href="https://www.assemblyai.com/playground" rel="noopener noreferrer"&gt;&lt;strong&gt;playground&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Region-specific drug names and keyterms prompting&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One more wrinkle that matters specifically for multilingual healthcare: drug brand names don't translate, they vary by market.&lt;/p&gt;

&lt;p&gt;The same molecule ships under different brand names in the US, Germany, France, and Spain. A general medical model can't possibly anticipate every regional formulary. So when you're deploying into the DACH region or francophone care, &lt;a href="https://www.assemblyai.com/blog/medical-transcription-api" rel="noopener noreferrer"&gt;keyterms prompting&lt;/a&gt; lets you hand the model the specific brand names, local terminology, and formulary your clinicians actually use—before it processes a single second of audio. It's the difference between a model that knows the active ingredient and one that also knows what your pharmacist calls it.&lt;/p&gt;

&lt;p&gt;Pair that with diarization for multi-speaker encounters and redact_pii for sensitive data, and you've got a clinical pipeline that holds up across markets.&lt;/p&gt;
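
&lt;p&gt;As a rough sketch of how those pieces could sit together in one request, the snippet below builds a plain options dictionary. Only domain and redact_pii are named in this post. The keyterms_prompt and speaker_labels keys are assumed parameter names, and the brand names are placeholders, so check the API reference before passing anything like this to the SDK.&lt;/p&gt;

```python
def build_clinical_options(brand_names, multi_speaker=True, redact=True):
    """Assemble transcription options for a clinical recording.

    Only "domain" and "redact_pii" come from this post. The other keys
    are assumed names and should be verified against the API docs.
    """
    options = {
        "domain": "medical-v1",           # Medical Mode activation (from the post)
        "redact_pii": redact,             # PII redaction (named in the post)
        "speaker_labels": multi_speaker,  # assumed name for diarization
    }
    if brand_names:
        # Assumed name for keyterms prompting; de-duplicate, keep order.
        options["keyterms_prompt"] = list(dict.fromkeys(brand_names))
    return options


# Placeholder regional brand names for a German-speaking deployment.
options = build_clinical_options(["Beloc-Zok", "Novalgin", "Beloc-Zok"])
print(options)
```

&lt;p&gt;Keeping the market-specific vocabulary in a list that is passed in, rather than hard-coded, means a new region is a data change and not a new code path.&lt;/p&gt;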

&lt;p&gt;On the privacy side: AssemblyAI enables covered entities and their business associates subject to HIPAA to use the AssemblyAI services to process protected health information (PHI). AssemblyAI is considered a business associate under HIPAA, and we offer a standard Business Associate Addendum (BAA) that is required under HIPAA to ensure that AssemblyAI appropriately safeguards PHI—on BAA-eligible infrastructure, with the BAA included.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The integration you don't have to fork&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's the insight worth sitting with if you're choosing an approach for a multilingual patient population.&lt;/p&gt;

&lt;p&gt;The hidden cost of "supports N languages" architectures isn't accuracy—it's branching. A detect-then-route pipeline forces you to maintain a code path per language, test each one, and handle the seams between them, and those seams are exactly where mixed-language encounters live. Native code-switching collapses that whole tree into one path. You're not just getting a better transcript on Spanglish; you're getting an integration that doesn't fork every time you enter a new market.&lt;/p&gt;

&lt;p&gt;Build one pipeline, serve every patient who walks in, regardless of which language they reach for mid-sentence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to build it?&lt;/strong&gt; &lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;&lt;strong&gt;Get your free API key&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;or test your own clinical audio in the&lt;/strong&gt; &lt;a href="https://www.assemblyai.com/playground" rel="noopener noreferrer"&gt;&lt;strong&gt;playground&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently asked questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Which languages does Medical Mode support?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Medical Mode is benchmarked across four languages at launch—English, Spanish, German, and French—for both pre-recorded and streaming audio. The underlying Universal-3 Pro model supports six languages for pre-recorded audio (adding Italian and Portuguese) with native code-switching, and Universal-3.5 Pro Realtime extends native code-switching to 18 languages for live audio.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is intra-utterance code-switching, and why does it matter for clinics?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;It means the model can follow a speaker who changes languages within a single utterance—not just per file or per sentence. In real clinical settings where patients and clinicians mix languages mid-thought, this keeps the transcript intact without a language toggle or a separate pipeline per language.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How accurate is multilingual medical transcription?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Medical Mode posts a 3.2% Missed Entity Rate, the lowest across the providers we benchmarked, with around 20% fewer missed medical entities than Universal-3 Pro alone—and it's benchmarked across all four launch languages, not English only. Details are on the &lt;a href="https://www.assemblyai.com/benchmarks" rel="noopener noreferrer"&gt;benchmarks page&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do I handle region-specific drug brand names?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use keyterms prompting to supply the specific brand names, local terminology, and formulary your clinicians use before transcription runs. This is especially useful across markets where the same molecule ships under different brand names.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Does code-switching require a different setup than single-language transcription?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;No. You activate Medical Mode with the single domain="medical-v1" parameter, and code-switching is native—there's no toggle to enable and no separate multilingual pipeline to configure. On streaming, if a session is genuinely monolingual you can pass language_code to bias the model toward one language; leave it off to keep native code-switching.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>healthcare</category>
      <category>speechtotext</category>
    </item>
    <item>
      <title>Best platforms for enterprise voice agents</title>
      <dc:creator>Mart Schweiger</dc:creator>
      <pubDate>Mon, 15 Jun 2026 16:25:04 +0000</pubDate>
      <link>https://dev.to/martschweiger/best-platforms-for-enterprise-voice-agents-469h</link>
      <guid>https://dev.to/martschweiger/best-platforms-for-enterprise-voice-agents-469h</guid>
      <description>&lt;p&gt;Every voice agent demo looks good. A founder opens a laptop, speaks a few sentences, and the agent responds with something reasonable in under a second. The room nods. But here's where it gets interesting: that same agent needs to handle 10,000 concurrent calls on a Tuesday morning when half your customer base decides to check their account balance at once. It needs to correctly capture email addresses, prescription numbers, and policy IDs—not approximate them. And it needs to do all of this while meeting the compliance requirements your legal team won't budge on.&lt;/p&gt;

&lt;p&gt;The gap between a working demo and a production &lt;a href="https://www.assemblyai.com/blog/ai-voice-agents" rel="noopener noreferrer"&gt;voice agent&lt;/a&gt; that enterprises actually depend on is enormous. Choosing the wrong platform means months of integration work followed by accuracy problems that surface only after you've committed.&lt;/p&gt;

&lt;p&gt;So what should enterprise teams actually look for? And which platforms deliver? Let's break it down.&lt;/p&gt;

&lt;h2&gt;
  
  
  What enterprise voice agents actually need
&lt;/h2&gt;

&lt;p&gt;Most platform comparison guides focus on surface-level features: language count, voice selection, basic latency numbers. That's table stakes. Enterprise voice agents have a different set of requirements that only become obvious once you're building for production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Speech accuracy on entities—not just words
&lt;/h3&gt;

&lt;p&gt;Overall word accuracy matters, but it's not the whole picture. Voice agents act on specific pieces of information: account numbers, email addresses, phone numbers, medication names, confirmation codes. If the speech-to-text layer gets "RX-7704132" wrong, the LLM downstream acts on bad data. The agent doesn't just mishear—it takes the wrong action. Missed entity rate is the metric that actually predicts whether your agent completes tasks or frustrates customers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sub-second latency that holds under load
&lt;/h3&gt;

&lt;p&gt;A one-second response time in a controlled demo is easy. Maintaining that latency at scale—with thousands of concurrent sessions, tool calls mid-conversation, and noisy telephony audio—is a different engineering problem entirely. Ask any platform what their P95 latency looks like at peak concurrency, not just their best-case P50.&lt;/p&gt;

&lt;h3&gt;
  
  
  Turn detection that works in real conversations
&lt;/h3&gt;

&lt;p&gt;This is the thing most teams underestimate. Basic voice activity detection (VAD) uses silence thresholds to decide when someone's done talking. But people pause mid-sentence. They say "um" while thinking. They say "uh-huh" to acknowledge without wanting to interrupt. If your &lt;a href="https://www.assemblyai.com/blog/voice-agent-architecture" rel="noopener noreferrer"&gt;voice agent architecture&lt;/a&gt; can't distinguish between a thinking pause and a completed turn, every conversation will feel awkward—and your users will notice immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Compliance that's actually enforceable
&lt;/h3&gt;

&lt;p&gt;For &lt;a href="https://www.assemblyai.com/enterprise" rel="noopener noreferrer"&gt;enterprise&lt;/a&gt; deployments, SOC 2 Type 2 certification is the baseline. If you're in healthcare, you need a Business Associate Addendum (BAA) in place before any patient data touches the platform. If you're in finance, you need audit trails and data retention controls. Don't settle for "we're working on it"—ask for the documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability without concurrency ceilings
&lt;/h3&gt;

&lt;p&gt;Some platforms cap concurrent sessions or require concurrency commitments before you can scale. That's fine for a pilot. It's a problem when your &lt;a href="https://www.assemblyai.com/solutions/contact-centers" rel="noopener noreferrer"&gt;contact center&lt;/a&gt; traffic spikes 3x on the first of the month and you're hitting rate limits. True enterprise platforms autoscale without renegotiation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mid-conversation flexibility
&lt;/h3&gt;

&lt;p&gt;Production voice agents aren't static. You need to update the system prompt when a customer provides their account type. You need to swap tools when the conversation shifts from billing to technical support. You need to adjust VAD sensitivity when a caller is in a noisy environment. The ability to change configuration mid-session—without dropping the connection—separates infrastructure built for real use cases from platforms built for demos.&lt;/p&gt;

&lt;p&gt;Build enterprise voice agents on the most accurate speech foundation&lt;/p&gt;

&lt;p&gt;AssemblyAI's Voice Agent API handles STT, LLM, and TTS in a single WebSocket connection. $4.50/hr flat rate, no concurrency caps, SOC 2 Type 2 certified.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;Sign up free&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The major platforms compared
&lt;/h2&gt;

&lt;p&gt;There are several serious options for building enterprise voice agents in 2026. Here's an honest look at how they stack up—what each does well, where they fall short, and which use cases they're best suited for.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Word accuracy&lt;/th&gt;
&lt;th&gt;Missed entity rate&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;th&gt;Turn detection&lt;/th&gt;
&lt;th&gt;Concurrency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AssemblyAI Voice Agent API&lt;/td&gt;
&lt;td&gt;Cascaded (STT + LLM + TTS, single WebSocket)&lt;/td&gt;
&lt;td&gt;94.07%&lt;/td&gt;
&lt;td&gt;16.7%&lt;/td&gt;
&lt;td&gt;$4.50/hr flat&lt;/td&gt;
&lt;td&gt;Semantic + neural network + VAD&lt;/td&gt;
&lt;td&gt;No concurrency caps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AssemblyAI Universal-3 Pro Streaming (BYO stack)&lt;/td&gt;
&lt;td&gt;Standalone STT (bring your own LLM + TTS)&lt;/td&gt;
&lt;td&gt;94.07%&lt;/td&gt;
&lt;td&gt;16.7%&lt;/td&gt;
&lt;td&gt;$0.45/hr (STT only)&lt;/td&gt;
&lt;td&gt;Provided by your orchestrator&lt;/td&gt;
&lt;td&gt;Unlimited, autoscaling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Realtime API&lt;/td&gt;
&lt;td&gt;Native multimodal (GPT-4o)&lt;/td&gt;
&lt;td&gt;93.13%&lt;/td&gt;
&lt;td&gt;23.3%&lt;/td&gt;
&lt;td&gt;~$18/hr, per-token billing&lt;/td&gt;
&lt;td&gt;Basic VAD&lt;/td&gt;
&lt;td&gt;99+ languages; complex event surface&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepgram Voice Agent API&lt;/td&gt;
&lt;td&gt;Cascaded (Nova-3)&lt;/td&gt;
&lt;td&gt;92.10%&lt;/td&gt;
&lt;td&gt;25.5%&lt;/td&gt;
&lt;td&gt;~$4.50/hr, concurrency-metered&lt;/td&gt;
&lt;td&gt;Traditional VAD&lt;/td&gt;
&lt;td&gt;Requires concurrency tier commitments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElevenLabs Conversational AI&lt;/td&gt;
&lt;td&gt;TTS-first conversational stack&lt;/td&gt;
&lt;td&gt;Not published&lt;/td&gt;
&lt;td&gt;&amp;gt;25.2%&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;Standard VAD&lt;/td&gt;
&lt;td&gt;30-agent concurrency cap&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  AssemblyAI Voice Agent API
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's &lt;a href="https://www.assemblyai.com/products/voice-agent-api" rel="noopener noreferrer"&gt;Voice Agent API&lt;/a&gt; takes a cascaded architecture approach—dedicated best-in-class models for STT, LLM, and TTS, exposed through a single WebSocket. You stream audio in, you get audio back. One connection, one bill.&lt;/p&gt;

&lt;p&gt;The speech understanding layer is built on &lt;a href="https://www.assemblyai.com/universal-3-pro-streaming" rel="noopener noreferrer"&gt;Universal-3 Pro Streaming&lt;/a&gt;, which ranks #1 on the Hugging Face Open ASR Leaderboard. In benchmarks, it achieves 94.07% word accuracy with a 16.7% missed entity rate on names, emails, phone numbers, and credit card numbers. That entity accuracy gap is significant—we'll get into why shortly.&lt;/p&gt;

&lt;p&gt;The developer experience is notably simple. It's a standard JSON API over WebSocket with no proprietary SDK required. Most developers get a working agent running the same afternoon. But simple doesn't mean limited: you get speech-aware turn detection (semantic + neural network + VAD), tool calling via JSON Schema, mid-session updates to prompt, voice, tools, and VAD settings, and 30-second session resumption if the WebSocket drops.&lt;/p&gt;
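
&lt;p&gt;To make "tool calling via JSON Schema" concrete, here is a minimal sketch of what a tool definition might look like. The tool name &lt;code&gt;lookup_order&lt;/code&gt;, its fields, and the surrounding envelope are hypothetical and not taken from AssemblyAI's documentation. Only the JSON Schema keywords themselves (&lt;code&gt;type&lt;/code&gt;, &lt;code&gt;properties&lt;/code&gt;, &lt;code&gt;required&lt;/code&gt;) are standard.&lt;/p&gt;

```python
import json

# Hypothetical tool definition. The "name"/"description"/"parameters" envelope
# is an assumption for illustration; check the API reference for the real shape.
LOOKUP_ORDER_TOOL = {
    "name": "lookup_order",
    "description": "Look up an order by its ID and the caller's email.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "email": {"type": "string"},
        },
        "required": ["order_id", "email"],
    },
}


def missing_required(tool, arguments):
    """Return the required parameter names absent from a tool call's arguments."""
    required = tool["parameters"].get("required", [])
    return [name for name in required if name not in arguments]


# The definition serializes to plain JSON, so it can travel over a WebSocket.
payload = json.dumps(LOOKUP_ORDER_TOOL)

# A call that omits "email" should be rejected before any side effect runs.
print(missing_required(LOOKUP_ORDER_TOOL, {"order_id": "A-1001"}))
```

&lt;p&gt;Checking required arguments before executing a tool is a cheap guard: if the transcript never yielded an email, the agent can ask again instead of acting on incomplete input.&lt;/p&gt;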

&lt;p&gt;Pricing is a flat $4.50/hr covering the entire pipeline. No token math, no separate input/output charges. Six languages currently supported: English, Spanish, French, German, Italian, and Portuguese. On compliance, AssemblyAI holds SOC 2 Type 2 and ISO 27001 certifications, with a BAA available for healthcare use cases. &lt;a href="https://www.assemblyai.com/medical-mode" rel="noopener noreferrer"&gt;Medical Mode&lt;/a&gt; is also available for improved accuracy on clinical terminology.&lt;/p&gt;

&lt;h3&gt;
  
  
  AssemblyAI Universal-3 Pro Streaming (bring-your-own-stack)
&lt;/h3&gt;

&lt;p&gt;Not every team wants a managed pipeline. If you've already built an orchestration layer with LiveKit, Pipecat, or Vapi, you might want the best possible &lt;a href="https://www.assemblyai.com/products/streaming-speech-to-text" rel="noopener noreferrer"&gt;streaming speech-to-text&lt;/a&gt; without replacing the rest of your stack.&lt;/p&gt;

&lt;p&gt;That's where Universal-3 Pro Streaming as a standalone STT API comes in. Same speech model, same accuracy, $0.45/hr for transcription only. Unlimited concurrent streams with autoscaling—no concurrency commitments required. It drops into existing pipelines as the STT layer, giving you full control over your LLM and TTS choices.&lt;/p&gt;

&lt;p&gt;This is the best option for teams that already have an orchestrator and want to upgrade their speech understanding without rearchitecting everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAI Realtime API
&lt;/h3&gt;

&lt;p&gt;OpenAI takes a fundamentally different approach. Their Realtime API uses GPT-4o as a native multimodal model that handles audio as one of many modalities—text, images, video, and voice. It's not a pipeline designed specifically for speech conversations.&lt;/p&gt;

&lt;p&gt;The upside is broad language support (99+ languages) and the ability to handle multimodal inputs. For prototyping conversational experiences, it's fast to get started.&lt;/p&gt;

&lt;p&gt;The downsides become clear at enterprise scale. Pricing runs approximately $18/hr with per-token billing, roughly 4x the price of the $4.50/hr cascaded alternatives. Word accuracy sits at 93.13% with a 23.3% missed entity rate, which reflects the trade-off of using a generalist model for speech-specific tasks. Turn detection relies on basic VAD rather than semantic understanding, which means more awkward interruptions in real conversations. And the developer surface area is complex: 30+ event types to handle, compared to a handful with purpose-built APIs.&lt;/p&gt;

&lt;p&gt;OpenAI Realtime is a strong choice for demos, browser-first apps, and multilingual prototyping. But at enterprise scale, the cost and accuracy trade-offs add up quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deepgram Voice Agent API
&lt;/h3&gt;

&lt;p&gt;Deepgram also uses a cascaded architecture, similar to AssemblyAI. Their voice agent offering runs approximately $4.50/hr but uses concurrency-metered billing—you'll need to commit to concurrency tiers as you scale.&lt;/p&gt;

&lt;p&gt;Nova-3 achieves 92.10% word accuracy with a 25.5% missed entity rate, a meaningful gap compared to AssemblyAI's 16.7%. Turn detection relies on traditional VAD without the semantic layer that distinguishes thinking pauses from completed turns. Mid-session updates are limited to prompt and voice. For enterprise deployments where &lt;a href="https://www.assemblyai.com/blog/assemblyai-vs-deepgram-best-voice-agent-api" rel="noopener noreferrer"&gt;accuracy on real-world data&lt;/a&gt; directly impacts outcomes, those gaps add up.&lt;/p&gt;

&lt;h3&gt;
  
  
  ElevenLabs Conversational AI
&lt;/h3&gt;

&lt;p&gt;ElevenLabs built its reputation on voice synthesis, and their TTS remains excellent. Their Conversational AI product extends that focus into the voice agent space.&lt;/p&gt;

&lt;p&gt;The enterprise limitation is the 30-agent concurrency cap. For contact centers or any high-volume deployment, that ceiling blocks scaling beyond 30 concurrent agents. ElevenLabs started as a TTS company, and its speech understanding trails purpose-built STT providers, with a missed entity rate above 25.2%. For applications where voice output quality is the primary differentiator and concurrency demands are modest, it's a solid fit. For enterprise voice agents that need to scale, the constraints are significant.&lt;/p&gt;

&lt;p&gt;Test voice agent accuracy on your own audio&lt;/p&gt;

&lt;p&gt;See how Universal-3 Pro Streaming captures entities, handles accents, and detects turns. Compare the results to what you're getting from your current provider.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/playground" rel="noopener noreferrer"&gt;Try playground&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why speech accuracy is the deciding factor for enterprise
&lt;/h2&gt;

&lt;p&gt;Here's the thing most teams don't realize until they're deep into production: transcription errors don't just reduce transcript quality. They cascade through the entire voice agent pipeline.&lt;/p&gt;

&lt;p&gt;When your STT layer gets an email address wrong, the LLM doesn't know it's wrong. It processes the incorrect email as if it's fact, then takes action on it—sending a confirmation to the wrong address, looking up the wrong account, or storing incorrect contact information in your CRM. The agent didn't "make a mistake"—it did exactly what it was told with bad input.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.assemblyai.com/blog/choosing-a-stt-api-for-voice-agents" rel="noopener noreferrer"&gt;STT model you choose&lt;/a&gt; determines your agent's effective intelligence. A brilliant LLM fed wrong data will confidently do the wrong thing.&lt;/p&gt;

&lt;p&gt;Consider a pharmacy refill scenario. A caller provides their prescription number, date of birth, medication name, dosage, address, and phone number in a single conversation. AssemblyAI's Voice Agent API correctly transcribes "RX-7704132" and "Metoprolol 80mg" while formatting the date of birth, address, and phone number accurately. In the same scenario, Deepgram's transcription produces "dash seven seven zero four one three two" without the RX prefix, drops date formatting, and garbles the medication dosage format.&lt;/p&gt;

&lt;p&gt;That's not a cherry-picked example—it reflects the systematic accuracy advantage that comes from building an entire pipeline around a purpose-built speech model. When you're processing thousands of these conversations daily, the difference between a 16.7% missed entity rate and a 25.5% missed entity rate translates directly into failed tasks, repeat calls, and frustrated customers.&lt;/p&gt;
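
&lt;p&gt;A rough back-of-the-envelope sketch shows how those two rates compound. The missed entity rates come from the comparison above. The daily volume is an assumed figure, and treating misses as independent of each other is a simplification, so read the output as an order-of-magnitude estimate rather than a benchmark result.&lt;/p&gt;

```python
# Missed entity rates quoted in this article.
MISSED_ENTITY_RATE = {"AssemblyAI": 0.167, "Deepgram": 0.255}


def expected_missed(rate, entities):
    """Expected number of missed entities out of `entities` spoken."""
    return rate * entities


def all_entities_captured(rate, entities_per_call):
    """Chance that every entity in one call is captured.

    Assumes misses are independent of each other, which is a simplification.
    """
    return (1.0 - rate) ** entities_per_call


ENTITIES_PER_DAY = 1000   # assumed volume, for illustration only
ENTITIES_PER_CALL = 6     # the pharmacy example names six pieces of information

for provider, rate in MISSED_ENTITY_RATE.items():
    missed = expected_missed(rate, ENTITIES_PER_DAY)
    clean = all_entities_captured(rate, ENTITIES_PER_CALL)
    print(f"{provider}: ~{missed:.0f} missed per {ENTITIES_PER_DAY}, "
          f"{clean:.0%} of calls fully clean")
```

&lt;p&gt;Under these assumptions the gap works out to about 88 additional missed entities per 1,000 spoken, and it widens further once a single call has to get several entities right in a row.&lt;/p&gt;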

&lt;p&gt;In our Voice Agent Report, 76% of respondents rated speech-to-text accuracy as the single most important factor when building voice agents—above latency, cost, and integration capabilities. The data backs up what builders already know intuitively: if the agent can't hear correctly, nothing else matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two-path approach
&lt;/h2&gt;

&lt;p&gt;One thing that sets AssemblyAI apart in the &lt;a href="https://www.assemblyai.com/solutions/voice-agents" rel="noopener noreferrer"&gt;voice agent solutions&lt;/a&gt; space is that it offers two distinct paths to the same speech accuracy foundation.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.assemblyai.com/blog/the-voice-ai-stack-for-building-agents" rel="noopener noreferrer"&gt;Voice Agent API&lt;/a&gt; is the fastest path to production. One WebSocket, one bill, working agent in an afternoon. It's purpose-built for teams that want to focus on their product logic rather than managing voice infrastructure. You write the system prompt, register your tools, and ship.&lt;/p&gt;

&lt;p&gt;Universal-3 Pro Streaming as a standalone STT API is for teams that want full architectural control. If you've already invested in an orchestration framework, a specific LLM, or custom TTS, you can drop in the same speech model that powers the Voice Agent API without changing anything else in your stack.&lt;/p&gt;

&lt;p&gt;Both paths share the same Universal-3 Pro speech recognition foundation. Transcription quality stays consistent regardless of which approach you choose. And because both use WebSocket connections and standard JSON, developers often prototype with the Voice Agent API for speed, then graduate to the bring-your-own-stack approach as their architecture matures and they want more control over LLM routing or TTS selection. Switching requires updating your connection endpoint and message handling—not rebuilding from scratch.&lt;/p&gt;

&lt;p&gt;This two-path approach is a big part of why developers consistently rank AssemblyAI as the best voice agent API. You get the simplicity of a managed pipeline when you want speed, and the flexibility of a raw STT API when you need control. You're not locked into a single integration pattern, and you're not forced to compromise on speech accuracy to get the architecture you want.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the right platform for your team
&lt;/h2&gt;

&lt;p&gt;For enterprise voice agents, the decision comes down to three factors: accuracy, compliance, and scalability. Get any of them wrong and you'll feel it in production through failed tasks, blocked deployments, or infrastructure that can't keep up with demand.&lt;/p&gt;

&lt;p&gt;AssemblyAI covers all three. The lowest missed entity rate in this comparison (16.7%) means your agents complete more tasks on the first attempt. SOC 2 Type 2, ISO 27001, and a BAA for healthcare mean your compliance team can sign off. Unlimited concurrency with flat-rate pricing means you scale without surprises.&lt;/p&gt;

&lt;p&gt;But beyond the specs, there's a practical question worth asking: what does it feel like to build on this platform? AssemblyAI's developer experience is deliberately simple—standard JSON over WebSocket, no proprietary SDKs, documentation you can read in 10 minutes. That simplicity compounds over time. Fewer moving parts mean fewer failure surfaces, faster debugging, and less operational overhead.&lt;/p&gt;

&lt;p&gt;The best way to evaluate any voice agent platform is to have a conversation with one. &lt;a href="https://www.assemblyai.com/playground" rel="noopener noreferrer"&gt;Try the live demo&lt;/a&gt;—no signup required—and judge the accuracy, turn detection, and conversation flow for yourself.&lt;/p&gt;

&lt;p&gt;Start building enterprise voice agents today&lt;/p&gt;

&lt;p&gt;Get your API key and have a working voice agent running this afternoon. Free tier available, no credit card required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.assemblyai.com/dashboard/signup" rel="noopener noreferrer"&gt;Sign up free&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the best API for building voice agents?
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's Voice Agent API is the best API for building voice agents in 2026, combining industry-leading speech accuracy (94.07% word accuracy, 16.7% missed entity rate) with a flat $4.50/hr rate that covers the entire STT, LLM, and TTS pipeline. It's a single WebSocket connection with no proprietary SDK required, and most developers ship a working agent the same day. For teams that want to bring their own LLM and TTS, Universal-3 Pro Streaming provides the same speech foundation at $0.45/hr as a standalone STT layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does it cost to build an enterprise voice agent?
&lt;/h3&gt;

&lt;p&gt;Costs vary significantly by platform. AssemblyAI's Voice Agent API charges a flat $4.50/hr covering speech understanding, LLM reasoning, and voice generation—no token math or separate invoices. OpenAI's Realtime API runs approximately $18/hr with per-token billing. Deepgram's voice agent offering is roughly $4.50/hr but requires concurrency commitments as you scale. For high-volume enterprise deployments, the billing model matters as much as the per-unit cost.&lt;/p&gt;
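
&lt;p&gt;To see how the hourly figures translate into a budget, here is a small estimate using the approximate rates quoted above. The workload (2,000 calls a day at 4 minutes each) is an assumption chosen for illustration, and treating per-token or concurrency-metered billing as a flat hourly rate is a simplification.&lt;/p&gt;

```python
# Approximate hourly rates quoted in this article (USD per hour of conversation).
HOURLY_RATE_USD = {
    "AssemblyAI Voice Agent API": 4.50,
    "OpenAI Realtime API": 18.00,
    "Deepgram Voice Agent API": 4.50,
}


def monthly_cost(rate_per_hour, calls_per_day, avg_call_minutes, days=30):
    """Estimated monthly spend for a given call volume at a flat hourly rate."""
    hours = calls_per_day * avg_call_minutes / 60 * days
    return rate_per_hour * hours


# Assumed workload: 2,000 calls a day, 4 minutes each.
for name, rate in HOURLY_RATE_USD.items():
    print(f"{name}: ${monthly_cost(rate, 2000, 4):,.0f}/month")
```

&lt;p&gt;That workload is 4,000 conversation hours a month, so a 4x difference in hourly rate becomes a 4x difference in the monthly bill. Swap in your own call volume and duration to get a figure you can plan around.&lt;/p&gt;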

&lt;h3&gt;
  
  
  What compliance certifications should a voice agent platform have?
&lt;/h3&gt;

&lt;p&gt;At minimum, enterprise voice agent platforms should hold SOC 2 Type 2 certification, which verifies ongoing security controls. Healthcare organizations need a Business Associate Addendum (BAA) in place before processing any patient data. AssemblyAI holds SOC 2 Type 2 and ISO 27001 certifications, with a BAA available for healthcare use cases and Medical Mode for improved clinical terminology accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use my own LLM with AssemblyAI's voice agent infrastructure?
&lt;/h3&gt;

&lt;p&gt;Yes. AssemblyAI offers two paths. The Voice Agent API includes a managed LLM as part of its all-in-one pipeline, but if you want full control, Universal-3 Pro Streaming gives you AssemblyAI's industry-leading STT as a standalone API that drops into your existing orchestration stack—whether that's LiveKit, Pipecat, or a custom pipeline. You bring your own LLM and TTS while getting the same speech accuracy foundation.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the best one-API solution for voice agents?
&lt;/h3&gt;

&lt;p&gt;AssemblyAI's Voice Agent API is the leading one-API solution for voice agents. It replaces three separate providers (STT, LLM, TTS) with a single WebSocket connection, one invoice measured in hours, and one set of logs. At $4.50/hr flat, it eliminates the token math and multi-vendor complexity that slows down development and creates fragile production systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does turn detection work in voice agents?
&lt;/h3&gt;

&lt;p&gt;Turn detection determines when a user has finished speaking and when they're trying to interrupt. Basic approaches use silence thresholds (VAD), which often cut users off mid-thought or create awkward pauses. AssemblyAI's Voice Agent API uses semantic turn detection that considers what the user actually said—not just silence—to decide if they're done. It distinguishes back-channel responses like "uh-huh" from real interruptions like "wait, stop," and adapts its timing to each user's speaking pace throughout the conversation.&lt;/p&gt;
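
&lt;p&gt;The difference between the two approaches can be sketched with a toy example. This is not how AssemblyAI's or any other vendor's turn detector works internally. The word lists, the 700 ms threshold, and the function names are all assumptions made up for illustration.&lt;/p&gt;

```python
# Toy comparison of silence-only and text-aware end-of-turn checks.
# Word lists and thresholds are illustrative assumptions, not the
# production behavior of any vendor's turn detector.
BACKCHANNELS = {"uh-huh", "mm-hmm", "yeah", "right", "okay"}
TRAILING_HINTS = {"and", "but", "so", "um", "uh", "because", "is"}


def silence_only_done(silence_ms, threshold_ms=700):
    """Plain VAD-style rule: the turn ends after enough silence."""
    return silence_ms >= threshold_ms


def text_aware_done(transcript, silence_ms, threshold_ms=700):
    """End the turn only if the silence is long enough AND the last word
    does not suggest the speaker is mid-thought."""
    words = transcript.lower().strip(" .,?!").split()
    if not words:
        return False
    if words[-1] in TRAILING_HINTS:
        # Wait three times longer before cutting in on an unfinished thought.
        return silence_ms >= 3 * threshold_ms
    return silence_ms >= threshold_ms


def is_interruption(user_text):
    """Treat short acknowledgements as back-channel, anything else as barge-in."""
    return user_text.lower().strip(" .,?!") not in BACKCHANNELS
```

&lt;p&gt;With an 800 ms pause after "My account number is", the silence-only rule ends the turn while the text-aware rule keeps listening. Likewise, "Uh-huh." is treated as back-channel, while "wait, stop" counts as an interruption.&lt;/p&gt;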

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>voiceassistant</category>
      <category>enterprise</category>
    </item>
  </channel>
</rss>
