<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gladia</title>
    <description>The latest articles on DEV Community by Gladia (@gladia-io).</description>
    <link>https://dev.to/gladia-io</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F13635%2F781b0ed6-e2c7-42e9-8554-732fc1870b0f.png</url>
      <title>DEV Community: Gladia</title>
      <link>https://dev.to/gladia-io</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gladia-io"/>
    <language>en</language>
    <item>
      <title>Introducing Solaria-3: The most accurate speech-to-text model for European languages</title>
      <dc:creator>Jean-Louis</dc:creator>
      <pubDate>Wed, 10 Jun 2026 22:00:00 +0000</pubDate>
      <link>https://dev.to/gladia-io/introducing-solaria-3-the-most-accurate-speech-to-text-model-for-european-languages-1b86</link>
      <guid>https://dev.to/gladia-io/introducing-solaria-3-the-most-accurate-speech-to-text-model-for-european-languages-1b86</guid>
      <description>&lt;p&gt;Today we're releasing Solaria-3 – the new #1 among leading speech-to-text providers on business audio and conversational speech, delivering the strongest accuracy on real English customer calls of any model tested. It is our best model to date, which we trained for the audio our customers deal with in real life: calls with background noise, people talking over each other, teams switching between a few languages in one meeting.&lt;/p&gt;

&lt;p&gt;Here's why it exists. For years we watched voice models top a public leaderboard, only for accuracy to fall apart the moment they ran on real customer recordings: sub-4% WER on LibriSpeech, then 15% on a sales call with a non-native speaker and a noisy room. The benchmarks weren't wrong. They were just measuring clean, scripted audio that no enterprise has ever recorded.&lt;/p&gt;

&lt;p&gt;So we built Solaria-3 to close that gap, and tested it against every major provider on the public benchmarks and on our own dataset of real customer calls annotated by humans. Solaria-3 ranks #1 in accuracy on the conditions that break other models. A model for multilingual Europe, built by a European player.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Solaria-3 is Gladia's best-in-class and most accurate speech-to-text model for European languages, built for noisy, accented, multi-speaker production audio.&lt;/li&gt;
&lt;li&gt;It ranks #1 on business audio, beating every major provider across the board.&lt;/li&gt;
&lt;li&gt;It improves over Solaria-1 across the five languages most popular among our users: English, French, German, Spanish, and Italian.&lt;/li&gt;
&lt;li&gt;Solaria-1 still wins on clean read-speech, formal audio, and 100+ language coverage. The two models are built to work together, not to replace each other.&lt;/li&gt;
&lt;li&gt;Solaria-3 comes with our usual compliance coverage (SOC 2 Type II, HIPAA, GDPR, ISO 27001) and is available on both EU and US clusters with full data sovereignty.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How we measure accuracy
&lt;/h2&gt;

&lt;p&gt;Most public benchmarks are measured on clean, studio-quality read speech. Take LibriSpeech, the most widely cited benchmark. It consists of audiobook recordings: a single speaker, no background noise, careful enunciation. These conditions don't exist in production. So we evaluated Solaria-3 on two types of data:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Public benchmarks:&lt;/strong&gt; Earnings22 (financial and business speech), Switchboard (conversational telephone audio), Common Voice (diverse accents and speakers), FLEURS (clean multilingual audio), VoxPopuli (parliamentary speech across EU languages) and Multilingual LibriSpeech (included for reference despite its limits). These allow direct comparison with other providers, and the harder ones among them are where the benchmark results come closest to reflecting real production audio: noisy, spontaneous, and conversational.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gladia's internal production dataset:&lt;/strong&gt; real customer recordings across five European languages. This is the closest thing to what you'll see in your pipeline, and we lean on it because public benchmarks can be gamed: it's a lot harder to overfit to audio nobody else has.&lt;/p&gt;

&lt;p&gt;All benchmark results are published at &lt;a href="https://www.gladia.io/solaria-3" rel="noopener noreferrer"&gt;gladia.io/solaria-3&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Solaria-3?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  #1 on real English audio
&lt;/h3&gt;

&lt;p&gt;On our internal English production dataset, made up of professional meeting recordings and customer support calls, Solaria-3 achieves 9.6% WER, placing it at the top of the field and showing a 26% relative improvement over Solaria-1 (12.9%).&lt;/p&gt;

&lt;p&gt;On Earnings22 Cleaned AA, the industry standard for financial and business speech, Solaria-3 ranks #1 at 6.4% WER — the only model under 7%, ahead of AssemblyAI (6.9%), ElevenLabs (7.7%), Speechmatics (7.8%), Mistral (7.9%), and Deepgram (12.0%).&lt;/p&gt;

&lt;p&gt;The gains show up most on the audio that breaks other models: fast-paced multi-speaker calls, non-native accented English, and dense domain vocabulary.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example 1: 15-minute earnings call (Qudian Q3 2021)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-3&lt;/td&gt;
&lt;td&gt;4.2%&lt;/td&gt;
&lt;td&gt;"Qudian's third quarter 2021 earnings conference..."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AssemblyAI&lt;/td&gt;
&lt;td&gt;4.7%&lt;/td&gt;
&lt;td&gt;Similar to reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-1&lt;/td&gt;
&lt;td&gt;7.3%&lt;/td&gt;
&lt;td&gt;Similar to reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;8.5%&lt;/td&gt;
&lt;td&gt;Writes numbers as words throughout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;10.7%&lt;/td&gt;
&lt;td&gt;"cugen's third quarter twenty twenty one earnings..."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Deepgram mangles the company name and writes all numbers as words, which is the kind of error that makes downstream parsing unreliable on every financial call it processes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example 2: 20-minute earnings briefing, non-native English speaker (TDK Q3 FY2022)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-3&lt;/td&gt;
&lt;td&gt;11.2%&lt;/td&gt;
&lt;td&gt;#1 overall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral&lt;/td&gt;
&lt;td&gt;11.6%&lt;/td&gt;
&lt;td&gt;Similar to reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-1&lt;/td&gt;
&lt;td&gt;13.2%&lt;/td&gt;
&lt;td&gt;Similar to reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AssemblyAI&lt;/td&gt;
&lt;td&gt;13.5%&lt;/td&gt;
&lt;td&gt;Similar to reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;16.0%&lt;/td&gt;
&lt;td&gt;Paraphrases instead of transcribing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;16.8%&lt;/td&gt;
&lt;td&gt;Writes fiscal year quarters as words&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Accented English is still one of the hardest problems in speech-to-text. Solaria-3 leads here even against Mistral, which performs well on clean audio but struggles when a heavy accent and compressed audio are combined.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example 3: Internal production call, fintech discussion (PayPal merchant matching)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-3&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;Perfect transcript&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AssemblyAI&lt;/td&gt;
&lt;td&gt;7.8%&lt;/td&gt;
&lt;td&gt;Errors on technical terms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral&lt;/td&gt;
&lt;td&gt;9.4%&lt;/td&gt;
&lt;td&gt;Errors on technical terms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-1&lt;/td&gt;
&lt;td&gt;10.9%&lt;/td&gt;
&lt;td&gt;Errors on technical terms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A real customer conversation about PayPal merchant transactions: informal register, domain jargon, incomplete sentences. Solaria-3 handles it perfectly. The difference is meaningful for any sales intelligence or conversation analytics tool where technical terms are the signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Superior accuracy on noise and conversational speech
&lt;/h3&gt;

&lt;p&gt;On noisy audio, Solaria-3 reaches 1.4% WER, beating most production providers, including AssemblyAI (2.1%), Deepgram (3.2%), and ElevenLabs (4.0%). On Switchboard — the hardest conversational telephone benchmark in the suite, using degraded 8kHz phone audio — Solaria-3 is #1 at 33.9% WER, the only model under 35%.&lt;/p&gt;

&lt;p&gt;The Switchboard result is particularly significant: ElevenLabs reaches 55.2% WER on this benchmark. That is a critical failure on the kind of audio that contact centers process millions of hours of every day.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example 1: Real-world background noise (Hugging Face database)
&lt;/h4&gt;

&lt;p&gt;Reference: "The actual primary rainbow observed is said to be the effect of superimposition of a number of bows."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;Hypothesis&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-3&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;Perfect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AssemblyAI&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;Perfect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-1&lt;/td&gt;
&lt;td&gt;4.2%&lt;/td&gt;
&lt;td&gt;"...superimposition of a number of bones"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speechmatics&lt;/td&gt;
&lt;td&gt;4.2%&lt;/td&gt;
&lt;td&gt;"...superimposition of a number of bowls"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral&lt;/td&gt;
&lt;td&gt;4.2%&lt;/td&gt;
&lt;td&gt;"...superimposition of a number of bones"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;12.5%&lt;/td&gt;
&lt;td&gt;"super imposition of a number of bones"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Background noise causes four providers to mishear "bows" as "bones" or "bowls." A substitution of this kind changes the meaning of the sentence entirely. This is exactly the class of error that WER on clean audio cannot predict.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example 2: Heavy background noise, multi-sentence passage (Hugging Face database)
&lt;/h4&gt;

&lt;p&gt;Reference: "We are on a four-year mission. We didn't and it cost us the game. It can be very worrying. We need to regroup. Four policemen were wounded."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;Hypothesis&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-3&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;Perfect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-1&lt;/td&gt;
&lt;td&gt;3.4%&lt;/td&gt;
&lt;td&gt;Similar to reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speechmatics&lt;/td&gt;
&lt;td&gt;20.7%&lt;/td&gt;
&lt;td&gt;Hallucinates "Artificial intelligence"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AssemblyAI&lt;/td&gt;
&lt;td&gt;31.0%&lt;/td&gt;
&lt;td&gt;"It's not just made up by human"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;51.7%&lt;/td&gt;
&lt;td&gt;"we did it and it cost us the game... artificial intelligence"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;103.4%&lt;/td&gt;
&lt;td&gt;Hallucinates entire additional sentences&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ElevenLabs reaches 103% WER, meaning it hallucinated more words than were actually spoken. Under noisy conditions, the failure mode is not just inaccuracy; it is confabulation. Models that hallucinate content on degraded audio are unsuitable for any use case where faithfulness to what was said is critical.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example 3: Switchboard, degraded 8kHz telephone audio
&lt;/h4&gt;

&lt;p&gt;Reference: "yeah not not even that much probably yeah"&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;Hypothesis&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-3&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;"Yeah, not not even that much probably. Yeah."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AssemblyAI&lt;/td&gt;
&lt;td&gt;62.5%&lt;/td&gt;
&lt;td&gt;Hallucinates "Well, that would be—"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral&lt;/td&gt;
&lt;td&gt;87.5%&lt;/td&gt;
&lt;td&gt;Hallucinates "Well, that would be a bit"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;87.5%&lt;/td&gt;
&lt;td&gt;Hallucinates "well that would be yeah be a time"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-1&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;td&gt;Hallucinates "Well, that would be a good time."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;td&gt;Hallucinates "it would be it"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every other model hallucinates words that were never spoken. On phone-quality audio, hallucination is the primary failure mode, and it's the hardest to catch in production because the output looks plausible.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Examples are individual utterances chosen to illustrate failure modes, not aggregate scores. Average WER across all tested audio is reported in the benchmarks section.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Most accurate model for European languages
&lt;/h3&gt;

&lt;p&gt;Multilingual accuracy has been core to Gladia since day one, which is why Solaria-1 supports 100+ languages. Solaria-3 extends that commitment with a focused push on European production quality: consistent improvement over Solaria-1 across English, French, German, Spanish, and Italian on our internal production dataset. The table below shows the relative WER change versus Solaria-1 on real customer audio and on Common Voice 24; negative numbers mean fewer errors.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Real customer audio&lt;/th&gt;
&lt;th&gt;Common Voice 24&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;English (EN)&lt;/td&gt;
&lt;td&gt;−26%&lt;/td&gt;
&lt;td&gt;−16%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;French (FR)&lt;/td&gt;
&lt;td&gt;−18%&lt;/td&gt;
&lt;td&gt;−19%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Italian (IT)&lt;/td&gt;
&lt;td&gt;−10%&lt;/td&gt;
&lt;td&gt;−12%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spanish (ES)&lt;/td&gt;
&lt;td&gt;−9%&lt;/td&gt;
&lt;td&gt;≈ flat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;German (DE)&lt;/td&gt;
&lt;td&gt;−3%&lt;/td&gt;
&lt;td&gt;−13%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The gains show up on the vocabulary that matters most in production: proper nouns, domain terms, place names, and precise verbs where a single wrong word changes the meaning of a sentence.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example 1: Challenging accent, EN (Common Voice)
&lt;/h4&gt;

&lt;p&gt;Reference: "Thus the Byzantines were forced to fight alone."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;Hypothesis&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-3&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;Perfect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speechmatics&lt;/td&gt;
&lt;td&gt;12.5%&lt;/td&gt;
&lt;td&gt;"focused" instead of "forced"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-1&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;"Thus the bison tens were focused to fight lone"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;"Thus the Bison Tens were focused to fight lone"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three independent errors on an 8-word sentence: a proper noun mangled, a verb wrong, an adverb truncated. This is not an edge case. It is representative of what happens to accented speech on models not optimised for it.&lt;/p&gt;
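&lt;p&gt;For readers who want to check the per-utterance numbers, here is a minimal sketch of how word error rate is commonly computed: word-level edit distance divided by the number of reference words. The &lt;code&gt;normalize&lt;/code&gt; step (lowercasing and stripping punctuation) is an assumption made for illustration; the exact normalization used for the benchmarks in this post is not specified here.&lt;/p&gt;

```python
import re

def normalize(text):
    # Assumption: lowercase and drop punctuation before scoring.
    return re.sub(r"[^\w\s']", " ", text.lower()).split()

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word deleted
                            curr[j - 1] + 1,      # hypothesis word inserted
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "Thus the Byzantines were forced to fight alone."
hypothesis = "Thus the bison tens were focused to fight lone"
print(f"{wer(reference, hypothesis):.1%}")  # prints 50.0%
```

&lt;p&gt;Splitting "Byzantines" into "bison tens" costs one substitution plus one insertion, which together with "focused" and "lone" gives four edits over eight reference words. Because inserted words count as errors, WER can exceed 100% when a hypothesis adds more words than the reference contains.&lt;/p&gt;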

&lt;h4&gt;
  
  
  Example 2: Spanish proper noun (Common Voice)
&lt;/h4&gt;

&lt;p&gt;Reference: "Al acabar la temporada volvió al Alcorcón."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;Hypothesis&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-3&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;Perfect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-1&lt;/td&gt;
&lt;td&gt;14.3%&lt;/td&gt;
&lt;td&gt;"volvió al corcón"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral&lt;/td&gt;
&lt;td&gt;14.3%&lt;/td&gt;
&lt;td&gt;"volvió al Corcón"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepgram&lt;/td&gt;
&lt;td&gt;28.6%&lt;/td&gt;
&lt;td&gt;"volvió al al corcón"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ElevenLabs&lt;/td&gt;
&lt;td&gt;28.6%&lt;/td&gt;
&lt;td&gt;"volvió al, al Corcón"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Alcorcón is a city of 170,000 people near Madrid. Every provider except Solaria-3 drops the "Al" prefix, producing a word that does not exist. For any application involving Spanish place names, including logistics, customer service, and local business, this class of error matters.&lt;/p&gt;

&lt;h4&gt;
  
  
  Example 3: Conversational French with stuttering (internal dataset)
&lt;/h4&gt;

&lt;p&gt;Reference: "Non, observe, attends et émerveille-toi... il s'agit, il-il-il advient, pardon, il advient ce que le bébé ou le fœtus même aurait eu besoin..."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;Hypothesis&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-3&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;Faithfully captures "il, il, il advient..."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Solaria-1&lt;/td&gt;
&lt;td&gt;15.8%&lt;/td&gt;
&lt;td&gt;Smooths over the hesitations, drops words&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral&lt;/td&gt;
&lt;td&gt;15.8%&lt;/td&gt;
&lt;td&gt;Smooths over the hesitations, drops words&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In verbatim transcription (meeting notes, medical records, legal depositions) the hesitations are not noise to be cleaned. They are part of the record. Solaria-3 captures them; most other models silently delete them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Solaria-1 is still the better choice
&lt;/h2&gt;

&lt;p&gt;We don't think Solaria-3 should replace Solaria-1 everywhere. Here's where Solaria-1 still wins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual LibriSpeech:&lt;/strong&gt; Solaria-3 scores 8.0% WER against Solaria-1's 5.9%, a 36% relative regression. It's a clean read-speech benchmark spanning a lot of languages, so if your audio is mostly clean, read-aloud material across a wide language range, Solaria-1 is the better pick.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VoxPopuli Cleaned AA:&lt;/strong&gt; The gap holds on formal, institutional audio too. Solaria-3 scores 2.9% to Solaria-1's 2.2%, a 32% relative regression, and Solaria-1 stays ahead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broad multilingual coverage:&lt;/strong&gt; Solaria-3 is tuned for five languages: EN, FR, DE, ES, and IT. Solaria-1 covers 100+, including 42 that no other API supports. If you need rare-language coverage or real multilingual breadth, Solaria-1 is still the right call.&lt;/li&gt;
&lt;/ul&gt;
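
&lt;p&gt;The relative figures quoted in this post can be reproduced from the absolute WER values with a few lines of arithmetic. This is only a sketch; &lt;code&gt;relative_change&lt;/code&gt; is a helper name introduced here for illustration.&lt;/p&gt;

```python
def relative_change(baseline_wer, new_wer):
    """Relative WER change of the new model versus the baseline, as a fraction."""
    return (new_wer - baseline_wer) / baseline_wer

# (Solaria-1 WER, Solaria-3 WER) in percent, as quoted in this post.
results = {
    "Multilingual LibriSpeech": (5.9, 8.0),
    "VoxPopuli Cleaned AA": (2.2, 2.9),
    "Internal English production set": (12.9, 9.6),
}

for name, (baseline, new) in results.items():
    # A positive value is a regression, a negative value an improvement.
    print(f"{name}: {relative_change(baseline, new):+.0%}")
```

&lt;p&gt;This prints +36%, +32%, and -26% respectively, matching the regressions listed above and the English production improvement reported earlier.&lt;/p&gt;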

&lt;p&gt;The two models are built to work together, not to replace each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try Solaria-3 for free
&lt;/h2&gt;

&lt;p&gt;Solaria-3 is live today in Gladia's API. It's free with code &lt;code&gt;TRY-SOLARIA-3&lt;/code&gt; at checkout. Go to &lt;strong&gt;Billing → Add payment method → Add promo code&lt;/strong&gt;. The code is redeemable once per account for async transcription.&lt;/p&gt;

&lt;p&gt;To switch to Solaria-3 in your API calls:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In your transcription request, set the model parameter:&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://api.gladia.io/v2/transcription &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"x-gladia-key: YOUR_API_KEY"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "audio_url": "https://your-audio-file.com/audio.mp3",
    "model": "solaria-3"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
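
&lt;p&gt;If you call the API from Python rather than the shell, the same request can be assembled with the standard library. This is a minimal sketch that mirrors the curl call above: the endpoint, header name, and JSON fields are copied from it, while &lt;code&gt;build_request&lt;/code&gt; is a helper name introduced here for illustration. The block only builds the request and does not send it.&lt;/p&gt;

```python
import json
import urllib.request

def build_request(api_key, audio_url, model="solaria-3"):
    # Endpoint, header name, and JSON fields mirror the curl example above.
    payload = {"audio_url": audio_url, "model": model}
    return urllib.request.Request(
        "https://api.gladia.io/v2/transcription",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-gladia-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_request("YOUR_API_KEY", "https://your-audio-file.com/audio.mp3")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

&lt;p&gt;Passing &lt;code&gt;req&lt;/code&gt; to &lt;code&gt;urllib.request.urlopen&lt;/code&gt; with a real key would submit the job. The response format is not covered in this post, so check the documentation before parsing it.&lt;/p&gt;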



&lt;p&gt;After the free trial, Solaria-3 is billed at standard API rates. Full documentation is available at &lt;a href="https://docs.gladia.io" rel="noopener noreferrer"&gt;docs.gladia.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you have questions or want to share feedback on how Solaria-3 performs on your audio, reach out at &lt;a href="mailto:support@gladia.io"&gt;support@gladia.io&lt;/a&gt; or join the &lt;a href="https://discord.gg/gladia" rel="noopener noreferrer"&gt;Gladia Discord&lt;/a&gt;. Solaria-1 remains available and fully supported.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is Solaria-3 the most accurate speech-to-text model?
&lt;/h3&gt;

&lt;p&gt;On the audio that matters most in production, such as business calls, Solaria-3 ranks #1 on most benchmarks Gladia tested, leading on Earnings22 (6.4% WER), Switchboard (33.9% WER), and Gladia's internal English production dataset (9.6% WER). It is not #1 everywhere: Mistral Voxtral edges it out on noisy audio (1.0% vs. 1.4%), and Solaria-1 remains more accurate on clean read-speech and formal institutional audio.&lt;/p&gt;

&lt;h3&gt;
  
  
  What languages does Solaria-3 support?
&lt;/h3&gt;

&lt;p&gt;Solaria-3 is optimized for five European languages: English, French, German, Spanish, and Italian. For broader coverage, Solaria-1 supports 100+ languages, including 42 not available through any other API.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I use Solaria-3 or Solaria-1?
&lt;/h3&gt;

&lt;p&gt;Use Solaria-3 for European real-world audio: business calls, contact centers, and noisy or accented recordings. Use Solaria-1 for clean read-speech, formal institutional audio, or languages outside the core five. The two models are designed to complement each other, not replace each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does Solaria-3 compare to Deepgram, AssemblyAI, and ElevenLabs?
&lt;/h3&gt;

&lt;p&gt;On Earnings22, Solaria-3 (6.4% WER) beats AssemblyAI (6.9%), ElevenLabs (7.7%), and Deepgram (12.0%). On Switchboard, it reaches 33.9% WER while ElevenLabs reaches 55.2%. On noisy audio it outperforms all three at 1.4% WER, though Mistral Voxtral leads that test at 1.0%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where does Solaria-3 underperform Solaria-1?
&lt;/h3&gt;

&lt;p&gt;On Multilingual LibriSpeech (8.0% vs. 5.9%, a 36% relative regression) and VoxPopuli (2.9% vs. 2.2%, a 32% relative regression). The first is a clean read-speech benchmark and the second is formal institutional audio. These regressions are published openly. For that kind of audio, Solaria-1 is the better choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  How was Solaria-3 benchmarked?
&lt;/h3&gt;

&lt;p&gt;On public benchmarks (Earnings22, Switchboard, Common Voice, FLEURS, VoxPopuli, and Multilingual LibriSpeech) for direct comparison with other providers, and on Gladia's internal dataset of real customer recordings across five European languages — human-annotated and far harder to overfit to than public data.&lt;/p&gt;

&lt;h3&gt;
  
  
  How much does Solaria-3 cost and how do I try it?
&lt;/h3&gt;

&lt;p&gt;Solaria-3 is free for 5 days with the code &lt;code&gt;TRY-SOLARIA-3&lt;/code&gt; (Billing → Add payment method → Add promo code). After the trial, it's billed at standard API rates. To use it, set &lt;code&gt;"model": "solaria-3"&lt;/code&gt; in your transcription request.&lt;/p&gt;

</description>
      <category>speechtotext</category>
      <category>asr</category>
      <category>asynctranscription</category>
    </item>
    <item>
      <title>Building real-time multilingual ASR with code-switching</title>
      <dc:creator>Jean-Louis</dc:creator>
      <pubDate>Sun, 31 May 2026 22:00:00 +0000</pubDate>
      <link>https://dev.to/gladia-io/building-real-time-multilingual-asr-with-code-switching-3561</link>
      <guid>https://dev.to/gladia-io/building-real-time-multilingual-asr-with-code-switching-3561</guid>
      <description>&lt;p&gt;When a speaker switches languages, traditional models keep outputting the previous one for several hundred milliseconds before catching up, producing garbled text and inaccurate timestamps. The obvious fix is a large multilingual model. But those are expensive to run, awkward to deploy on-device, and still stumble on fast switches.&lt;/p&gt;

&lt;p&gt;Bruno Hays, a Lead ML Speech Engineer at Gladia, went the other way. In his &lt;a href="https://www.youtube.com/watch?v=zNds-EIDwWo" rel="noopener noreferrer"&gt;original research&lt;/a&gt;, instead of one heavy model that tries to know every language at once, he built a lightweight, modular ensemble that routes between small, specialized models and runs efficiently on standard CPUs.&lt;/p&gt;

&lt;p&gt;Here's how the system works, how it stacks up against leading commercial and open-source alternatives, and where it still hits a wall. The code is fully open source.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Real-time multilingual transcription lags on code-switching: streaming models keep outputting the old language for a few hundred milliseconds, producing garbled text and bad timestamps.&lt;/li&gt;
&lt;li&gt;The fix is a lightweight, CPU-friendly ensemble that routes between small (~100M param) monolingual streaming Zipformer models instead of one heavy multilingual model.&lt;/li&gt;
&lt;li&gt;On inter-utterance code-switching, it hit ~13% WER, beating Deepgram Nova-3 (~14%) and the much larger local Voxtral-Mini-4B (~21%).&lt;/li&gt;
&lt;li&gt;On intra-utterance switching, VAD can't segment switches fast enough and WER climbs to ~41%, behind cloud APIs like ElevenLabs (~26%).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture: A Modular Ensemble Instead of One Monolithic Model
&lt;/h2&gt;

&lt;p&gt;The core idea is simple: rather than asking one model to be fluent in every language at once, hand each language to a specialist and add a thin layer of logic to decide who should be listening.&lt;/p&gt;

&lt;p&gt;The pipeline replaces high-parameter monolithic models with a modular ecosystem of three specialized components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VAD (Voice Activity Detection):&lt;/strong&gt; Driven by Silero V6 to identify speech boundaries with minimal latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ASR (Automatic Speech Recognition):&lt;/strong&gt; Streaming Zipformer models served via the sherpa-onnx framework. At only ~100M parameters, these models are purpose-built for efficient, CPU-bound transcription.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LID (Language Identification):&lt;/strong&gt; Powered by SpeechBrain's lang-id-voxlingua107-ecapa for language detection across up to 107 languages.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Asynchronous Rollback Pipeline
&lt;/h3&gt;

&lt;p&gt;To minimize language lag during transitions, the system implements an Asynchronous Rollback Pipeline. It works in three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Immediate transcription:&lt;/strong&gt; The system transcribes instantly using the currently active monolingual ASR engine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronous monitoring:&lt;/strong&gt; Following each speech boundary identified by the VAD, the system triggers LID checks on expanding audio windows in the background.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The rollback trigger:&lt;/strong&gt; If the LID detects a language switch with high confidence, the system:

&lt;ul&gt;
&lt;li&gt;Switches the active ASR stream to the new language model&lt;/li&gt;
&lt;li&gt;Rolls back the transcript to the exact segment boundary where the switch occurred&lt;/li&gt;
&lt;li&gt;Re-infers the buffered audio for that specific range&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The corrected text is then injected into the stream. The user only sees the wrong-language artifacts briefly in partials, while the final transcripts feature clean language boundaries.&lt;/p&gt;
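
&lt;p&gt;The control flow of the three steps can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the project's actual implementation: &lt;code&gt;toy_asr&lt;/code&gt; and &lt;code&gt;toy_lid&lt;/code&gt; are hypothetical stand-ins for the Zipformer engines and the LID model, the 0.8 confidence threshold is invented for the example, and the LID check runs inline rather than in the background.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Segment:
    audio: str  # stand-in for the buffered audio of one VAD segment
    lang: str   # ground-truth language, used only by the toy stubs below

def toy_asr(model_lang, segment):
    # Hypothetical monolingual engine: clean text only when languages match.
    if model_lang == segment.lang:
        return f"[{model_lang}] {segment.audio}"
    return f"[{model_lang}] (garbled)"

def toy_lid(segment):
    # Hypothetical LID model: returns (language, confidence).
    return segment.lang, 0.95

def transcribe_stream(segments, start_lang, threshold=0.8):
    active = start_lang
    partials, finals = [], []
    for segment in segments:
        # 1. Immediate transcription with the currently active engine.
        text = toy_asr(active, segment)
        partials.append(text)
        finals.append(text)
        # 2. LID check once the VAD has closed the segment.
        lang, confidence = toy_lid(segment)
        # 3. Rollback: switch engines, discard the stale text for this
        #    segment, and re-infer the buffered audio.
        if lang != active and confidence >= threshold:
            active = lang
            finals.pop()
            finals.append(toy_asr(active, segment))
    return partials, finals

segments = [
    Segment("hello everyone", "en"),
    Segment("bonjour à tous", "fr"),
    Segment("merci beaucoup", "fr"),
]
partials, finals = transcribe_stream(segments, start_lang="en")
print(partials)
print(finals)
```

&lt;p&gt;In this run the second partial is garbled because the English engine was still active when the French segment arrived, while the final transcript for that segment comes from the French engine after the rollback. The third segment needs no correction because the switch has already happened.&lt;/p&gt;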




&lt;h2&gt;
  
  
  How We Evaluated the Pipeline
&lt;/h2&gt;

&lt;p&gt;Speech recognition performance isn't a single number — it shifts with context, speaker, and how messy the audio gets. The pipeline was stress-tested across three datasets that get progressively harder, each targeting a different transcription scenario.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Google FLEURS (Monolingual Baseline)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset:&lt;/strong&gt; &lt;a href="https://huggingface.co/datasets/google/fleurs" rel="noopener noreferrer"&gt;Google FLEURS&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it measures:&lt;/strong&gt; High-quality, single-language utterances used to measure the baseline performance of the underlying streaming models. All providers show some degradation on monolingual audio when moved away from a native monolingual setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Synthetic FLEURS Blend (Inter-Utterance Code-Switching)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset:&lt;/strong&gt; &lt;a href="https://huggingface.co/datasets/BrunoHays/fleurs_code_switching_test" rel="noopener noreferrer"&gt;BrunoHays/fleurs_code_switching_test&lt;/a&gt;, a custom-compiled subset derived from the original Google FLEURS corpus.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it measures:&lt;/strong&gt; Built by mixing samples from eight major languages (en, fr, es, de, ru, it, pt, nl) chained sequentially. Measures inter-utterance code-switching, where a speaker switches languages between sentences or distinct speech boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Bangor-Miami Corpus (Intra-Utterance Code-Switching)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset:&lt;/strong&gt; &lt;a href="https://huggingface.co/datasets/BrunoHays/Bangor-Miami-Spanish-English-Corpus" rel="noopener noreferrer"&gt;BrunoHays/Bangor-Miami-Spanish-English-Corpus&lt;/a&gt;, adapted from the &lt;a href="https://mozilladatacollective.com/datasets/cmmfulo4r018bnz07py4q9t09" rel="noopener noreferrer"&gt;Mozilla Data Collective version&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What it measures:&lt;/strong&gt; Conversational dataset capturing informal speech among bilingual speakers. Measures intra-utterance code-switching, where people randomly mix words from different languages inside the same sentence (e.g., Spanglish) with no acoustic pauses between language flips.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Benchmarks and Models Used
&lt;/h2&gt;

&lt;p&gt;This implementation uses the following open-source streaming Zipformer models, run with a 640 ms chunk size:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Russian (ru):&lt;/strong&gt; &lt;a href="https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-small-ru-vosk-2025-08-16" rel="noopener noreferrer"&gt;Alpha Cephei's small Russian streaming Zipformer model&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Other languages (en, fr, es, de, it, pt, nl):&lt;/strong&gt; &lt;a href="https://huggingface.co/Banafo/Kroko-ASR" rel="noopener noreferrer"&gt;KrokoAI's streaming Zipformer models&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
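&lt;p&gt;For a sense of scale, the snippet below shows what a 640 ms chunk means in samples. The 16 kHz sample rate is an assumption (the article does not state it), and &lt;code&gt;iter_chunks&lt;/code&gt; is a hypothetical helper, not part of any model's API.&lt;/p&gt;

```python
from typing import Iterator, List

SAMPLE_RATE_HZ = 16_000  # assumption: not stated in the article
CHUNK_MS = 640           # chunk size quoted for the Zipformer models

CHUNK_SAMPLES = SAMPLE_RATE_HZ * CHUNK_MS // 1000


def iter_chunks(
    samples: List[float], size: int = CHUNK_SAMPLES
) -> Iterator[List[float]]:
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


two_seconds = [0.0] * (2 * SAMPLE_RATE_HZ)
chunks = list(iter_chunks(two_seconds))
print(CHUNK_SAMPLES, [len(c) for c in chunks])
# prints: 10240 [10240, 10240, 10240, 1280]
```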

&lt;p&gt;Compared against:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deepgram Nova-3 (multilingual mode)&lt;/li&gt;
&lt;li&gt;ElevenLabs Scribe v2&lt;/li&gt;
&lt;li&gt;MistralAI Voxtral-Mini-4B-Realtime (480 ms target streaming delay)&lt;/li&gt;
&lt;li&gt;AssemblyAI u3-rt-pro &lt;em&gt;(note: does not support Dutch and Russian)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Inter-Utterance Code-Switching (Synthetic Dataset)
&lt;/h3&gt;

&lt;p&gt;When language shifts happen at natural speech boundaries, this architecture performs well. The routing logic achieved a &lt;strong&gt;~13% WER&lt;/strong&gt; (13.2%), the lowest of the systems tested on this dataset: narrowly ahead of ElevenLabs Scribe v2 (13.6%) and Deepgram Nova-3 (14.3%), and well ahead of the larger local Voxtral-Mini-4B (21.3%). AssemblyAI u3-rt-pro has no result on this dataset.&lt;/p&gt;
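&lt;p&gt;For readers less familiar with the metric, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a plain textbook implementation; the lowercasing and whitespace splitting are simplifying assumptions, since the article does not describe the text normalization used in the benchmark.&lt;/p&gt;

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "hola how are you doing today"
hypothesis = "ola how are you today"
# One substitution (hola -> ola) and one deletion (doing): 2 errors / 6 words.
print(round(word_error_rate(reference, hypothesis), 3))  # prints 0.333
```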

&lt;h3&gt;
  
  
  Intra-Utterance Code-Switching (Miami Corpus)
&lt;/h3&gt;

&lt;p&gt;Here, the limitations of a VAD-reliant architecture become apparent. Because language switches happen faster than a VAD speech boundary can segment them, the rollback mechanism is not triggered effectively.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This system's WER degraded to &lt;strong&gt;~41%&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;It was outperformed by all three cloud APIs: ElevenLabs (26.5%), Deepgram (28.6%), and AssemblyAI (34.0%).&lt;/li&gt;
&lt;li&gt;It still outperformed the local Voxtral-Mini-4B, which collapsed with a ~76% WER.&lt;/li&gt;
&lt;/ul&gt;
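&lt;p&gt;A toy model makes the failure mode easier to see. The function below counts how many language switches line up with the start of a VAD segment, where a boundary-triggered check can catch them, and how many fall inside a segment. The word-level language labels and the &lt;code&gt;classify_switches&lt;/code&gt; helper are hypothetical; the real pipeline has no such labels at inference time.&lt;/p&gt;

```python
from typing import List, Tuple


def classify_switches(
    word_langs: List[str], segment_starts: List[int]
) -> Tuple[int, int]:
    """Return (switches at a segment start, switches inside a segment).

    word_langs: language label of each word, in spoken order.
    segment_starts: indices of the words that begin a new VAD segment.
    """
    starts = set(segment_starts)
    at_boundary = 0
    inside_segment = 0
    for i in range(1, len(word_langs)):
        if word_langs[i] != word_langs[i - 1]:
            if i in starts:
                at_boundary += 1
            else:
                inside_segment += 1
    return at_boundary, inside_segment


# Inter-utterance: the switch coincides with the start of the second segment.
print(classify_switches(["en", "en", "en", "es", "es"], [0, 3]))  # (1, 0)
# Intra-utterance: a single segment with three mid-sentence flips.
print(classify_switches(["en", "es", "es", "en", "es"], [0]))     # (0, 3)
```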




&lt;h2&gt;
  
  
  Detailed WER Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  WER Benchmark Comparison: FLEURS, Code-Switch, and Miami Datasets
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Voxtral Mini&lt;/th&gt;
&lt;th&gt;Deepgram Nova-3&lt;/th&gt;
&lt;th&gt;ElevenLabs Scribe v2&lt;/th&gt;
&lt;th&gt;AAI u3-rt-pro&lt;/th&gt;
&lt;th&gt;This Work&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Italian&lt;/td&gt;
&lt;td&gt;3.30%&lt;/td&gt;
&lt;td&gt;7.10%&lt;/td&gt;
&lt;td&gt;2.00%&lt;/td&gt;
&lt;td&gt;2.50%&lt;/td&gt;
&lt;td&gt;4.30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Russian&lt;/td&gt;
&lt;td&gt;5.20%&lt;/td&gt;
&lt;td&gt;10.20%&lt;/td&gt;
&lt;td&gt;7.40%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;14.30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Portuguese&lt;/td&gt;
&lt;td&gt;4.90%&lt;/td&gt;
&lt;td&gt;10.30%&lt;/td&gt;
&lt;td&gt;3.90%&lt;/td&gt;
&lt;td&gt;3.90%&lt;/td&gt;
&lt;td&gt;9.20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;12.40%&lt;/td&gt;
&lt;td&gt;10.20%&lt;/td&gt;
&lt;td&gt;4.30%&lt;/td&gt;
&lt;td&gt;3.40%&lt;/td&gt;
&lt;td&gt;13.20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;French&lt;/td&gt;
&lt;td&gt;8.80%&lt;/td&gt;
&lt;td&gt;10.60%&lt;/td&gt;
&lt;td&gt;5.50%&lt;/td&gt;
&lt;td&gt;4.10%&lt;/td&gt;
&lt;td&gt;9.90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dutch&lt;/td&gt;
&lt;td&gt;8.20%&lt;/td&gt;
&lt;td&gt;13.30%&lt;/td&gt;
&lt;td&gt;5.30%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;11.90%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;German&lt;/td&gt;
&lt;td&gt;6.00%&lt;/td&gt;
&lt;td&gt;10.00%&lt;/td&gt;
&lt;td&gt;3.90%&lt;/td&gt;
&lt;td&gt;3.40%&lt;/td&gt;
&lt;td&gt;11.00%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spanish&lt;/td&gt;
&lt;td&gt;3.10%&lt;/td&gt;
&lt;td&gt;5.90%&lt;/td&gt;
&lt;td&gt;2.70%&lt;/td&gt;
&lt;td&gt;2.40%&lt;/td&gt;
&lt;td&gt;4.70%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code-switch&lt;/td&gt;
&lt;td&gt;21.30%&lt;/td&gt;
&lt;td&gt;14.30%&lt;/td&gt;
&lt;td&gt;13.60%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;13.20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Miami&lt;/td&gt;
&lt;td&gt;76.30%&lt;/td&gt;
&lt;td&gt;28.60%&lt;/td&gt;
&lt;td&gt;26.50%&lt;/td&gt;
&lt;td&gt;34.00%&lt;/td&gt;
&lt;td&gt;41.00%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average (monolingual languages)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.50%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.70%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.40%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.30%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.80%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
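&lt;p&gt;A note on reading the Average row: it is the mean over the eight monolingual language rows only, and the Code-switch and Miami rows are excluded. For AAI u3-rt-pro, the mean covers the six languages it supports. The short check below recomputes the row from the values in the table above.&lt;/p&gt;

```python
from statistics import mean

# Monolingual FLEURS WER (%) per system, in the table's row order:
# Italian, Russian, Portuguese, English, French, Dutch, German, Spanish.
# None marks a language with no result for that system.
wer = {
    "Voxtral Mini": [3.3, 5.2, 4.9, 12.4, 8.8, 8.2, 6.0, 3.1],
    "Deepgram Nova-3": [7.1, 10.2, 10.3, 10.2, 10.6, 13.3, 10.0, 5.9],
    "ElevenLabs Scribe v2": [2.0, 7.4, 3.9, 4.3, 5.5, 5.3, 3.9, 2.7],
    "AAI u3-rt-pro": [2.5, None, 3.9, 3.4, 4.1, None, 3.4, 2.4],
    "This Work": [4.3, 14.3, 9.2, 13.2, 9.9, 11.9, 11.0, 4.7],
}

for system, scores in wer.items():
    supported = [s for s in scores if s is not None]
    print(f"{system}: {mean(supported):.1f}% over {len(supported)} languages")
```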

&lt;h3&gt;
  
  
  Code-Switch Benchmark Summary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Voxtral Mini&lt;/th&gt;
&lt;th&gt;Deepgram Nova-3&lt;/th&gt;
&lt;th&gt;ElevenLabs Scribe v2&lt;/th&gt;
&lt;th&gt;AAI u3-rt-pro&lt;/th&gt;
&lt;th&gt;This Work&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;FLEURS monolingual WER (averaged)&lt;/td&gt;
&lt;td&gt;6.50%&lt;/td&gt;
&lt;td&gt;9.70%&lt;/td&gt;
&lt;td&gt;4.40%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;9.80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simulated inter-utterance code-switch (FLEURS)&lt;/td&gt;
&lt;td&gt;21.30%&lt;/td&gt;
&lt;td&gt;14.30%&lt;/td&gt;
&lt;td&gt;13.60%&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13.20%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-life intra-utterance code-switch (Miami)&lt;/td&gt;
&lt;td&gt;76.30%&lt;/td&gt;
&lt;td&gt;28.60%&lt;/td&gt;
&lt;td&gt;26.50%&lt;/td&gt;
&lt;td&gt;34.00%&lt;/td&gt;
&lt;td&gt;41.00%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forl8nov6yce4hl6tm3ft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forl8nov6yce4hl6tm3ft.png" alt="WERs on monolingual and code-switching datasets" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83m8jph05j2gg1niiu6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83m8jph05j2gg1niiu6e.png" alt="WERs on monolingual datasets (FLEURS)" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations and Future Outlook
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The intra-utterance gap:&lt;/strong&gt; Because the pipeline relies on VAD segment boundaries, rapid mid-sentence word mixing slips through. In practice, this kind of dense mixing is concentrated in a few established blends, such as Spanglish or Singlish (Singapore). A promising solution is to treat each blend as its own "new language" and serve that stream with a dedicated bilingual model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Monolingual constraints:&lt;/strong&gt; The overall accuracy is bounded by the maturity of the underlying open-source Zipformer models available for each language.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Case for Small, Specialized Models
&lt;/h3&gt;

&lt;p&gt;A single model that handles a large number of languages has to store far more linguistic knowledge, which translates into a higher parameter count. For local, on-device ASR, building one model that does everything can be inefficient.&lt;/p&gt;

&lt;p&gt;The results of this implementation suggest that the future of local multilingual ASR could lie in &lt;strong&gt;orchestrating small, hyper-specialized models via an intelligent routing layer&lt;/strong&gt;. Keeping the individual models small allows for a lightweight system that handles inter-utterance code-switching better than the larger open-source alternative tested here (Voxtral-Mini-4B, at 21.3% WER versus 13.2%).&lt;/p&gt;
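&lt;p&gt;As a rough illustration of what the routing decision could look like, the sketch below runs language identification on growing prefixes of a buffered segment and reports a switch as soon as one window is confident enough. The window sizes, the 0.9 threshold, and the &lt;code&gt;detect_switch&lt;/code&gt; and &lt;code&gt;toy_identify&lt;/code&gt; names are assumptions for this example, not values or functions from the open-source project.&lt;/p&gt;

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical language-ID callable: audio in, (language, confidence) out.
LanguageId = Callable[[List[float]], Tuple[str, float]]


def detect_switch(
    buffered: List[float],
    active_lang: str,
    identify: LanguageId,
    window_sizes: Tuple[int, ...] = (8_000, 16_000, 32_000),  # assumed sizes
    threshold: float = 0.9,                                   # assumed value
) -> Optional[str]:
    """Run language ID on expanding windows of the buffered audio and return
    the new language once a window is confident enough, else None."""
    for size in window_sizes:
        lang, confidence = identify(buffered[:size])
        if lang != active_lang and confidence >= threshold:
            return lang
        if size >= len(buffered):
            break  # the window already covers all buffered audio
    return None


def toy_identify(audio: List[float]) -> Tuple[str, float]:
    """Toy detector: always answers French, more confidently on longer audio."""
    return "fr", min(1.0, len(audio) / 20_000)


segment = [0.0] * 30_000
print(detect_switch(segment, "en", toy_identify))  # prints fr
print(detect_switch(segment, "fr", toy_identify))  # prints None
```

&lt;p&gt;The language returned here would then select which small monolingual model decodes the stream. Starting with short windows keeps the check cheap, and the larger windows are only needed when the short ones are not confident.&lt;/p&gt;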

&lt;p&gt;The &lt;a href="https://github.com/gladiaio/realtime-multilingual-asr-router" rel="noopener noreferrer"&gt;code is fully open source&lt;/a&gt;. Check out Bruno's &lt;a href="https://www.youtube.com/watch?v=zNds-EIDwWo" rel="noopener noreferrer"&gt;interactive demo&lt;/a&gt; to see the project in action.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is code-switching in automatic speech recognition?&lt;/strong&gt;&lt;br&gt;
Code-switching is when a speaker alternates between two or more languages. In real-time ASR it causes "language lag" — streaming systems keep outputting the previous language for several hundred milliseconds after a switch, producing garbled text and inaccurate timestamps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the difference between inter-utterance and intra-utterance code-switching?&lt;/strong&gt;&lt;br&gt;
Inter-utterance code-switching is when a speaker switches languages between sentences or distinct speech boundaries. Intra-utterance code-switching is when words from different languages are mixed inside the same sentence (e.g., Spanglish), with no acoustic pauses between the language flips.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does the Asynchronous Rollback Pipeline reduce language lag?&lt;/strong&gt;&lt;br&gt;
It transcribes immediately using the active monolingual ASR engine, runs language-identification checks on expanding audio windows in the background after each VAD speech boundary, and — when it detects a high-confidence switch — swaps to the new language model, rolls back the transcript to the segment boundary, and re-infers the buffered audio. Users only see wrong-language artifacts briefly in partials; final transcripts have clean boundaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can small, specialized ASR models outperform large multilingual models?&lt;/strong&gt;&lt;br&gt;
On inter-utterance code-switching, yes. The small-model ensemble reached ~13% WER, outperforming Deepgram Nova-3 (~14%) and the larger local Voxtral-Mini-4B (~21%). On intra-utterance code-switching (Miami), it reached ~41% WER — behind cloud APIs but still ahead of Voxtral's ~76%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are the limitations of a VAD-based code-switching pipeline?&lt;/strong&gt;&lt;br&gt;
Because it relies on VAD segment boundaries, rapid mid-sentence word mixing (such as Spanglish or Singlish) slips through, since switches happen faster than a VAD boundary can segment them. Accuracy is also bounded by the maturity of the available open-source Zipformer models for each language.&lt;/p&gt;

</description>
      <category>multilingual</category>
      <category>asr</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
