Mid-Command Language Switching Broke Our Voice Pipeline: Here Is What We Changed

#ai #webdev #machinelearning #i18n

A user dictates: "Vytvoř tabulku clients with columns name, email and datum schůzky."

That single sentence switches from Czech to English and back to Czech. Our speech-to-text model did not know what hit it.

We build Voice Tables, an AI workspace you control with your voice. You describe what you need (a CRM, a project tracker, an inventory) and it builds the tables, docs, and data for you. The pipeline is straightforward: microphone > Whisper transcription > LLM function calling > structured output. It works well when users stick to one language.

They don't.

The problem

Whisper accepts a language parameter. Set it to cs and it transcribes Czech accurately. Set it to en and English works great. But real users in multilingual environments don't switch languages at sentence boundaries. They switch mid-phrase.

A Czech user might say column names in English because that's how they think about CRM fields. A German freelancer says "Erstelle eine Tabelle project deadlines mit priority und Fälligkeitsdatum." The technical terms stay English; the glue words follow the speaker's native language.

When we forced a single language parameter, three things broke:

Hallucinated transcriptions. Whisper tried to map English words onto Czech phonemes and produced garbage. "Clients" became "kliends" or "klajnts": close enough to look plausible to the LLM, but wrong enough to fail lookups.
Lost column names. The LLM downstream received mangled input and either invented column names or silently dropped them. Users got tables missing the fields they asked for.
Degraded confidence. Users who hit this once stopped trusting voice input and fell back to typing. That kills the core value proposition.

What we tried

Attempt 1: Auto-detect language per request. We ran a lightweight language classifier on the audio before passing it to Whisper. Problem: a 3-second Czech sentence with two English words is still classified as Czech, so the English words still get mangled.

Attempt 2: Whisper without a language hint. Whisper can auto-detect, but it picks one language for the entire audio segment. Same problem, different wrapper.

Attempt 3: Segment-level detection. We split the audio into chunks at silence boundaries, classified each chunk, and transcribed with the matching language. This improved accuracy for sentences that switched at natural pauses, but most mid-command switches happen without a pause. "Vytvoř tabulku clients" is one continuous breath.
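
To make the limitation of Attempt 3 concrete, here is a minimal sketch of silence-boundary chunking over raw amplitude samples. The function name, the threshold, and the minimum pause length are assumptions for illustration, and a real implementation would more likely work on frame energy than on individual samples. The point it shows: a phrase spoken without a pause comes back as one chunk, so it still receives a single language label.

```python
from typing import List, Tuple


def split_on_silence(
    samples: List[float],
    threshold: float = 0.02,   # assumed amplitude below which a sample is "quiet"
    min_silence: int = 3,      # assumed number of quiet samples that counts as a pause
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of non-silent chunks (end is exclusive)."""
    chunks: List[Tuple[int, int]] = []
    start = None  # index where the current chunk began, or None if between chunks
    quiet = 0     # length of the current run of quiet samples inside a chunk

    for i, sample in enumerate(samples):
        if abs(sample) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence:
                # The pause is long enough: close the chunk before the quiet run.
                chunks.append((start, i - quiet + 1))
                start = None
                quiet = 0

    if start is not None:
        # Close the final chunk, trimming any short trailing quiet run.
        chunks.append((start, len(samples) - quiet))
    return chunks


# Two bursts separated by a pause become two chunks.
with_pause = split_on_silence([0.5, 0.5, 0.0, 0.0, 0.0, 0.5, 0.5])
# A continuous phrase with only a tiny dip stays one chunk.
no_pause = split_on_silence([0.5, 0.5, 0.0, 0.5, 0.5])
```

Each chunk would then be classified and transcribed on its own, which only helps when the language switch lines up with a pause.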

What actually worked

We stopped trying to solve it in the audio layer.

The fix was letting Whisper transcribe with its best guess, then using the LLM to clean up the output before acting on it. We added a normalization step between transcription and function calling:

Transcribe with auto-detect. Let Whisper pick whatever language it wants. Accept that the transcript will be messy.
Post-transcription normalization. Send the raw transcript to the LLM with a system prompt: "This is a voice command from a multilingual user. Extract the intent, entity names, and field names. Normalize technical terms (table names, column names) to their likely intended form. Return structured JSON."
Confidence scoring. The LLM returns a confidence score for each extracted entity. Below a threshold, we ask the user to confirm: "Did you mean a column called clients or klienti?"
Feedback loop. When users confirm or correct, we log the mapping. Over time, per-user patterns emerge, such as a user who always says "clients" in English even when speaking Czech. We feed these patterns back to the LLM as context.
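
The normalization and confidence steps above could be sketched roughly as follows. The system prompt is the one quoted in the list; everything else is an assumption for illustration: the JSON shape (`intent`, `entities`, `heard`, `confidence`), the 0.8 threshold, and the `call_llm` callable, which stands in for whatever LLM client you use. `fake_llm` is a hard-coded stub so the example runs without a network call.

```python
import json
from typing import Callable, Dict, List

SYSTEM_PROMPT = (
    "This is a voice command from a multilingual user. Extract the intent, "
    "entity names, and field names. Normalize technical terms (table names, "
    "column names) to their likely intended form. Return structured JSON."
)

CONFIRM_BELOW = 0.8  # assumed threshold; tune against your own data


def normalize(transcript: str, call_llm: Callable[[str, str], str]) -> Dict:
    """Send the raw transcript to the LLM and split out low-confidence entities."""
    parsed = json.loads(call_llm(SYSTEM_PROMPT, transcript))
    entities: List[Dict] = parsed.get("entities", [])
    # Entities without a score are treated as uncertain (confidence 0.0).
    uncertain = [e for e in entities if e.get("confidence", 0.0) < CONFIRM_BELOW]
    return {
        "intent": parsed.get("intent"),
        "entities": entities,
        "needs_confirmation": uncertain,
    }


def confirmation_question(entity: Dict) -> str:
    """Build the question shown to the user for one uncertain entity."""
    return f'Did you mean a column called {entity["name"]} or {entity["heard"]}?'


def fake_llm(system_prompt: str, transcript: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns a fixed response."""
    return json.dumps({
        "intent": "create_table",
        "entities": [
            {"name": "clients", "heard": "kliends", "confidence": 0.55},
            {"name": "name", "heard": "name", "confidence": 0.97},
            {"name": "email", "heard": "email", "confidence": 0.95},
        ],
    })


result = normalize("Vytvoř tabulku kliends with columns name, email", fake_llm)
questions = [confirmation_question(e) for e in result["needs_confirmation"]]
```

Only the uncertain entity triggers a question here; the other two go straight to function calling.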

The key insight: Whisper's job is to capture what was said. The LLM's job is to understand intent. Forcing Whisper to also understand intent by picking the right language was asking the wrong tool to do that job.
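
As a rough illustration of the feedback loop described above, the sketch below logs confirmed corrections per user and turns the repeated ones into context lines for the normalizer prompt. The class name, the in-memory storage, and the `min_count` cutoff are assumptions for illustration; a real system would persist this somewhere durable.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Dict, Tuple


class CorrectionLog:
    """In-memory log of (user, heard term) -> confirmed term counts."""

    def __init__(self) -> None:
        self._counts: DefaultDict[Tuple[str, str], Counter] = defaultdict(Counter)

    def record(self, user_id: str, heard: str, confirmed: str) -> None:
        """Store one confirmation or correction made by the user."""
        self._counts[(user_id, heard.casefold())][confirmed] += 1

    def hints_for(self, user_id: str, min_count: int = 2) -> Dict[str, str]:
        """Return heard -> confirmed mappings seen at least `min_count` times."""
        hints: Dict[str, str] = {}
        for (uid, heard), counter in self._counts.items():
            if uid != user_id:
                continue
            term, seen = counter.most_common(1)[0]
            if seen >= min_count:
                hints[heard] = term
        return hints

    def as_prompt_context(self, user_id: str) -> str:
        """Format the user's stable patterns as extra lines for the LLM prompt."""
        hints = self.hints_for(user_id)
        return "\n".join(
            f'The user says "{heard}" when they mean "{term}".'
            for heard, term in sorted(hints.items())
        )


log = CorrectionLog()
log.record("user-1", "klajnts", "clients")
log.record("user-1", "Klajnts", "clients")
log.record("user-1", "projekt", "project")  # seen once, so not yet a pattern
context = log.as_prompt_context("user-1")
```

Requiring a mapping to repeat before it becomes a hint keeps a single mis-click from biasing every later command.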

Results

After shipping the normalization layer, we tracked three things:

Transcription-to-action accuracy went from ~72% to ~94% for multilingual users.
Correction rate (user manually fixing a voice command result) dropped by roughly 60%.
Voice input retention (users who tried voice and kept using it after 7 days) improved across all non-English locales.

The latency cost is one extra LLM call per voice command, roughly 200-400 ms. Users don't notice because the table creation itself takes longer than that.

Edge cases we still fight

Proper nouns. A user says "Add a row for Müller GmbH" in an otherwise Czech sentence. The normalization layer sometimes over-corrects proper nouns into dictionary words. We partially solve this with an entity allowlist seeded from existing table data.
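
One way an entity allowlist like this could work is sketched below: if the raw transcribed entity already exists in the user's table data, keep it and ignore the normalizer's suggestion. The function names and the exact-match rule (after trimming and case folding) are assumptions for illustration; fuzzy matching would be a natural extension.

```python
from typing import Iterable, Set


def build_allowlist(existing_cells: Iterable[str]) -> Set[str]:
    """Collect known entity names from existing table data, case-insensitively."""
    return {cell.strip().casefold() for cell in existing_cells if cell.strip()}


def protect_proper_noun(raw: str, normalized: str, allowlist: Set[str]) -> str:
    """Prefer the raw entity when it is already known; otherwise accept the normalizer's form."""
    if raw.strip().casefold() in allowlist:
        return raw.strip()
    return normalized


allowlist = build_allowlist(["Müller GmbH", " Acme s.r.o. ", ""])
kept = protect_proper_noun("Müller GmbH", "Miller GmbH", allowlist)
fixed = protect_proper_noun("kliends", "clients", allowlist)
```

Here the known company name survives an over-correction, while an unknown garbled term still takes the normalized form.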

Numbers and dates. "Schůzka dvacátého June" mixes a Czech ordinal with an English month. Date parsing needs its own sub-pipeline that handles mixed-language temporal expressions.
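
A minimal sketch of what the core of such a sub-pipeline might look like: token-level lookups for Czech ordinal days and for month names in both languages. The function name is hypothetical and the ordinal table is deliberately partial; a real parser would also need years, relative dates, and disambiguation for tokens such as "may".

```python
import re
import unicodedata
from typing import Dict, Optional, Tuple


def _nfc_keys(table: Dict[str, int]) -> Dict[str, int]:
    """Normalize keys so composed and decomposed accents compare equal."""
    return {unicodedata.normalize("NFC", key): value for key, value in table.items()}


# Partial sample of Czech ordinal days in the genitive case.
CZECH_ORDINAL_DAYS = _nfc_keys({
    "prvního": 1, "druhého": 2, "třetího": 3, "pátého": 5,
    "desátého": 10, "dvacátého": 20, "třicátého": 30,
})

ENGLISH_MONTHS = ["january", "february", "march", "april", "may", "june", "july",
                  "august", "september", "october", "november", "december"]
CZECH_MONTHS = ["ledna", "února", "března", "dubna", "května", "června", "července",
                "srpna", "září", "října", "listopadu", "prosince"]
MONTHS = _nfc_keys({
    **{name: i + 1 for i, name in enumerate(ENGLISH_MONTHS)},
    **{name: i + 1 for i, name in enumerate(CZECH_MONTHS)},
})


def parse_day_month(text: str) -> Optional[Tuple[int, int]]:
    """Return (day, month) if both appear in the text in either language, else None."""
    day = month = None
    normalized = unicodedata.normalize("NFC", text).casefold()
    for token in re.findall(r"\w+", normalized):
        if day is None and token in CZECH_ORDINAL_DAYS:
            day = CZECH_ORDINAL_DAYS[token]
        elif day is None and token.isdigit() and 1 <= int(token) <= 31:
            day = int(token)
        elif month is None and token in MONTHS:
            month = MONTHS[token]
    return (day, month) if day is not None and month is not None else None


mixed = parse_day_month("Schůzka dvacátého June")
```

Keeping the two vocabularies in one lookup means the parser does not need to know which language each token came from.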

Homographs across languages. The Czech word "most" means "bridge." The English word "most" means something completely different. Without sentence-level context, the normalizer can pick the wrong one. We bias toward the user's primary language for common words and toward English for technical terms.
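
The bias rule in the last sentence can be expressed as a very small function. This is only a sketch: the function name and the hand-written set of technical terms are assumptions, and in practice that set would more plausibly come from the table schema or a curated glossary.

```python
from typing import Set


def preferred_language(token: str, primary_language: str, technical_terms: Set[str]) -> str:
    """Pick the language to interpret an ambiguous token in.

    Technical terms are read as English; everything else follows the
    user's primary language.
    """
    if token.casefold() in technical_terms:
        return "en"
    return primary_language


TECHNICAL_TERMS = {"email", "deadline", "priority", "clients"}  # illustrative only

common_word = preferred_language("most", "cs", TECHNICAL_TERMS)
field_name = preferred_language("Email", "cs", TECHNICAL_TERMS)
```

With a Czech primary language, "most" is read as the Czech word, while "Email" is treated as an English field name.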

Takeaway

If you are building a voice pipeline for a global user base, do not try to solve multilingual input at the speech-to-text layer. Accept messy transcriptions, normalize downstream, and build a feedback loop. In our experience, the LLM was better at resolving this ambiguity than any audio classifier we tried.

We are still iterating on this at Voice Tables. If you are working on something similar, we would be curious to compare notes.

Top comments (1)

Marcus Chen • Jul 1

We hit the same wall with code-switching, and what helped was giving up on a perfect transcript and moving the correction downstream. Two changes. First, we stopped trusting one language decision per utterance and kept per-word timestamps plus confidence from Whisper, so the low-confidence spans (almost always the English technical terms sitting inside a non-English sentence) got flagged instead of silently mangled. Second, our function-calling step already knows the target vocabulary, the column names and field types the user can actually create, so we let that step reconcile the garbled token against the known set rather than trust the ASR string. "kliends" resolves to "clients" when clients is one of a dozen legal values, and stays flagged when it is open text. It does not fix ASR; it just stops ASR errors from reaching the stage that cannot recover from them. The mid-phrase switch was the exact case that broke us.