Your Singapore team is on a call. Someone says: "Eh, so we need to deploy this feature lah, but the database query very slow lor. How we optimize? Boleh ask the backend team?"
That sentence has English, Malay, and Singlish grammar patterns all mixed together.
Try to transcribe it with Otter.ai or Google Meet's built-in captions. You'll get something like: "Eh, so we need to deploy this feature la, but the database query very slow or how we optimize..."
The words are garbled. The meaning is lost. And if you're using that transcript as your meeting notes, you've got a mess.
This is code-switching. And it's breaking almost every AI transcription tool on the market.
What Is Code-Switching?
Code-switching is when a bilingual or multilingual speaker mixes two or more languages in a single conversation, often within the same sentence.
It's not broken English. It's not bad grammar. It's actually a sign of linguistic sophistication. Bilingual speakers code-switch because it's the most efficient way to communicate with other bilingual people in their community.
It's normal in:
Singapore: Singlish (English + Malay + Mandarin + Tamil grammar patterns)
Malaysia: Bahasa Rojak (English + Malay + Cantonese)
Philippines: Taglish (Tagalog + English)
Thailand: Tinglish (Thai + English)
Indonesia: Bahasa Campur (Indonesian + English + regional languages)
Vietnam: Vietglish (Vietnamese + English)
Most multilingual professionals in Southeast Asia do this. It's how we communicate.
But here's the problem: most AI transcription models were trained largely on monolingual speech. They've seen very little of this.
Why AI Breaks on Code-Switching
Architectures vary, but a simplified speech-to-text pipeline works like this:
1. Tokenization: Breaking audio into small sound units (phonemes or similar)
2. Language identification: Figuring out which language it is
3. Pattern matching: Matching sound patterns to known words
4. Grammar correction: Using language models to fix errors
Code-switching breaks at step 2.
When you say a sentence in Singlish, the AI hears English words and Malay grammar patterns simultaneously. The language identification model gets confused. Is this English or Malay? The model has to pick one. It picks wrong. Everything downstream breaks.
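To make the "has to pick one" failure concrete, here is a minimal sketch of utterance-level language identification next to per-token tagging. The lexicons, the `tag_tokens` and `utterance_language` helpers, and the majority-vote rule are all assumptions made up for illustration. Real systems work on audio, not word lists.

```python
from collections import Counter

# Tiny hypothetical lexicons, for illustration only.
LEXICON = {
    "en": {"ask", "the", "backend", "team", "we", "need", "to", "deploy"},
    "ms": {"boleh", "tak", "sudah"},
}

def tag_tokens(tokens):
    """Label each token with the first language whose lexicon contains it, else 'unk'."""
    tags = []
    for tok in tokens:
        lang = next((name for name, words in LEXICON.items() if tok in words), "unk")
        tags.append(lang)
    return tags

def utterance_language(tokens):
    """Force a single label for the whole utterance by majority vote over known tokens."""
    counts = Counter(tag for tag in tag_tokens(tokens) if tag != "unk")
    return counts.most_common(1)[0][0] if counts else "unk"

tokens = "boleh ask the backend team".split()
print(utterance_language(tokens))  # en
print(tag_tokens(tokens))          # ['ms', 'en', 'en', 'en', 'en']
```

The single label says "English," so anything downstream that assumes English has no good way to handle "boleh." The per-token view keeps the information that the sentence is mixed.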
Here's a real example:
What was said: "Eh, cannot lah, server go down already."
What Google Transcribe hears: English (because the root words are English)
What Google Transcribe outputs: "Eh, cannot la server go down already" (it mangles the Singlish particle and drops the pause after it, because neither fits English grammar)
What the speaker meant: "No, we can't do that right now, because the server has crashed."
The meaning was there. But the model didn't preserve it.
The Training Data Problem
Why does this happen?
Because the datasets used to train these models don't include code-switched speech.
Google, OpenAI, and other major AI labs trained their speech models mostly on:
English audio (by far the largest share)
Mandarin audio (a smaller but still large share)
Spanish, French, German, etc. (smaller amounts each)
But Singlish? Taglish? Bahasa Campur?
There's almost no training data.
Why? Because code-switched speech is:
Hard to label (Is this English or Malay? Both? The labeler has to make a judgment call)
Not standardized (Singlish spoken in Singapore sounds different from the closely related Manglish spoken in Malaysia)
Seen as "low prestige" by researchers (academic datasets tend to focus on formal, monolingual speech)
Computationally expensive to include (mixed-language models are harder to train)
So the models just ignore it. And when they encounter it, they fail.
Why This Matters for SE Asian Teams
Imagine you're a regional PM. You record a standup with your Singapore, Bangkok, and Manila teams. Everyone code-switches naturally — it's how they communicate best.
You use Otter.ai to transcribe. You get:
60% accuracy on the English parts
40% accuracy on the code-switched parts
Completely butchered grammar and meaning in the mixed sentences
Useless meeting notes
So you spend 20 minutes manually fixing the transcript. Or you just don't use it.
Either way, you've lost the main benefit of transcription: saving time.
Scale this across a team. You're losing hours every week to bad transcription.
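Say each person cleans up three transcripts a week at 20 minutes each. That's about 50 hours per person per year. For a team of 10, that's 500 hours a year of wasted time. For a 50-person team, it's 2,500 hours.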
That's real money.
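Here's the back-of-the-envelope math behind those team totals, as a quick sketch. The 20 minutes per cleanup comes from above. The three cleanups per person per week and the 50 working weeks are assumptions, picked because they reproduce the 500-hour figure. Plug in your own numbers.

```python
MINUTES_PER_FIX = 20   # from the post: time to clean up one transcript
FIXES_PER_WEEK = 3     # assumption: transcripts each person cleans up per week
WORKING_WEEKS = 50     # assumption: working weeks per year

def wasted_hours_per_year(team_size):
    """Yearly hours a team spends fixing transcripts under the assumptions above."""
    per_person = MINUTES_PER_FIX * FIXES_PER_WEEK * WORKING_WEEKS / 60
    return team_size * per_person

print(wasted_hours_per_year(10))  # 500.0
print(wasted_hours_per_year(50))  # 2500.0
```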
The Current Solutions (And Why They Don't Work)
Option 1: Use a tool built for your specific language
There are some tools built for Singlish or Taglish specifically. But they only work if you speak one code-switched language. If your team spans multiple countries, you're out of luck.
Option 2: Make separate recordings in each language
Some teams do this. One person records the English parts, someone else records the Malay parts. This is absurd and doesn't reflect how people actually talk.
Option 3: Use Google Meet or Zoom's built-in captions
They're improving, but still 50-60% accurate on code-switched speech. Better than nothing, but still not usable for meeting notes.
Option 4: Hire someone to manually transcribe
Expensive and slow. But it works because humans can understand code-switching. A human transcriber gets the meaning right even if the grammar is mixed.
Option 5: Just don't transcribe
Most teams do this. They record meetings but don't transcribe them because the tools are so bad at code-switching.
How We're Solving This at BYSIK
When we started building BYSIK, we noticed this problem immediately.
Our first user was a Singapore startup with a team across SG, MY, and ID. They said: "Every transcription tool fails on our meetings because we speak Singlish. We just stopped using transcription."
That's when we realized: the existing solutions aren't built for Southeast Asia.
So we did something different:
1. We trained on code-switched speech
We built a dataset of actual code-switched audio from SE Asian professionals. Singlish, Taglish, Bahasa Campur, mixed Thai-English, all of it. We labeled it carefully and trained our speech-to-text model on this specific data.
2. We use language-agnostic embedding models
Instead of deciding "Is this English or Malay?" upfront, we use embedding models that represent words in a shared semantic space. "Lah" (a discourse particle with Malay and Hokkien roots) and "already" (which Singlish uses to mark a completed action) each carry meaning that depends on context, not on which language they came from. The model learns this.
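If "shared semantic space" sounds abstract, here's a toy sketch of the idea. The three-dimensional vectors below are hand-made for illustration and are not taken from our model or any real one. Real embeddings are learned and have hundreds of dimensions. The point is only that lookup works by similarity, with no language label involved.

```python
import math

# Hand-made 3-d vectors, purely illustrative.
EMBEDDINGS = {
    "can":    (0.90, 0.10, 0.00),  # English
    "boleh":  (0.85, 0.15, 0.05),  # Malay word for "can"
    "server": (0.00, 0.20, 0.95),  # English, unrelated meaning
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(word):
    """Return the other vocabulary word whose vector is most similar to `word`."""
    others = (w for w in EMBEDDINGS if w != word)
    return max(others, key=lambda w: cosine(EMBEDDINGS[word], EMBEDDINGS[w]))

print(nearest("boleh"))  # can
```

"Boleh" lands next to "can" because the vectors are close, even though nothing in the code says which language either word belongs to.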
3. We added dialect and accent handling
Singlish from Singapore sounds different from Manglish from Malaysia. Bangkok Thai-English is different from Northern Thai-English. We built models that handle these variations.
4. We preserve the original speech patterns
Instead of "correcting" code-switched speech into monolingual grammar, we keep it as spoken. If you said "cannot lah," the transcript says "cannot lah," not "cannot." The meaning is preserved.
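A rough sketch of the difference between "correcting" and preserving, using a toy cleanup step. The vocabulary, the particle list, and the `clean` function are hypothetical and only show the behavior. This is not how our pipeline is actually implemented.

```python
ENGLISH_VOCAB = {"cannot", "server", "go", "down", "already"}  # toy vocabulary
PARTICLES = {"lah", "lor", "leh", "meh"}                       # Singlish discourse particles

def clean(tokens, preserve_particles):
    """Drop out-of-vocabulary tokens, optionally keeping known particles."""
    keep = (ENGLISH_VOCAB | PARTICLES) if preserve_particles else ENGLISH_VOCAB
    return [tok for tok in tokens if tok in keep]

spoken = "cannot lah server go down already".split()
print(" ".join(clean(spoken, preserve_particles=False)))  # cannot server go down already
print(" ".join(clean(spoken, preserve_particles=True)))   # cannot lah server go down already
```

An English-only filter treats "lah" as noise and removes it. Treating particles as first-class vocabulary keeps the sentence the way it was said.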
The result? 85%+ accuracy on code-switched speech, compared to roughly 40-60% on existing tools.
Is it perfect? No. Code-switching is inherently ambiguous sometimes — even humans disagree on what was said. But it's accurate enough to be useful for meeting notes, which is the point.
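If you want to check accuracy claims on your own meetings, one common approach is word error rate (WER): the word-level edit distance between a human reference transcript and the tool's output, divided by the number of reference words. This post doesn't specify how its percentages were measured, so treat "accuracy = 1 - WER" as an assumption. The sketch below expects lowercase, punctuation-free strings and a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free if the words match
            ))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("cannot lah server go down already",
                      "cannot server go down already")
print(round(wer, 3))  # 0.167
```

One dropped particle out of six words gives a WER of about 0.167. Run the same function over a few minutes of real meeting audio with a hand-checked reference to compare tools.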
The Bigger Picture: Why SE Asia Keeps Getting Left Behind
This code-switching problem is a microcosm of a bigger issue.
AI is built in the US and China. The training data is English, Mandarin, and a few other "high-resource" languages.
Everything else — including all of Southeast Asia's languages and dialects — gets treated as an edge case.
So we have tools that work great for monolingual English speakers in San Francisco. They're okay for Mandarin speakers in Shanghai. But for a multilingual team in Singapore? For developers in Manila who code-switch naturally? For regional teams that communicate across languages?
The tools fail.
And the assumption is: "That's an edge case. Most of the world speaks monolingual English or Mandarin anyway."
But Southeast Asia is home to more than 650 million people. It's not an edge case. It's a massive market that's been ignored because building for multilingual, code-switched speech is harder than building for monolingual English.
What Needs to Change
For researchers: Start collecting and publishing datasets of code-switched speech. It's harder than monolingual data, but it's important. Southeast Asia's languages matter.
For AI companies: Stop pretending code-switching is an edge case. Train your models on it. Your users in SE Asia deserve tools that work.
For regional companies: If existing tools don't work for you, you don't have to accept it. Build your own, or support tools (like BYSIK) that are built for your market.
For SE Asian founders: This is an opportunity. The entire region is using transcription tools that don't work for how we actually speak. That's a problem worth solving.
The Practical Takeaway
If you're running a team in Southeast Asia and you've been frustrated with transcription accuracy, now you know why.
It's not your audio quality. It's not your accent. It's not that you're speaking "wrong."
It's that the tools were built for a different market. They were trained on monolingual speech. Code-switching breaks them.
There are solutions now. Tools are getting better at handling multilingual and code-switched speech. If you've given up on transcription because it didn't work, try again. The tech is catching up.
And if you find a tool that actually understands how your team talks — that preserves meaning instead of "correcting" your language — stick with it. You've found something rare.
P.S. If you want to geek out about the linguistics of code-switching, there's a whole field of research on it. Start here: Shana Poplack's 1980 paper "Sometimes I'll start a sentence in Spanish y termino en español: Toward a typology of code-switching." It's fascinating stuff.
And if you're building tools for Southeast Asia, feel free to reach out. I'm always interested in talking to founders who are solving regional problems instead of just copying the US.
Full disclosure: I founded BYSIK AI because of this exact problem. We're solving code-switching for transcription. But even if you use a different tool, I hope this helped you understand why transcription has been hard in SE Asia and why it's getting better.