This one still keeps me up at night.
A healthcare company in Bangladesh wanted a simple voice AI flow to confirm patient appointments. Call patients twenty-four hours before their visit, ask them to confirm, mark yes as confirmed, no as reschedule. Straightforward.
We deployed it. Day one, the system confirmed eight percent of appointments.
Eight percent.
The client called, furious. People were confirming, but the system wasn’t registering it. I pulled the call recordings, confident this had to be a bug.
It wasn’t.
What I Heard in the Recordings
The AI asked patients to confirm their appointment. Patients replied with “Haa, thik ache,” meaning yes, it’s fine. The AI asked again. Patients repeated it louder. The AI still didn’t understand and eventually transferred the call or hung up.
Other patients said “Hmm.” Others said “Achha.” Others repeated the appointment time. Every one of them was confirming.
The AI heard none of it.
I listened to fifty calls. Same pattern every time. Patients were confirming perfectly. The AI just couldn’t understand how they were saying yes.
The Problem I Didn’t See Coming
I had built confirmation logic around English yes and no. The system expected words like yes, yeah, yep, no, nope.
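In simplified form, it amounted to a keyword lookup. This is an illustrative sketch, not the production code:

```python
# Simplified sketch of the original English-only confirmation check.
AFFIRM = {"yes", "yeah", "yep", "sure", "okay"}
DECLINE = {"no", "nope"}

def classify_response(transcript: str) -> str:
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    if words & AFFIRM:
        return "confirmed"
    if words & DECLINE:
        return "reschedule"
    return "unclear"  # falls through to a retry, transfer, or hang-up

print(classify_response("Haa, thik ache"))  # -> "unclear"
```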
But Bangladeshi patients don’t always say yes the way English speakers do. Even in English conversations, affirmation looks different.
They say haa. They say achha. They say thik ache. They make affirmative sounds like hmm. They repeat the time to show acknowledgment. They say “3 PM, yes, 3 PM.”
And no wasn’t just no either. It was na. Parbo na, meaning I can’t. Shombhob na, meaning not possible. Or silence followed by an explanation.
I had built a system that only understood Western affirmation patterns in a country where people don’t naturally use them.
My First Bad Fix
I added Bangla words to the accepted list. Haa. Achha. Thik ache.
It helped a little, but not enough.
Real responses were messy. Mixed language. Filler sounds. Confirmations buried inside longer sentences. Indirect agreement like “Tomorrow 3 PM, I’ll be there.”
Keyword matching couldn’t handle that.
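Here is the same simplified matcher with the Bangla words bolted on. Indirect agreement never touches the list, so it still slips through:

```python
# Sketch: the same set lookup as before, now with Bangla keywords added.
AFFIRM = {"yes", "yeah", "yep", "haa", "achha", "thik", "ache"}

def confirms(transcript: str) -> bool:
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    return bool(words & AFFIRM)

print(confirms("Haa, thik ache"))                # True  - the easy case now works
print(confirms("Tomorrow 3 PM, I'll be there"))  # False - indirect agreement is still missed
```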
My Second Bad Fix
I added more keywords. Dozens of them.
That made things worse.
The system started confirming appointments people were trying to decline. It misread negations embedded in sentences. It marked yes when people said “yes, but…” and ignored the but. It crashed when it heard conflicting signals.
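The failure mode flipped: once enough keywords are in the list, anything containing an affirmative word gets marked confirmed, even when the rest of the sentence takes it back. A sketch of that false positive:

```python
# Sketch: a bigger keyword list with no sense of negation or context.
AFFIRM = {"yes", "yeah", "haa", "achha", "thik", "ache", "okay", "hmm"}

def confirms(transcript: str) -> bool:
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    return bool(words & AFFIRM)

# A decline with an embedded "yes" is scored as a confirmation.
print(confirms("Yes, but I can't come tomorrow"))  # True - wrong
```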
The client called again. This time angrier.
The Real Solution
I stopped trying to predict every possible word people might say.
Instead, I rebuilt the system around intent recognition.
The AI’s job was no longer to detect keywords. It was to understand intent.
Was the patient agreeing to attend, declining, unsure, or unclear?
That single shift changed everything.
Designing for Intent, Not Words
The system started analyzing the full response, not just individual tokens. Affirmative words, repetition of time, commitment phrases, and acknowledgment sounds were treated as confirmation. Negations, inability phrases, and reschedule requests were treated as declines. Questions and conditional language were treated as uncertainty.
Mixed English and Bangla stopped being a problem. “Yes korbo” was confirmation. “Parbo na tomorrow” was a decline. “Hmm, thik ache” was confirmation.
Context mattered more than vocabulary.
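As a rough sketch (not the production prompt), the intent-first version can be framed like this. The four labels mirror the outcomes above, and call_llm is a stand-in for whichever model the voice stack actually uses:

```python
# Rough sketch of intent-first classification. call_llm is a placeholder for
# the real model call; the labels mirror the four outcomes described above.
INTENT_PROMPT = """You are classifying a patient's reply to an appointment
confirmation call in Bangladesh. Replies mix Bangla and English, can be
indirect (repeating the time, "hmm", "achha"), and may contain negations
buried inside otherwise positive-sounding sentences.

Classify the reply as exactly one label:
CONFIRM - agrees to attend ("haa", "thik ache", "hmm", repeats the slot)
DECLINE - cannot attend or wants to reschedule ("na", "parbo na", "shombhob na")
UNSURE  - conditional or uncertain ("maybe", "I'll try", asks a question)
UNCLEAR - cannot tell

Reply: "{reply}"
Answer with the label only."""

VALID = {"CONFIRM", "DECLINE", "UNSURE", "UNCLEAR"}

def call_llm(prompt: str) -> str:
    """Stand-in for the real model call (provider not specified here)."""
    raise NotImplementedError

def classify_intent(reply: str) -> str:
    label = call_llm(INTENT_PROMPT.format(reply=reply)).strip().upper()
    return label if label in VALID else "UNCLEAR"

# Expected behavior: "Yes korbo" -> CONFIRM, "Parbo na tomorrow" -> DECLINE,
# "Hmm, thik ache" -> CONFIRM, "Tomorrow 10 AM... actually na, parbo na" -> DECLINE
```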
The Moment It Finally Worked
We replayed the same recordings through the new system.
“Haa, thik ache” became confirmed. “Hmm” became confirmed. “Tomorrow 10 AM… actually na, parbo na” became declined with a reschedule offer.
The AI stopped fighting patients and started understanding them.
What Changed in the Results
Before intent recognition, confirmation success was eight percent. Most calls ended in frustration. Almost every appointment required manual follow-up.
After intent recognition, confirmation success jumped above eighty percent. Call completion soared. Manual follow-ups dropped dramatically.
The client asked what changed.
I told them the truth. I stopped assuming how people say yes.
The Deeper Mistake I’d Made
I designed the system around my own communication habits.
When I confirm something, I say yes. So I assumed everyone does.
In Bangladesh, hmm is a valid yes. Achha means understood and agreed. Repeating information is confirmation. Silence followed by acknowledgment can mean tentative agreement.
None of that was in my original design because I’d never experienced it personally.
I had built an AI that understood me, not the users.
Lessons About Voice AI Across Cultures
Yes is not universal. Affirmation patterns vary widely by culture.
Code-switching is normal. People mix languages naturally, and your system must handle that.
Silence has meaning, but the meaning depends on context.
Testing with your own team is not testing. Real users will expose assumptions you didn’t know you had.
And most importantly, intent beats keywords every time.
The Follow-Up Problems That Appeared Next
Once confirmation worked, new issues surfaced. Date formats confused patients. Time formats confused them too. Doctor names were mispronounced so badly that patients didn’t recognize who they were seeing.
Each problem had the same root cause. I had designed for my defaults, not theirs.
The fixes were simple once I noticed them. Say month names. Use AM and PM. Train name pronunciation using local patterns.
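For the date and time fix, a small sketch of the idea: render slots the way they’ll be spoken, not the way they’re stored. The ISO input format here is an assumption for illustration, not the client’s actual schema:

```python
from datetime import datetime

def spoken_slot(iso_slot: str) -> str:
    """Turn a stored timestamp into something natural to say out loud:
    month name instead of a numeric date, 12-hour clock with AM/PM."""
    dt = datetime.fromisoformat(iso_slot)
    hour = dt.strftime("%I").lstrip("0")  # 12-hour clock, no leading zero
    return f"{dt.day} {dt.strftime('%B')}, {hour} {dt.strftime('%p')}"

print(spoken_slot("2025-03-14T15:00"))  # -> "14 March, 3 PM"
```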
The Principle That Changed How I Build Voice AI
Don’t design for how people should talk. Design for how they actually talk.
I wanted patients to say yes or no clearly. They said hmm, achha, and thik ache.
The moment I accepted that and built for their reality instead of mine, the system finally worked.
Your Turn
Have you ever built something that worked perfectly in your culture but failed in another? How do you handle cross-cultural communication in voice AI? What unexpected language patterns have surprised you?
Written by FARHAN HABIB FARAZ
Senior Prompt Engineer and Team Lead at PowerInAI
Building voice AI that understands when “hmm” means yes
Tags: voiceai, crosscultural, intentrecognition, localization, speechai, bangladesh