At RingFoods, we build AI voice agents that answer restaurant phone calls. One of the hardest engineering challenges we faced was making the system work seamlessly across multiple languages without requiring the caller to press 1 for English or 2 for Spanish.
Here is how we approached automatic language detection in a real-time voice pipeline, and what we learned along the way.
Why Language Matters for Restaurant Phone Systems
Restaurants in cities like Miami, Los Angeles, Toronto, and New York serve communities where English is not always the first language. A Thai restaurant in LA might get calls in Thai, Mandarin, Spanish, and English on the same afternoon. A pho shop in Montreal fields calls in French, Vietnamese, and English.
Traditional IVR menus that ask callers to select a language add friction. Callers hang up. The whole point of an AI phone agent is to reduce friction, not add it.
The Detection Pipeline
Our approach uses a three-stage detection system:
Stage 1: First Utterance Analysis
When a caller speaks their first sentence, the speech-to-text engine processes the audio and returns a language confidence score alongside the transcript. We use this initial signal as our primary language indicator.
# Simplified language detection from STT output
def detect_language(stt_result):
    primary_lang = stt_result.language_code        # e.g., "es-MX"
    confidence = stt_result.language_confidence    # 0.0 to 1.0
    if confidence > 0.85:
        return primary_lang, "high"
    elif confidence > 0.60:
        return primary_lang, "medium"
    else:
        return "en-US", "fallback"  # Default to English
The threshold matters. Setting it too low means you accidentally switch languages on a caller who just mumbled. Setting it too high means you miss legitimate non-English speakers.
Stage 2: Contextual Confirmation
If the confidence is in the medium range (0.60 to 0.85), we do not immediately commit. Instead, the agent responds in the detected language but keeps listening. If the next two utterances confirm the same language, we lock it in. If they contradict it, we fall back to English and ask the caller directly.
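The confirmation logic can be sketched as a small state machine. This is an illustrative sketch, not our production code; the class and method names here are hypothetical.

```python
class LanguageConfirmer:
    """Sketch of Stage 2: tentatively use the detected language,
    lock it in only after repeated confirmation."""

    def __init__(self, candidate_lang, confirmations_needed=2):
        self.candidate_lang = candidate_lang
        self.confirmations_needed = confirmations_needed
        self.confirmed_count = 0

    def observe(self, utterance_lang):
        """Return 'locked', 'pending', or 'fallback' after each utterance."""
        if utterance_lang == self.candidate_lang:
            self.confirmed_count += 1
            if self.confirmed_count >= self.confirmations_needed:
                return "locked"   # commit to the detected language
            return "pending"      # keep responding tentatively
        return "fallback"         # contradiction: revert to English and ask
```

In practice the "fallback" branch triggers a polite clarifying question rather than a silent switch.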
Stage 3: Mid-Call Switching
Some callers switch languages mid-call. A bilingual caller might start in Spanish, then switch to English when discussing a specific menu item. We handle this by monitoring language signals throughout the call but only triggering a full language switch if three consecutive utterances are in a different language.
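The three-consecutive-utterances rule amounts to a sliding window over recent language signals. A minimal sketch, with illustrative names:

```python
from collections import deque

class MidCallSwitcher:
    """Sketch of Stage 3: trigger a full language switch only when the
    last `window` utterances all agree on a different language."""

    def __init__(self, active_lang, window=3):
        self.active_lang = active_lang
        self.recent = deque(maxlen=window)

    def observe(self, utterance_lang):
        """Return True if this utterance triggers a language switch."""
        self.recent.append(utterance_lang)
        if (len(self.recent) == self.recent.maxlen
                and len(set(self.recent)) == 1
                and self.recent[0] != self.active_lang):
            self.active_lang = self.recent[0]
            return True
        return False
```

The window keeps a single code-switched sentence (a Spanish speaker naming an English menu item) from flipping the whole call.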
The Greeting Problem
The hardest part was the greeting. When the AI answers, what language should it greet in? We tried three approaches:
English default with quick pivot — Greet in English, detect the caller's language from their first response, then switch. This works but feels jarring for non-English speakers.
Restaurant-configured primary language — Let the restaurant owner set their primary language. A Mexican restaurant in Houston might set Spanish as the greeting language. Simple but inflexible.
Caller ID history — If we have seen this phone number before and know their preferred language, greet in that language. For new callers, use the restaurant's configured default. This is what we shipped.
def get_greeting_language(caller_id, restaurant_config):
    # Check caller history first
    caller_pref = db.get_caller_language(caller_id)
    if caller_pref:
        return caller_pref
    # Fall back to the restaurant's configured default
    return restaurant_config.primary_language or "en-US"
Handling Menu Items Across Languages
Menu items create a unique challenge. A caller speaking Spanish might say "quiero ordenar pad thai" — mixing Spanish with a Thai dish name. Our entity extraction needs to handle code-switching gracefully.
We solved this by maintaining a normalized menu item index that maps phonetic variations across languages to canonical menu items. The menu OCR system that processes restaurant menus also generates these cross-language mappings automatically.
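One way to picture the normalized index: every spoken variant is folded to a canonical form (lowercased, accents stripped) and mapped to a single menu item. The variant lists below are made-up examples; in our system the menu OCR pipeline generates them.

```python
import unicodedata

def normalize(text):
    """Lowercase, strip diacritics, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.lower().split())

# Map normalized variants to canonical menu items
MENU_INDEX = {
    normalize(variant): canonical
    for canonical, variants in {
        "Pad Thai": ["pad thai", "phat thai", "pad tai"],
        "Pho Bo": ["pho bo", "phở bò", "beef pho"],
    }.items()
    for variant in variants
}

def lookup_menu_item(spoken_phrase):
    """Return the canonical item for a spoken variant, or None."""
    return MENU_INDEX.get(normalize(spoken_phrase))
```

This lets "quiero ordenar pad thai" resolve to the same canonical item as "I'd like the phat thai", regardless of the surrounding language.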
Latency Considerations
Adding language detection adds latency. In voice applications, every millisecond matters. Our target is under 500ms total round-trip time from when the caller stops speaking to when the AI starts responding.
Language detection adds roughly 50-80ms to the pipeline. We mitigate this by:
- Running detection in parallel with intent classification
- Caching language decisions per caller session
- Pre-loading language-specific TTS models based on the restaurant's geographic region
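The first mitigation, running detection concurrently with intent classification, means the pipeline waits for the slower of the two stages rather than their sum. A hedged sketch using asyncio, with simulated latencies standing in for the real pipeline stages:

```python
import asyncio

# Stand-ins for the real pipeline stages; the sleeps simulate latency.
async def detect_language_async(audio):
    await asyncio.sleep(0.06)   # ~60ms detection
    return "es-MX"

async def classify_intent_async(transcript):
    await asyncio.sleep(0.08)   # ~80ms classification
    return "place_order"

async def process_utterance(audio, transcript):
    # Total wait is max(60ms, 80ms), not the 140ms sum
    lang, intent = await asyncio.gather(
        detect_language_async(audio),
        classify_intent_async(transcript),
    )
    return lang, intent
```

The same pattern applies whether the stages are local models or network calls, as long as neither stage needs the other's output.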
What We Got Wrong Initially
Our first implementation tried to support 15 languages simultaneously. We learned quickly that supporting fewer languages well is better than supporting many languages poorly. We narrowed to 6 languages for our initial launch based on actual call data from partner restaurants: English, Spanish, Mandarin, French, Vietnamese, and Korean.
We also learned that accent detection is not the same as language detection. A native Spanish speaker with a heavy accent speaking English should still get English responses. Our early models confused accent with language about 8 percent of the time. Fine-tuning on restaurant-specific call recordings brought this down to under 2 percent.
Results
After deploying multi-language support across our restaurant partners:
- Call completion rates increased 23 percent for restaurants in multilingual neighborhoods
- Average call duration decreased by 15 seconds (no more language selection menus)
- Customer satisfaction scores improved, particularly for non-English-speaking callers who previously had to struggle through English-only systems
Try It Yourself
If you are building voice AI applications and want to see how RingFoods handles multi-language calls in production, we offer a 30-day free trial with no credit card required. The system handles reservations, orders, and inquiries across multiple languages automatically.
Seung Hyun Park is an engineer at RingFoods, building AI voice agents for restaurants. Based in Vancouver, BC.