DEV Community

voipbin

Build a Multilingual AI Voice Bot: Auto-Detect and Respond in the Caller's Language

Your AI voice bot is live. It's handling calls. Then someone calls and speaks Spanish.

Your bot replies in English. The caller repeats themselves, louder. The bot still replies in English. The caller hangs up.

You just lost a customer because of a solvable engineering problem.

This post walks through how to build an AI voice bot that automatically detects the caller's language from their first utterance — and continues the entire conversation in that language. No upfront language selection menus. No "Press 1 for English, oprima 2 para español." Just a natural conversation.

What Makes This Hard

Language detection in voice bots isn't just about speech-to-text. There are three layers:

  1. STT (Speech-to-Text): The transcription engine must support the language
  2. LLM prompt: The model must be instructed to respond in the detected language
  3. TTS (Text-to-Speech): The voice synthesis must produce natural-sounding audio in that language

If any layer breaks, the experience degrades. If your TTS can only do English but your LLM responds in Spanish, callers hear phonetically mangled output. If your STT doesn't detect the language, you get garbled transcriptions.
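One way to make that failure mode explicit is to treat a language as usable only when every layer supports it. A minimal sketch of that guard — the support sets below are illustrative placeholders, not VoIPBin's actual coverage:

```python
# Languages each layer can handle -- illustrative sets; check your providers' docs.
STT_LANGS = {"en", "es", "fr", "de", "pt", "ja"}
LLM_LANGS = {"en", "es", "fr", "de", "pt", "ja", "ko", "it"}
TTS_LANGS = {"en", "es", "fr", "de", "pt"}

# A language is fully supported only if STT, LLM, and TTS all cover it.
FULLY_SUPPORTED = STT_LANGS & LLM_LANGS & TTS_LANGS

def usable_language(detected: str, fallback: str = "en") -> str:
    """Return the detected language if every layer supports it, else the fallback."""
    return detected if detected in FULLY_SUPPORTED else fallback

print(usable_language("es"))  # es: all three layers cover Spanish
print(usable_language("ja"))  # en: TTS in this sketch lacks Japanese, so fall back
```

Intersecting the sets up front means a gap in any single layer downgrades the call cleanly instead of producing mangled audio mid-conversation.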

VoIPBin handles the STT → TTS pipeline entirely — and returns the detected language alongside each transcript — which makes building this significantly simpler. Your server never touches audio.

The Architecture

Inbound call
    │
    ▼
VoIPBin receives audio
    │
    ▼
STT transcription + language detection
    │
    ▼
Webhook → your server (transcript + detected_language)
    │
    ▼
Language-specific LLM prompt
    │
    ▼
VoIPBin TTS in detected language
    │
    ▼
Caller hears response in their own language

Step 1: Get a VoIPBin Token

Sign up — no OTP, token returned immediately:

curl -s -X POST "https://api.voipbin.net/v1.0/auth/signup" \
  -H "Content-Type: application/json" \
  -d '{"username": "yourname", "password": "yourpass"}'
# Response: { "token": "eyJ..." }

Set your inbound number's webhook to your server. VoIPBin will POST a JSON payload on every caller utterance.

Step 2: Handle the Webhook and Detect Language

When VoIPBin sends a speech event, the payload includes the detected language code:

{
  "call_id": "abc-123",
  "transcript": "Hola, necesito ayuda con mi pedido",
  "language": "es",
  "confidence": 0.97,
  "caller": "+15551234567"
}

Use that language field to select the right LLM system prompt:

from flask import Flask, request, jsonify
import httpx, os

app = Flask(__name__)

VOIPBIN_API = "https://api.voipbin.net/v1.0"
VOIPBIN_TOKEN = os.environ["VOIPBIN_TOKEN"]

SYSTEM_PROMPTS = {
    "en": "You are a helpful customer support agent. Respond in English. Be concise.",
    "es": "Eres un agente de soporte al cliente. Responde en español. Sé conciso.",
    "fr": "Vous êtes un agent de support client. Répondez en français. Soyez concis.",
    "ja": "あなたはカスタマーサポートエージェントです。日本語で回答してください。簡潔に。",
    "de": "Sie sind ein Kundensupport-Agent. Antworten Sie auf Deutsch. Seien Sie präzise.",
    "pt": "Você é um agente de suporte. Responda em português. Seja conciso.",
}

# In-memory store — use Redis in production
conversations = {}

@app.route("/webhook/speech", methods=["POST"])
def handle_speech():
    data = request.json
    call_id = data["call_id"]
    transcript = data["transcript"]
    detected_language = data.get("language", "en")

    # Lock language on first utterance
    if call_id not in conversations:
        conversations[call_id] = {"language": detected_language, "history": []}

    language = conversations[call_id]["language"]
    history = conversations[call_id]["history"]

    ai_response = get_ai_response(transcript, language, history)

    history.append({"role": "user", "content": transcript})
    history.append({"role": "assistant", "content": ai_response})

    speak_response(call_id, ai_response, language)
    return jsonify({"status": "ok"})

def get_ai_response(transcript: str, language: str, history: list) -> str:
    system_prompt = SYSTEM_PROMPTS.get(language, SYSTEM_PROMPTS["en"])
    if language != "en":
        system_prompt += f"\n\nIMPORTANT: Always respond in {language} only."

    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)
    messages.append({"role": "user", "content": transcript})

    resp = httpx.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": "gpt-4o", "messages": messages, "max_tokens": 150}
    )
    return resp.json()["choices"][0]["message"]["content"]

def speak_response(call_id: str, text: str, language: str):
    httpx.post(
        f"{VOIPBIN_API}/calls/{call_id}/actions",
        headers={
            "Authorization": f"Bearer {VOIPBIN_TOKEN}",
            "Content-Type": "application/json"
        },
        json={"type": "speak", "text": text, "language": language, "voice": "neural"}
    )

Step 3: Why You Lock Language on the First Utterance

Do not re-detect language on every turn. Mid-call language flips happen because:

  • The caller quotes something in English while speaking French
  • Background noise causes a brief misdetection
  • A proper noun triggers a false positive

Lock it once, hold it for the whole call:

# WRONG: Re-detecting every turn
language = data.get("language", "en")  # Unstable

# RIGHT: Lock on first utterance
if call_id not in conversations:
    conversations[call_id] = {"language": detected_language, ...}
language = conversations[call_id]["language"]  # Stable
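The payload's `confidence` field can harden the lock further: only pin the language once detection is confident, and keep defaulting to English until then. A sketch of that policy — the 0.8 threshold is a tunable assumption, not a VoIPBin recommendation:

```python
CONFIDENCE_THRESHOLD = 0.8  # tunable; below this, don't trust the detection yet
conversations = {}

def lock_language(call_id: str, detected: str, confidence: float) -> str:
    """Pin the call's language on the first confident detection; default to English."""
    conv = conversations.setdefault(call_id, {"language": None, "history": []})
    # Lock only on the first confident detection; until then, default per turn.
    if conv["language"] is None and confidence >= CONFIDENCE_THRESHOLD:
        conv["language"] = detected
    return conv["language"] or "en"

print(lock_language("call-1", "es", 0.97))  # es: confident first utterance locks Spanish
print(lock_language("call-1", "fr", 0.99))  # es: already locked, later detections ignored
print(lock_language("call-2", "de", 0.40))  # en: low confidence, stay on the default
print(lock_language("call-2", "de", 0.95))  # de: first confident detection finally locks
```

This way a noisy first utterance doesn't permanently trap the call in the wrong language — the lock simply waits for the first detection worth trusting.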

Step 4: Remember Returning Callers

For callers who call back, skip detection entirely — you already know their language:

import redis

redis_client = redis.Redis()

def get_caller_language(phone: str, detected: str) -> str:
    saved = redis_client.get(f"lang:{phone}")
    if saved:
        return saved.decode()
    # First call — use detected, then save
    redis_client.setex(f"lang:{phone}", 86400 * 30, detected)
    return detected

Now returning callers get their preferred language before they even speak — the first TTS greeting can be in their language too.
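Here is the same caching flow with the Redis client stubbed by a dict, so the logic is runnable without a server — swap the stub for `redis.Redis()` in production:

```python
import time

class FakeRedis:
    """Dict-backed stand-in for the two redis-py calls the flow uses."""
    def __init__(self):
        self.store = {}  # key -> (value_bytes, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None or entry[1] < time.time():
            return None  # missing or expired, just like Redis
        return entry[0]

    def setex(self, key, ttl, value):
        self.store[key] = (value.encode(), time.time() + ttl)

redis_client = FakeRedis()

def get_caller_language(phone: str, detected: str) -> str:
    saved = redis_client.get(f"lang:{phone}")
    if saved:
        return saved.decode()
    # First call -- use detected, then save for 30 days
    redis_client.setex(f"lang:{phone}", 86400 * 30, detected)
    return detected

print(get_caller_language("+15551234567", "es"))  # es: first call, detection saved
print(get_caller_language("+15551234567", "fr"))  # es: saved preference beats new detection
```

Note the second call: once a preference is saved, a one-off misdetection on a later call can't flip it — which is exactly the stability you want for returning callers.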

Step 5: Graceful Fallback for Unsupported Languages

def get_language_prompt(language: str) -> str:
    if language in SYSTEM_PROMPTS:
        return SYSTEM_PROMPTS[language]
    # Unknown language — ask LLM to try, but offer fallback
    return (
        f"The caller is speaking {language}. "
        "If you know this language, respond in it. "
        "Otherwise, politely explain in English that you currently support "
        "English, Spanish, French, German, Portuguese, and Japanese."
    )
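One refinement worth pairing with the fallback: detection engines often return regional tags like `es-MX` or `pt-BR` rather than bare codes. Normalizing to the base language before the lookup keeps the prompt table small. A sketch, with a trimmed-down prompt table and the assumption that your STT emits BCP-47-style tags:

```python
SYSTEM_PROMPTS = {
    "en": "You are a helpful customer support agent. Respond in English. Be concise.",
    "es": "Eres un agente de soporte al cliente. Responde en español. Sé conciso.",
    "pt": "Você é um agente de suporte. Responda em português. Seja conciso.",
}

def normalize_language(tag: str) -> str:
    """Reduce a BCP-47-style tag (es-MX, pt_BR) to its base language code."""
    return tag.replace("_", "-").split("-")[0].lower()

def get_language_prompt(language: str) -> str:
    base = normalize_language(language)
    if base in SYSTEM_PROMPTS:
        return SYSTEM_PROMPTS[base]
    # Unknown language -- ask the LLM to try, but offer an explicit fallback.
    return (
        f"The caller is speaking {language}. "
        "If you know this language, respond in it. "
        "Otherwise, politely explain in English which languages you support."
    )

print(normalize_language("es-MX"))  # es
print(normalize_language("pt_BR"))  # pt
```

Without the normalization step, a caller detected as `es-MX` would silently fall through to the unknown-language branch even though you have a perfectly good Spanish prompt.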

What the Caller Experiences

Without this system:

🤖 Bot: "Hello! How can I help you today?"
👤 Caller: "Necesito ayuda con mi pedido número 4521"
🤖 Bot: "I'm sorry, could you repeat that in English?"
👤 Caller: [hangs up]

With this system:

🤖 Bot: "Hello! How can I help you today?"
👤 Caller: "Necesito ayuda con mi pedido número 4521"
🤖 Bot: "Claro, déjame revisar el pedido 4521. ¿Puede confirmar su nombre?"
👤 Caller: [stays on the line, gets helped]

The switch is invisible. No menu. No friction. Just a bot that listens.

Run It Locally

pip install flask httpx redis
# Single worker: the conversations dict is per-process.
# Move call state to Redis before scaling to -w 4.
gunicorn -w 1 -b 0.0.0.0:8080 app:app

# For local testing with ngrok:
ngrok http 8080
# Set VoIPBin webhook → https://your-ngrok-url/webhook/speech

What You Get

  • Zero menu friction — no language selection prompts
  • Stable conversations — language locked on first utterance
  • Returning caller memory — preferred language saved in Redis
  • No audio code — your server handles text only; VoIPBin owns the audio
  • Global reach — any language your STT and LLM support

The infrastructure complexity (codecs, RTP, STT, TTS) lives inside VoIPBin. Your server implements only the logic. That's the clean separation that keeps this build under 100 lines of Python.


Try free at voipbin.net — token returned on signup, no verification step.

MCP server for Claude / Cursor: uvx voipbin-mcp

Golang SDK: go get github.com/voipbin/voipbin-go
