I implemented complete linguistic intelligence for a multi-tenant RAG engine: heuristic language detection (ES/EN/PT) with zero latency, configurable priority chain, injection into LLM prompts, content rules (BLOCK/REDIRECT/FILTER) for moderation, and browser_lang from the widget. All without external APIs, in ~90 lines of code.
The Multilingual Problem in RAG
Most RAG tutorials assume a single language. In production you'll run into:
- Queries in English against documents in Spanish (or vice versa)
- Widgets embedded on Brazilian sites receiving Portuguese
- The LLM responds in the context's language instead of the user's language
Common "solutions" are detection APIs (Google, AWS Comprehend) that add latency and cost, libraries like langdetect/fasttext that are heavy dependencies for 3 languages, or simply ignoring the problem.
I needed something with zero latency, no dependencies, and configurable per Knowledge Base.
The Solution: Heuristic Detection + Priority Chain
General Approach
Instead of using a full statistical model, I attack the problem in layers:
```
User query
     │
     ▼
┌──────────────────────┐
│ 1. Fixed override?   │──→ If admin forced "en": always use "en"
└──────┬───────────────┘
       │ No override
       ▼
┌──────────────────────┐
│ 2. Heuristic         │──→ Count ES/EN/PT keywords
└──────┬───────────────┘
       │ Score < threshold
       ▼
┌──────────────────────┐
│ 3. Browser lang?     │──→ Browser language (widget)
└──────┬───────────────┘
       │ Not available
       ▼
Default: "es"
```
The Word Lists
The heart of detection is three sets of common words per language:
```python
import re

_PUNCT_RE = re.compile(r"[^\w\s]", re.UNICODE)

_EN_WORDS = {
    "how", "what", "where", "when", "why", "who", "which", "can", "could",
    "would", "should", "does", "do", "is", "are", "the", "this", "that",
    "help", "please", "tell", "explain", "show", "need", "want", "have",
    "about", "from", "with", "your", "they", "there", "their", "been",
    "just", "also", "very", "some", "any", "other", "than", "into",
}

_PT_WORDS = {
    "como", "onde", "quando", "porque", "quem", "qual", "pode", "poderia",
    "esta", "são", "isso", "isto", "ajuda", "ajudar", "mostre", "explique",
    "você", "vocês", "obrigado", "obrigada", "preciso", "quero", "tenho",
    "sobre", "para", "com", "seu", "sua", "eles", "também", "muito",
    "algum", "outro", "mais", "ainda", "aqui", "depois", "antes", "entre",
    "posso", "gostaria", "fazer", "dizer", "favor", "bom", "boa", "dia",
    "noite", "olá", "sim", "não", "bem", "tudo",
}

_ES_WORDS = {
    "qué", "cómo", "dónde", "cuándo", "cuál", "quién", "puedo", "podría",
    "necesito", "quiero", "tengo", "también", "aquí", "después", "ahora",
    "entonces", "pero", "sino", "aunque", "desde", "hasta", "hacia",
    "según", "ayuda", "explicar", "mostrar", "buscar", "hola", "gracias",
    "información", "pregunta", "respuesta", "por", "favor",
}
```
Key decisions in the lists:
- English focuses on function words (the, is, are) that almost never appear in Spanish or Portuguese.
- Portuguese includes key differentiators vs. Spanish: você, não, obrigado, tudo.
- Spanish uses accents where possible (qué, cómo, dónde); a user who types with accents is almost certainly writing in Spanish.
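Given the sets above, a quick sanity check shows how little they overlap:

```python
print(_EN_WORDS & _ES_WORDS)  # set()
print(_EN_WORDS & _PT_WORDS)  # set()
print(_ES_WORDS & _PT_WORDS)  # {'favor'}
```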
The Detection Function
```python
def detect_language(query: str) -> str:
    """Detect query language. Returns 'es', 'en', or 'pt'. Default 'es'."""
    q = _PUNCT_RE.sub("", query.lower().strip())
    words = set(q.split())
    scores = {
        "en": len(words & _EN_WORDS),
        "pt": len(words & _PT_WORDS),
        "es": len(words & _ES_WORDS),
    }
    best = max(scores, key=scores.get)
    if scores[best] >= 2:
        return best
    return "es"
```
Why threshold of 2? With a single matching word, false positive risk is high. "como" exists in both Spanish and Portuguese. "para" too. But if we find 2+ words from one language, the probability of being correct jumps dramatically.
Why default "es"? My primary use case is Latin America. If I can't detect with confidence, Spanish is the safest bet. This is configurable — you can change the default based on your market.
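A few illustrative calls, with hypothetical queries:

```python
detect_language("how do I reset my password")  # "en": matches "how" and "do"
detect_language("não consigo fazer login")     # "pt": matches "não" and "fazer"
detect_language("como funciona")               # "es": one PT match, below threshold -> default
```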
The Priority Chain: resolve_language()
Heuristic detection is just one layer. The real function that decides the language is resolve_language():
```python
_LANG_INSTRUCTIONS = {
    "en": "Respond entirely in English.",
    "pt": "Responda inteiramente em português.",
    "es": "Responde completamente en español.",
}


def resolve_language(
    query: str,
    auto_detect: bool = True,
    override: str | None = None,
    browser_lang: str | None = None,
) -> str:
    """
    Resolve the final language for a query.
    Priority: override > heuristic detection > browser fallback > 'es'.
    """
    # 1. Admin override: ignore everything, force language
    if override and override in ("es", "en", "pt"):
        return override

    # 2. Heuristic: analyze the text
    if auto_detect:
        detected = detect_language(query)
        # If detected something other than default, trust it
        if detected != "es" or not browser_lang:
            return detected

    # 3. Browser lang: user's browser language
    if browser_lang:
        bl = browser_lang.lower()[:2]
        if bl in ("en", "pt", "es"):
            return bl

    # 4. Default
    return "es"
```
Why This Priority?
| Level | Source | When it wins |
|---|---|---|
| 1 | Admin override | Always. If the admin says "this KB is in English", it's respected |
| 2 | Heuristic | If it detects EN or PT with confidence (score >= 2) |
| 3 | Browser lang | If heuristic couldn't decide (returned "es" by default) |
| 4 | Default "es" | Last resort |
The subtle trick: if the heuristic returns "es", it could be actual Spanish OR "couldn't detect". In that case, we give browser_lang a chance. If the user's browser is in Portuguese and the query is ambiguous, it's probably Portuguese.
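To make the table concrete, a few hypothetical calls:

```python
resolve_language("how do I reset my password", override="en")  # "en": override wins
resolve_language("how do I reset my password")                 # "en": heuristic is confident
resolve_language("login", browser_lang="pt-BR")                # "pt": heuristic can't decide, browser wins
resolve_language("login")                                      # "es": last-resort default
```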
Usage in Chat vs Widget
```python
# Chat: no browser_lang
detected_lang = resolve_language(
    data.content,
    auto_detect=rag_config.language_auto_detect,
    override=rag_config.language_override,
)

# Widget: includes browser_lang as additional fallback
detected_lang = resolve_language(
    data.message,
    auto_detect=rag_config.language_auto_detect,
    override=rag_config.language_override,
    browser_lang=data.browser_lang,
)
lang_hint = get_language_instruction(detected_lang)
```
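The snippets call get_language_instruction, which isn't shown above; a minimal sketch consistent with the _LANG_INSTRUCTIONS dict would be (the fallback for unknown codes is my assumption):

```python
def get_language_instruction(lang: str) -> str:
    # Assumption: unknown codes fall back to the Spanish instruction
    return _LANG_INSTRUCTIONS.get(lang, _LANG_INSTRUCTIONS["es"])
```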
The widget captures the browser language and sends it with each request:
```javascript
var body = {
  message: text,
  session_token: sessionToken,
  browser_lang: widgetLang, // navigator.language || "es"
};
```
Injecting Language into the LLM System Prompt
Detecting the language isn't enough — you need to force the LLM to respond in that language. This is done by injecting an explicit instruction at the end of the system prompt:
```python
def _build_messages(query, sources, ..., language_hint=None):
    system_message = identity_prefix + mode_prefix + _build_system_prompt(context)
    if language_hint:
        system_message += f"\n\nIMPORTANT: {language_hint}"
    messages = [{"role": "system", "content": system_message}]
    messages.append({"role": "user", "content": query})
    return messages
```
Where language_hint is one of:
```python
_LANG_INSTRUCTIONS = {
    "en": "Respond entirely in English.",
    "pt": "Responda inteiramente em português.",
    "es": "Responde completamente en español.",
}
```
Why at the end of the system prompt? LLMs tend to give the most weight to instructions at the beginning and end of a prompt (primacy and recency effects). Putting the language instruction at the end maximizes the probability of compliance, even when the document context is in a different language.
Why "IMPORTANT:"? For the same reason. LLMs respond better to instructions marked as important. Without this prefix, Llama 3.3 sometimes ignored the language instruction when all context was in another language.
Multilingual Embeddings: The Silent Hero
All of this works thanks to an embedding model that understands multiple languages in the same vector space:
```python
# config.py
embedding_model: str = "paraphrase-multilingual-MiniLM-L12-v2"
```
paraphrase-multilingual-MiniLM-L12-v2 generates 384-dimensional vectors and supports 50+ languages. This means:
- "How do I reset my password?" (English)
- "¿Cómo reseteo mi contraseña?" (Spanish)
- "Como redefinir minha senha?" (Portuguese)
All three produce embeddings that sit close together in vector space, despite being in different languages.
Without multilingual embeddings, an English query would never find Spanish documents. With this model, the user asks in English, pgvector finds Spanish chunks (because the meaning is close), the cross-encoder confirms relevance, and the LLM responds in English. No intermediate translation.
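You can verify this yourself with sentence-transformers; a minimal sketch, assuming the package is installed (exact similarity values will vary):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
queries = [
    "How do I reset my password?",
    "¿Cómo reseteo mi contraseña?",
    "Como redefinir minha senha?",
]
embeddings = model.encode(queries, normalize_embeddings=True)
# Off-diagonal cosine similarities stay high despite the language mix
print(util.cos_sim(embeddings, embeddings))
```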
The reranker is also multilingual (cross-encoder/mmarco-mMiniLMv2-L12-H384-v1, trained on mMARCO, 14 languages). Crucial because it evaluates query + chunk together — if they're in different languages, it needs to understand both.
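The same library exposes the cross-encoder; a sketch under the same assumptions:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")
# English query scored directly against a Spanish chunk
scores = reranker.predict([
    ("How do I reset my password?",
     "Para restablecer su contraseña, vaya a Configuración > Seguridad."),
])
print(scores)  # a single relevance score
```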
BM25: The Hardcoded tsvector Limitation
There's an elephant in the room. My BM25 search uses PostgreSQL tsvector with 'spanish' hardcoded:
```sql
SELECT c.id, c.content, d.filename AS document_name,
       ts_rank_cd(c.search_vector, plainto_tsquery('spanish', :query)) AS score
FROM chunks c
JOIN documents d ON d.id = c.document_id
WHERE c.knowledge_base_id IN (:kb_ids)
  AND c.search_vector @@ plainto_tsquery('spanish', :query)
ORDER BY score DESC
LIMIT :top_k
```
The problem: if the user asks in English, plainto_tsquery('spanish', 'how do I reset') won't tokenize English words correctly. Spanish stemming converts "reset" differently than English stemming.
Why I Didn't Change It
- Vector search compensates: BM25 is 30% of hybrid search. If it fails for cross-language queries, vector search (70%) still works perfectly thanks to multilingual embeddings.
- The OR fallback helps: when the AND query finds no results, a fallback to OR (to_tsquery with |) is more permissive and rescues partial matches.
- Complexity vs. benefit: a dynamic tsvector per language would require multiple search_vector columns or on-the-fly regeneration. The marginal improvement doesn't justify the cost when vector search is the primary component.
This is a known and documented limitation. If your use case has 80%+ queries in English with English documents, you should change 'spanish' to 'english' — or better yet, to 'simple' which does basic tokenization without language-specific stemming.
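If I ever make the dictionary dynamic, the mapping itself is trivial. A hypothetical sketch (these names don't exist in the codebase):

```python
# Hypothetical helper: map the detected language to a PostgreSQL
# text-search dictionary for plainto_tsquery
_TS_DICTS = {"es": "spanish", "en": "english", "pt": "portuguese"}

def ts_dictionary(lang: str) -> str:
    # 'simple' tokenizes without stemming: the safe cross-language fallback
    return _TS_DICTS.get(lang, "simple")
```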
Content Rules: Moderation Without LLM
Beyond detecting language, I implemented a content rules system that acts before the RAG pipeline. Three types:
BLOCK: Stop the Query
```python
def evaluate_block_redirect(
    query: str, rules: list[ContentRule]
) -> dict | None:
    q_lower = query.lower()
    for rule in rules:
        if not rule.enabled or rule.type == "filter":
            continue
        for trigger in rule.triggers:
            if trigger.lower() in q_lower:
                return {"type": rule.type, "response": rule.response}
    return None
```
If a BLOCK trigger matches, the user receives the configured response and the RAG pipeline never executes. Zero tokens, zero LLM latency.
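For example, a hypothetical block rule (ContentRule is the Pydantic model shown later; the trigger and response are made up):

```python
rules = [ContentRule(
    type="block",
    triggers=["competitor x"],
    response="I can only answer questions about our own product.",
)]

evaluate_block_redirect("What do you think about Competitor X?", rules)
# -> {"type": "block", "response": "I can only answer questions about our own product."}
```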
REDIRECT and FILTER
REDIRECT uses the same mechanism as BLOCK but to redirect ("For billing inquiries, contact support@company.com"). FILTER is different — it acts post-retrieval, removing chunks before sending them to the LLM:
```python
def get_filter_terms(rules: list[ContentRule]) -> list[str]:
    terms = []
    for rule in rules:
        if rule.enabled and rule.type == "filter":
            terms.extend(rule.triggers)
    return [t.lower() for t in terms]


def filter_chunks(sources: list[dict], filter_terms: list[str]) -> list[dict]:
    if not filter_terms:
        return sources
    filtered = []
    for source in sources:
        content_lower = source.get("content", "").lower()
        if not any(term in content_lower for term in filter_terms):
            filtered.append(source)
    return filtered
```
Useful when your documents have sections you don't want the LLM to use as context (internal pricing, confidential information in partially public documents).
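Putting the two helpers together, a hypothetical filter rule:

```python
# Hypothetical rule: keep internal pricing out of the LLM context
rules = [ContentRule(type="filter", triggers=["internal pricing"])]
sources = [
    {"content": "Public feature overview for the API."},
    {"content": "Internal pricing: enterprise tier margins..."},
]
filter_chunks(sources, get_filter_terms(rules))
# -> only the first chunk survives
```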
The Full Flow
```python
# 1. Content rules check (BLOCK / REDIRECT), before everything
if rag_config.content_rules:
    rule_match = evaluate_block_redirect(data.content, rag_config.content_rules)
    if rule_match:
        yield {"event": "content_blocked", "data": json.dumps({
            "type": rule_match["type"],
            "response": rule_match["response"],
        })}
        return

# 2. Normal RAG pipeline (vector + BM25 + rerank + MMR)
sources = await search_chunks(db, query, kb_ids, rag_config=rag_config)

# 3. Apply FILTER content rules, after retrieval and before the LLM
if rag_config.content_rules:
    blocked_terms = get_filter_terms(rag_config.content_rules)
    if blocked_terms:
        sources = filter_chunks(sources, blocked_terms)

# 4. LLM with language hint
response = stream_chat_response(query, sources, language_hint=lang_hint)
```
Everything is stored in each Knowledge Base's rag_config JSONB — no separate table, no migrations:
```python
from typing import Literal

from pydantic import BaseModel, Field


class ContentRule(BaseModel):
    type: Literal["block", "redirect", "filter"]
    triggers: list[str] = Field(default_factory=list)
    response: str = Field(default="", max_length=500)
    enabled: bool = True


class RAGConfig(BaseModel):
    # ... retrieval, llm, processing configs ...
    language_auto_detect: bool = Field(default=True)
    language_override: str | None = Field(default=None, pattern=r"^(es|en|pt)$")
    content_rules: list[ContentRule] = Field(default_factory=list)
```
Each KB has its own detection, its own override, and its own content rules.
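As a sketch of how a KB's config validates (assuming the elided retrieval/LLM fields all have defaults, which the snippet above doesn't show):

```python
cfg = RAGConfig(
    language_auto_detect=True,
    language_override=None,
    content_rules=[
        ContentRule(
            type="redirect",
            triggers=["billing", "invoice"],
            response="For billing inquiries, contact support@company.com",
        ),
    ],
)
```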
Frontend and Tracking
In the RAG configuration dialog, the admin has an auto-detection toggle, forced language selector, and inline content rules CRUD:
```typescript
const languageAutoDetect = ref(true)
const languageOverride = ref('')
const contentRules = reactive<ContentRule[]>([])

// On save
payload.language_auto_detect = languageAutoDetect.value
payload.language_override = languageOverride.value || null
payload.content_rules = contentRules.map(r => ({ ...toRaw(r) }))
```
If the override is active, the auto-detection toggle is visually disabled — no point detecting if you've already forced a language.
Every query is logged with the detected language (migration 027: detected_language VARCHAR(5) on usage_logs), which allows analyzing language distribution per KB and detecting failure patterns.
Edge Cases and Limitations
1. Very Short Queries
"Reset" — is it English or Spanish (used as an anglicism)? With a single word, there's not enough context. The 2-word minimum threshold protects against this, but it means 1-2 word queries always return the default.
2. Spanglish and Code-Switching
"Necesito help con el login" — tie between ES and EN. Since the score doesn't reach 2 for English, it returns "es". Correct, but not for the ideal reasons.
3. Portuguese vs Spanish
Many words are identical ("como", "para", "sobre"). Differentiation depends on exclusive words like você, não, obrigado. Without them, a Brazilian user may be detected as Spanish-speaking.
4. Pure Technical Queries
"HTTP 403 POST /api/users" — score 0 in all languages, returns default. For queries without natural words, language matters less.
Numbers
| Aspect | Value |
|---|---|
| Detection latency | ~0.01ms (set intersection) |
| Word lists total | ~130 words (42 EN + 53 PT + 34 ES) |
| Confidence threshold | 2 words minimum |
| Languages supported | 3 (ES, EN, PT) |
| Multilingual embeddings | 50+ languages (384d) |
| Multilingual cross-encoder | 14 languages |
| Content rules per KB | Unlimited |
| Total overhead | < 1ms per request |
Lessons Learned
1. You Don't Need an ML Model to Detect 3 Languages
For a small set of languages, word lists + scoring is absurdly effective. An ML model needs to be loaded in memory, has cold start, and its accuracy for short queries (5-10 words) isn't significantly better than a well-calibrated heuristic approach.
2. The Priority Chain Is More Important Than Detection
The resolve_language() function is more valuable than detect_language(). Being able to combine admin override + automatic detection + browser signal in a configurable chain covers 99% of cases. Heuristic detection alone would cover maybe 85%.
3. browser_lang Is Underestimated
In embedded widgets, the browser language is an extremely strong signal. If the browser is set to pt-BR and the query is ambiguous, the user is almost certainly Brazilian. Adding this field to the widget request was one line of code that significantly improved the experience for short or ambiguous queries.
4. Content Rules Are More Useful Than Expected
I started implementing content rules as "nice to have" for basic moderation. In practice, admins use them creatively: redirect billing questions to the right email, block queries about competitors, filter internal sections from documents. They're a control layer that bypasses the LLM.
5. Multilingual Embeddings Do 80% of the Work
The uncomfortable truth is that if you use paraphrase-multilingual-MiniLM-L12-v2 for embeddings, cross-language retrieval works pretty well without doing anything else. Language detection is primarily for controlling the response language of the LLM, not retrieval.
6. Logging the Detected Language Is Essential
Without the detected_language field in usage_logs, I'd be guessing if detection works well. With the data, I can see patterns: "15% of queries to the Brazil widget are detected as Spanish" tells me I need to expand the Portuguese word lists.
What's Next
- Character n-gram detection: More robust for short queries than word lists
- Dynamic tsvector: Choose the dictionary based on the detected language ('english', 'portuguese', 'simple')
- More languages: French and German only need new word lists
- Dynamic stop words: Currently Spanish only; should adapt to the detected language
Conclusion
Multilingual support in RAG doesn't need to be complicated or expensive. Multilingual embeddings as the foundation, heuristic detection in ~90 lines, and a well-designed priority chain cover 99% of cases for ES/EN/PT.
What matters isn't the detection itself — it's the architecture: configurable per KB, with reasonable fallbacks, that logs its decisions for improvement, and where content rules give admins control without going through the LLM. Sometimes the simplest solution is the right one.
If you work with multilingual RAG and found other edge cases, drop a comment. And if this article was useful, a like helps it reach more people.