I implemented complete linguistic intelligence for a multi-tenant RAG engine: heuristic language detection (ES/EN/PT) with zero latency, configurable priority chain, injection into LLM prompts, content rules (BLOCK/REDIRECT/FILTER) for moderation, and browser_lang from the widget. All without external APIs, in ~90 lines of code.
The Multilingual Problem in RAG
Most RAG tutorials assume a single language. In production you'll run into:
- Queries in English against documents in Spanish (or vice versa)
- Widgets embedded on Brazilian sites receiving Portuguese
- The LLM responds in the context's language instead of the user's language
Common "solutions" are detection APIs (Google, AWS Comprehend) that add latency and cost, libraries like langdetect/fasttext that are heavy dependencies for 3 languages, or simply ignoring the problem.
I needed something with zero latency, no dependencies, and configurable per Knowledge Base.
The Solution: Heuristic Detection + Priority Chain
General Approach
Instead of using a full statistical model, I attack the problem in layers:
```
User query
     │
     ▼
┌──────────────────────┐
│ 1. Fixed override?   │──→ If admin forced "en": always use "en"
└──────┬───────────────┘
       │ No override
       ▼
┌──────────────────────┐
│ 2. Heuristic         │──→ Count ES/EN/PT keywords
└──────┬───────────────┘
       │ Score < threshold
       ▼
┌──────────────────────┐
│ 3. Browser lang?     │──→ Browser language (widget)
└──────┬───────────────┘
       │ Not available
       ▼
Default: "es"
```
The Word Lists
The heart of detection is three sets of common words per language:
```python
import re

_PUNCT_RE = re.compile(r"[^\w\s]", re.UNICODE)

_EN_WORDS = {
    "how", "what", "where", "when", "why", "who", "which", "can", "could",
    "would", "should", "does", "do", "is", "are", "the", "this", "that",
    "help", "please", "tell", "explain", "show", "need", "want", "have",
    "about", "from", "with", "your", "they", "there", "their", "been",
    "just", "also", "very", "some", "any", "other", "than", "into",
}

_PT_WORDS = {
    "como", "onde", "quando", "porque", "quem", "qual", "pode", "poderia",
    "esta", "são", "isso", "isto", "ajuda", "ajudar", "mostre", "explique",
    "você", "vocês", "obrigado", "obrigada", "preciso", "quero", "tenho",
    "sobre", "para", "com", "seu", "sua", "eles", "também", "muito",
    "algum", "outro", "mais", "ainda", "aqui", "depois", "antes", "entre",
    "posso", "gostaria", "fazer", "dizer", "favor", "bom", "boa", "dia",
    "noite", "olá", "sim", "não", "bem", "tudo",
}

_ES_WORDS = {
    "qué", "cómo", "dónde", "cuándo", "cuál", "quién", "puedo", "podría",
    "necesito", "quiero", "tengo", "también", "aquí", "después", "ahora",
    "entonces", "pero", "sino", "aunque", "desde", "hasta", "hacia",
    "según", "ayuda", "explicar", "mostrar", "buscar", "hola", "gracias",
    "información", "pregunta", "respuesta", "por", "favor",
}
```
Key decisions in the lists:
- English focuses on function words (the, is, are) that almost never appear in Spanish or Portuguese.
- Portuguese includes key differentiators vs. Spanish: você, não, obrigado, tudo.
- Spanish uses accents where possible (qué, cómo, dónde); a user who types with accents is almost certainly writing in Spanish.
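Given the sets above, a quick sanity check shows how little they overlap:

```python
print(_EN_WORDS & _ES_WORDS)  # set()
print(_EN_WORDS & _PT_WORDS)  # set()
print(_ES_WORDS & _PT_WORDS)  # {'favor'}
```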
The Detection Function
```python
def detect_language(query: str) -> str:
    """Detect query language. Returns 'es', 'en', or 'pt'. Default 'es'."""
    q = _PUNCT_RE.sub("", query.lower().strip())
    words = set(q.split())
    scores = {
        "en": len(words & _EN_WORDS),
        "pt": len(words & _PT_WORDS),
        "es": len(words & _ES_WORDS),
    }
    best = max(scores, key=scores.get)
    if scores[best] >= 2:
        return best
    return "es"
```
Why threshold of 2? With a single matching word, false positive risk is high. "como" exists in both Spanish and Portuguese. "para" too. But if we find 2+ words from one language, the probability of being correct jumps dramatically.
Why default "es"? My primary use case is Latin America. If I can't detect with confidence, Spanish is the safest bet. This is configurable — you can change the default based on your market.
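A few illustrative calls, with hypothetical queries:

```python
detect_language("how do I reset my password")  # "en": matches "how" and "do"
detect_language("não consigo fazer login")     # "pt": matches "não" and "fazer"
detect_language("como funciona")               # "es": one PT match, below threshold -> default
```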
The Priority Chain: resolve_language()
Heuristic detection is just one layer. The real function that decides the language is resolve_language():
```python
_LANG_INSTRUCTIONS = {
    "en": "Respond entirely in English.",
    "pt": "Responda inteiramente em português.",
    "es": "Responde completamente en español.",
}


def resolve_language(
    query: str,
    auto_detect: bool = True,
    override: str | None = None,
    browser_lang: str | None = None,
) -> str:
    """
    Resolve the final language for a query.
    Priority: override > heuristic detection > browser fallback > 'es'.
    """
    # 1. Admin override: ignore everything, force language
    if override and override in ("es", "en", "pt"):
        return override

    # 2. Heuristic: analyze the text
    if auto_detect:
        detected = detect_language(query)
        # If detected something other than default, trust it
        if detected != "es" or not browser_lang:
            return detected

    # 3. Browser lang: user's browser language
    if browser_lang:
        bl = browser_lang.lower()[:2]
        if bl in ("en", "pt", "es"):
            return bl

    # 4. Default
    return "es"
```
Why This Priority?
| Level | Source | When it wins |
|---|---|---|
| 1 | Admin override | Always. If the admin says "this KB is in English", it's respected |
| 2 | Heuristic | If it detects EN or PT with confidence (score >= 2) |
| 3 | Browser lang | If heuristic couldn't decide (returned "es" by default) |
| 4 | Default "es" | Last resort |
The subtle trick: if the heuristic returns "es", it could be actual Spanish OR "couldn't detect". In that case, we give browser_lang a chance. If the user's browser is in Portuguese and the query is ambiguous, it's probably Portuguese.
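To make the table concrete, a few hypothetical calls:

```python
resolve_language("how do I reset my password", override="en")  # "en": override wins
resolve_language("how do I reset my password")                 # "en": heuristic is confident
resolve_language("login", browser_lang="pt-BR")                # "pt": heuristic can't decide, browser wins
resolve_language("login")                                      # "es": last-resort default
```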
Usage in Chat vs Widget
```python
# Chat: no browser_lang
detected_lang = resolve_language(
    data.content,
    auto_detect=rag_config.language_auto_detect,
    override=rag_config.language_override,
)

# Widget: includes browser_lang as additional fallback
detected_lang = resolve_language(
    data.message,
    auto_detect=rag_config.language_auto_detect,
    override=rag_config.language_override,
    browser_lang=data.browser_lang,
)
lang_hint = get_language_instruction(detected_lang)
```
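The snippets call get_language_instruction, which isn't shown above; a minimal sketch consistent with the _LANG_INSTRUCTIONS dict would be (the fallback for unknown codes is my assumption):

```python
def get_language_instruction(lang: str) -> str:
    # Assumption: unknown codes fall back to the Spanish instruction
    return _LANG_INSTRUCTIONS.get(lang, _LANG_INSTRUCTIONS["es"])
```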
The widget captures the browser language and sends it with each request:
```javascript
var body = {
  message: text,
  session_token: sessionToken,
  browser_lang: widgetLang, // navigator.language || "es"
};
```
Injecting Language into the LLM System Prompt
Detecting the language isn't enough — you need to force the LLM to respond in that language. This is done by injecting an explicit instruction at the end of the system prompt:
```python
def _build_messages(query, sources, ..., language_hint=None):
    system_message = identity_prefix + mode_prefix + _build_system_prompt(context)
    if language_hint:
        system_message += f"\n\nIMPORTANT: {language_hint}"
    messages = [{"role": "system", "content": system_message}]
    messages.append({"role": "user", "content": query})
    return messages
```
Where language_hint is one of:
```python
_LANG_INSTRUCTIONS = {
    "en": "Respond entirely in English.",
    "pt": "Responda inteiramente em português.",
    "es": "Responde completamente en español.",
}
```
Why at the end of the system prompt? LLMs tend to give the most weight to instructions at the beginning and end of a prompt (primacy and recency effects). Putting the language instruction at the end maximizes the probability of compliance, even when the document context is in a different language.
Why "IMPORTANT:"? For the same reason. LLMs respond better to instructions marked as important. Without this prefix, Llama 3.3 sometimes ignored the language instruction when all context was in another language.
Multilingual Embeddings: The Silent Hero
All of this works thanks to an embedding model that understands multiple languages in the same vector space:
```python
# config.py
embedding_model: str = "paraphrase-multilingual-MiniLM-L12-v2"
```
paraphrase-multilingual-MiniLM-L12-v2 generates 384-dimensional vectors and supports 50+ languages. This means:
- "How do I reset my password?" (English)
- "¿Cómo reseteo mi contraseña?" (Spanish)
- "Como redefinir minha senha?" (Portuguese)
All three produce embeddings that sit close together in vector space, despite being in different languages.
Without multilingual embeddings, an English query would never find Spanish documents. With this model, the user asks in English, pgvector finds Spanish chunks (because the meaning is close), the cross-encoder confirms relevance, and the LLM responds in English. No intermediate translation.
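You can verify this yourself with sentence-transformers; a minimal sketch, assuming the package is installed (exact similarity values will vary):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
queries = [
    "How do I reset my password?",
    "¿Cómo reseteo mi contraseña?",
    "Como redefinir minha senha?",
]
embeddings = model.encode(queries, normalize_embeddings=True)
# Off-diagonal cosine similarities stay high despite the language mix
print(util.cos_sim(embeddings, embeddings))
```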
The reranker is also multilingual (cross-encoder/mmarco-mMiniLMv2-L12-H384-v1, trained on mMARCO, 14 languages). Crucial because it evaluates query + chunk together — if they're in different languages, it needs to understand both.
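The same library exposes the cross-encoder; a sketch under the same assumptions:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")
# English query scored directly against a Spanish chunk
scores = reranker.predict([
    ("How do I reset my password?",
     "Para restablecer su contraseña, vaya a Configuración > Seguridad."),
])
print(scores)  # a single relevance score
```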
BM25: The Hardcoded tsvector Limitation
There's an elephant in the room. My BM25 search uses PostgreSQL tsvector with 'spanish' hardcoded:
```sql
SELECT c.id, c.content, d.filename AS document_name,
       ts_rank_cd(c.search_vector, plainto_tsquery('spanish', :query)) AS score
FROM chunks c
JOIN documents d ON d.id = c.document_id
WHERE c.knowledge_base_id IN (:kb_ids)
  AND c.search_vector @@ plainto_tsquery('spanish', :query)
ORDER BY score DESC
LIMIT :top_k
```
The problem: if the user asks in English, plainto_tsquery('spanish', 'how do I reset') won't tokenize English words correctly. Spanish stemming converts "reset" differently than English stemming.
Why I Didn't Change It
- Vector search compensates: BM25 is 30% of hybrid search. If it fails for cross-language queries, vector search (70%) still works perfectly thanks to multilingual embeddings.
- The OR fallback helps: when the AND query finds no results, a fallback to OR (to_tsquery with |) is more permissive and rescues partial matches.
- Complexity vs. benefit: a dynamic tsvector per language would require multiple search_vector columns or on-the-fly regeneration. The marginal improvement doesn't justify the cost when vector search is the primary component.
This is a known and documented limitation. If your use case has 80%+ queries in English with English documents, you should change 'spanish' to 'english' — or better yet, to 'simple' which does basic tokenization without language-specific stemming.
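If I ever make the dictionary dynamic, the mapping itself is trivial. A hypothetical sketch (these names don't exist in the codebase):

```python
# Hypothetical helper: map the detected language to a PostgreSQL
# text-search dictionary for plainto_tsquery
_TS_DICTS = {"es": "spanish", "en": "english", "pt": "portuguese"}

def ts_dictionary(lang: str) -> str:
    # 'simple' tokenizes without stemming: the safe cross-language fallback
    return _TS_DICTS.get(lang, "simple")
```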
Content Rules: Moderation Without LLM
Beyond detecting language, I implemented a content rules system that acts before the RAG pipeline. Three types:
BLOCK: Stop the Query
```python
def evaluate_block_redirect(
    query: str, rules: list[ContentRule]
) -> dict | None:
    q_lower = query.lower()
    for rule in rules:
        if not rule.enabled or rule.type == "filter":
            continue
        for trigger in rule.triggers:
            if trigger.lower() in q_lower:
                return {"type": rule.type, "response": rule.response}
    return None
```
If a BLOCK trigger matches, the user receives the configured response and the RAG pipeline never executes. Zero tokens, zero LLM latency.
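For example, a hypothetical block rule (ContentRule is the Pydantic model shown later; the trigger and response are made up):

```python
rules = [ContentRule(
    type="block",
    triggers=["competitor x"],
    response="I can only answer questions about our own product.",
)]

evaluate_block_redirect("What do you think about Competitor X?", rules)
# -> {"type": "block", "response": "I can only answer questions about our own product."}
```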
REDIRECT and FILTER
REDIRECT uses the same mechanism as BLOCK but to redirect ("For billing inquiries, contact support@company.com"). FILTER is different — it acts post-retrieval, removing chunks before sending them to the LLM:
```python
def get_filter_terms(rules: list[ContentRule]) -> list[str]:
    terms = []
    for rule in rules:
        if rule.enabled and rule.type == "filter":
            terms.extend(rule.triggers)
    return [t.lower() for t in terms]


def filter_chunks(sources: list[dict], filter_terms: list[str]) -> list[dict]:
    if not filter_terms:
        return sources
    filtered = []
    for source in sources:
        content_lower = source.get("content", "").lower()
        if not any(term in content_lower for term in filter_terms):
            filtered.append(source)
    return filtered
```
Useful when your documents have sections you don't want the LLM to use as context (internal pricing, confidential information in partially public documents).
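Putting the two helpers together, a hypothetical filter rule:

```python
# Hypothetical rule: keep internal pricing out of the LLM context
rules = [ContentRule(type="filter", triggers=["internal pricing"])]
sources = [
    {"content": "Public feature overview for the API."},
    {"content": "Internal pricing: enterprise tier margins..."},
]
filter_chunks(sources, get_filter_terms(rules))
# -> only the first chunk survives
```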
The Full Flow
```python
# 1. Content rules check (BLOCK / REDIRECT), before everything
if rag_config.content_rules:
    rule_match = evaluate_block_redirect(data.content, rag_config.content_rules)
    if rule_match:
        yield {"event": "content_blocked", "data": json.dumps({
            "type": rule_match["type"],
            "response": rule_match["response"],
        })}
        return

# 2. Normal RAG pipeline (vector + BM25 + rerank + MMR)
sources = await search_chunks(db, query, kb_ids, rag_config=rag_config)

# 3. Apply FILTER content rules, after retrieval and before the LLM
if rag_config.content_rules:
    blocked_terms = get_filter_terms(rag_config.content_rules)
    if blocked_terms:
        sources = filter_chunks(sources, blocked_terms)

# 4. LLM with language hint
response = stream_chat_response(query, sources, language_hint=lang_hint)
```
Everything is stored in each Knowledge Base's rag_config JSONB — no separate table, no migrations:
```python
from typing import Literal

from pydantic import BaseModel, Field


class ContentRule(BaseModel):
    type: Literal["block", "redirect", "filter"]
    triggers: list[str] = Field(default_factory=list)
    response: str = Field(default="", max_length=500)
    enabled: bool = True


class RAGConfig(BaseModel):
    # ... retrieval, llm, processing configs ...
    language_auto_detect: bool = Field(default=True)
    language_override: str | None = Field(default=None, pattern=r"^(es|en|pt)$")
    content_rules: list[ContentRule] = Field(default_factory=list)
```
Each KB has its own detection, its own override, and its own content rules.
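As a sketch of how a KB's config validates (assuming the elided retrieval/LLM fields all have defaults, which the snippet above doesn't show):

```python
cfg = RAGConfig(
    language_auto_detect=True,
    language_override=None,
    content_rules=[
        ContentRule(
            type="redirect",
            triggers=["billing", "invoice"],
            response="For billing inquiries, contact support@company.com",
        ),
    ],
)
```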
Frontend and Tracking
In the RAG configuration dialog, the admin has an auto-detection toggle, forced language selector, and inline content rules CRUD:
```typescript
const languageAutoDetect = ref(true)
const languageOverride = ref('')
const contentRules = reactive<ContentRule[]>([])

// On save
payload.language_auto_detect = languageAutoDetect.value
payload.language_override = languageOverride.value || null
payload.content_rules = contentRules.map(r => ({ ...toRaw(r) }))
```
If the override is active, the auto-detection toggle is visually disabled — no point detecting if you've already forced a language.
Every query is logged with the detected language (migration 027: detected_language VARCHAR(5) on usage_logs), which allows analyzing language distribution per KB and detecting failure patterns.
Edge Cases and Limitations
1. Very Short Queries
"Reset" — is it English or Spanish (used as an anglicism)? With a single word, there's not enough context. The 2-word minimum threshold protects against this, but it means 1-2 word queries always return the default.
2. Spanglish and Code-Switching
"Necesito help con el login" — tie between ES and EN. Since the score doesn't reach 2 for English, it returns "es". Correct, but not for the ideal reasons.
3. Portuguese vs Spanish
Many words are identical ("como", "para", "sobre"). Differentiation depends on exclusive words like você, não, obrigado. Without them, a Brazilian user may be detected as Spanish-speaking.
4. Pure Technical Queries
"HTTP 403 POST /api/users" — score 0 in all languages, returns default. For queries without natural words, language matters less.
Numbers
| Aspect | Value |
|---|---|
| Detection latency | ~0.01ms (set intersection) |
| Word lists total | ~130 words (42 EN + 53 PT + 34 ES) |
| Confidence threshold | 2 words minimum |
| Languages supported | 3 (ES, EN, PT) |
| Multilingual embeddings | 50+ languages (384d) |
| Multilingual cross-encoder | 14 languages |
| Content rules per KB | Unlimited |
| Total overhead | < 1ms per request |
Lessons Learned
1. You Don't Need an ML Model to Detect 3 Languages
For a small set of languages, word lists + scoring is absurdly effective. An ML model needs to be loaded in memory, has cold start, and its accuracy for short queries (5-10 words) isn't significantly better than a well-calibrated heuristic approach.
2. The Priority Chain Is More Important Than Detection
The resolve_language() function is more valuable than detect_language(). Being able to combine admin override + automatic detection + browser signal in a configurable chain covers 99% of cases. Heuristic detection alone would cover maybe 85%.
3. browser_lang Is Underestimated
In embedded widgets, the browser language is an extremely strong signal. If the browser is set to pt-BR and the query is ambiguous, the user is almost certainly Brazilian. Adding this field to the widget request was one line of code that significantly improved the experience for short or ambiguous queries.
4. Content Rules Are More Useful Than Expected
I started implementing content rules as "nice to have" for basic moderation. In practice, admins use them creatively: redirect billing questions to the right email, block queries about competitors, filter internal sections from documents. They're a control layer that bypasses the LLM.
5. Multilingual Embeddings Do 80% of the Work
The uncomfortable truth is that if you use paraphrase-multilingual-MiniLM-L12-v2 for embeddings, cross-language retrieval works pretty well without doing anything else. Language detection is primarily for controlling the response language of the LLM, not retrieval.
6. Logging the Detected Language Is Essential
Without the detected_language field in usage_logs, I'd be guessing if detection works well. With the data, I can see patterns: "15% of queries to the Brazil widget are detected as Spanish" tells me I need to expand the Portuguese word lists.
What's Next
- Character n-gram detection: More robust for short queries than word lists
- Dynamic tsvector: Choose the dictionary based on the detected language ('english', 'portuguese', 'simple')
- More languages: French and German only need new word lists
- Dynamic stop words: Currently Spanish only; should adapt to the detected language
Conclusion
Multilingual support in RAG doesn't need to be complicated or expensive. Multilingual embeddings as the foundation, heuristic detection in ~90 lines, and a well-designed priority chain cover 99% of cases for ES/EN/PT.
What matters isn't the detection itself — it's the architecture: configurable per KB, with reasonable fallbacks, that logs its decisions for improvement, and where content rules give admins control without going through the LLM. Sometimes the simplest solution is the right one.
If you work with multilingual RAG and found other edge cases, drop a comment. And if this article was useful, a like helps it reach more people.