DEV Community

HarinezumIgel
Speaking the Corpus's Language: How Multilingual RAG Stays Coherent Across Turns

This article assumes you already run a multi-turn RAG pipeline and have query rewriting enabled.

  • Multilingual RAG breaks in more places than just query translation
  • Query rewriting can re‑introduce the wrong language
  • Treat rewritten queries as raw input and re‑detect + re‑translate
  • Cheap detection + caching makes this practical

Most RAG tutorials quietly assume this scenario just works. In production, it fails in ways that look like “bad retrieval” but are actually language drift.

  • A German user asks a question in German.
  • The corpus is in English.
  • A query rewriter pulls English entity names out of earlier answers.

Somewhere in the middle, translation has to happen — twice.

This article walks through the two translation passes that keep a multilingual RAG pipeline working correctly across multiple turns, why each one is necessary, and what breaks without them. It also documents the places where things still go wrong — because they do.


The setup

A typical local RAG stack:

  • Embeddings + vector store: An embedding model primarily optimized for English, but with usable multilingual similarity + Chroma DB
  • Retrieval: Hybrid (vector + BM25), fused with Reciprocal Rank Fusion
  • Generation: local LLM via Ollama
  • Query rewriting: instruction-style rewriting of follow-up queries into self-contained questions
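The fusion step in that stack is simple enough to sketch. The following is an illustrative Reciprocal Rank Fusion implementation, not the repository's code; the document IDs and k = 60 are assumptions for the example.

```python
# Illustrative RRF: each retriever contributes 1/(k + rank) per document;
# documents that rank well in several lists accumulate the highest score.

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc IDs into one RRF-scored ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector ranking
bm25_hits = ["doc_b", "doc_d", "doc_a"]     # hypothetical BM25 ranking
fused = rrf_fuse([vector_hits, bm25_hits])  # doc_b wins: high in both lists
```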

The corpus is English. Users are not.

That single asymmetry — English internals, multilingual edges — forces translation into more places than people expect.


Pass 1 — Translating the user query (every turn)

A user might ask:

Was für Tiere sind beschrieben?

Embeddings, BM25, and the corpus all operate in English. Sending the raw German query straight to retrieval yields poor results for a simple reason: BM25 sees no shared tokens, and an English-optimized embedding model matches a German query far more weakly than its English equivalent.

Each turn therefore runs:

detect language → translate to English → retrieve

Language detection happens on every turn, not once per session. Users frequently code-switch inside a single query, especially when technical terms or entity names appear in previous answers.

The detected source language is stored in the session so the final LLM call can be instructed to answer in the user's language.
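Concretely, the per-turn flow can be sketched as follows. This is a hedged illustration: `detect_language`, `TOY_MT`, and `pass1` are toy stand-ins invented for this sketch (a real system would use an actual detector and MT model), not the repository's API.

```python
# Illustrative Pass 1 sketch. Only the control flow mirrors the pipeline
# described above; the detector and translation table are toys.

GERMAN_HINTS = {"was", "für", "sind", "ist", "welches"}  # toy heuristic only

TOY_MT = {  # stand-in for a real translation model
    "was für tiere sind beschrieben?": "What animals are described?",
}

def detect_language(text: str) -> str:
    """Toy detector: flags common German function words."""
    return "de" if set(text.lower().split()) & GERMAN_HINTS else "en"

def translate_to_english(text: str, lang: str) -> str:
    if lang == "en":
        return text  # short-circuit: no translation call at all
    return TOY_MT.get(text.lower(), text)

def pass1(query: str, session: dict) -> str:
    lang = detect_language(query)   # runs on EVERY turn, not per session
    session["user_lang"] = lang     # the final answer is produced in this language
    return translate_to_english(query, lang)
```

With this in place, `pass1("Was für Tiere sind beschrieben?", session)` returns the English query while recording `de` in the session for the final answer.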

Here is what that first turn looks like in the running system:

💬 Your query>  what animals are described?

🔵 UserQuery      Original user query: 'what animals are described?'
🔵 LangDetect     Detected language: English (en) — confidence: 100%
🔵 QueryRewrite   No conversation history — skipping rewrite
🔵 FinalQuery     Final query for retrieval: 'what animals are described?' (unchanged)

💡 ### Answer
   The following distinct animal species are described in the context:
   - Hedgehogs, Cats, Fish, Dogs, Lions, Apes (Gorillas and Orangutans),
     Elephants, Pferde (Horses)

   ### Sources
   Hedgehogs.pdf · Cats.md · Fish.txt · Dogs.png · Lions.pptx
   Apes.docx · Elephants.jpg · Pferde.pdf

No rewrite happens on turn 1 — there is no history yet, so the query goes straight through.


Pass 2 — Re-translation after rewriting

Multi-turn RAG breaks on pronouns. A query rewrite step fixes that, but introduces a subtler problem.

When entity names are copied from English history into a non-English sentence template, the rewritten query can be mixed-language:

Sind elephants, hedgehogs Säugetiere?

That output is internally consistent. It is also unusable by an English-only retriever.

A concrete turn-by-turn:

1. User (DE):       Was für Tiere sind beschrieben?
2. Translated:      What animals are described?
3. Assistant (EN):  Cats, horses, hedgehogs, dogs, fish, apes, elephants, lions.
4. User (EN):       Are they mammals?
5. Rewritten:       Are Hedgehogs, Cats, Fish, Dogs, Lions, Gorillas,
                    Orangutans, Elephants, and Horses mammals?
6. Re-translated:   (already English — pass short-circuits)
7. Retrieval:       Are Hedgehogs, Cats, Fish, Dogs, Lions, Gorillas,
                    Orangutans, Elephants, and Horses mammals?

Step 5 is what makes retrieval work. The pronoun "they" is gone; explicit entity names from the previous answer replace it. Here is the log for turn 2:

💬 Your query>  are they mammals?

🔵 UserQuery      Original user query: 'are they mammals?'
🔵 LangDetect     Detected language: English (en) — confidence: 100%
🔵 QueryRewrite   Rewriting query using 1 history turns
🔵 QueryRewrite   'are they mammals?' → 'are Hedgehogs, Cats, Fish, Dogs,
↳                 Lions, Gorillas, Orangutans, Elephants, and Horses mammals?'
🔵 QueryRewrite   Final query: are Hedgehogs, Cats, Fish, Dogs, Lions,
↳                 Gorillas, Orangutans, Elephants, and Horses mammals?
🔵 LangDetect     Detected language: English (en) — confidence: 100%
🔵 FinalQuery     Final query for retrieval: 'are Hedgehogs, Cats, Fish,
↳                 Dogs, Lions, Gorillas, Orangutans, Elephants, and Horses mammals?'
                  (was: 'are they mammals?')

💡 ### Answer
   Hedgehogs, Cats, Dogs, Lions, Gorillas, Orangutans, and Elephants are
   mammals. The context is silent on whether Horses are explicitly stated
   as mammals in the provided text.

The rewriter resolves the coreference correctly. Retrieval gets a self-contained query with real entity names.

The pattern holds across deeper turns too. On turn 3, "these mammals" gets resolved against the confirmed mammal list from the previous answer:

💬 Your query>  give me details you find about these mammals

🔵 QueryRewrite   Rewriting query using 2 history turns
🔵 QueryRewrite   'give me details you find about these mammals' →
↳                 'give me details about Hedgehogs, Cats, Dogs, Lions,
                  Gorillas, Orangutans, and Elephants'

Why the second translate pass is worth running

It is reasonable to ask whether a second translation pass is really necessary. After all, the user query was already translated to English before the rewriter ran — so the rewriter should have only English material to work with.

In practice, two things keep this from holding:

  1. The rewriter's context is mixed. It reads recent stored turns as USER: … / ASSISTANT: … pairs, which means it sees the user's original-language wording and the assistant's English answers. Whichever side it copies from, the result can land in either language.
  2. Rewriting is not translation. Rewriting models are tuned to resolve references and stay close to the user's phrasing, not to enforce a target language. When entity names from one language are dropped into a sentence template from another, the output is faithful to its inputs but no longer monolingual.

A short detect-and-translate pass on the rewritten query is cheap when the output is already English (detection runs; translation skips) and rescues the mixed-language case when it is not.

The fix is simple but often skipped:

Treat rewritten queries as raw input and run language detection and translation again.
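In code, Pass 2 is literally the same step run again on the rewriter's output. A minimal sketch, with toy stand-ins for the real detector and translator (`detect_language`, `TOY_MT`, and `pass2` are illustrative names, not the repository's API):

```python
# Illustrative Pass 2 sketch: the rewritten query is treated as raw input
# and re-detected. The detector and translation table are toys.

GERMAN_HINTS = {"sind", "säugetiere?", "was", "für"}

TOY_MT = {  # stand-in for a real translation model
    "sind elephants, hedgehogs säugetiere?": "Are elephants, hedgehogs mammals?",
}

def detect_language(text: str) -> str:
    return "de" if set(text.lower().split()) & GERMAN_HINTS else "en"

def pass2(rewritten: str) -> str:
    if detect_language(rewritten) == "en":
        return rewritten  # no-op: English output passes straight through
    return TOY_MT.get(rewritten.lower(), rewritten)  # mixed-language rescue
```

The English short-circuit is why this pass is cheap: detection runs, translation does not.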

Known failure mode: the rewriter over-triggers

The rewriter assumes that any query arriving alongside chat history must contain a reference worth resolving. Small LLMs do not reliably detect the absence of coreference, so a self-contained query — one that never actually pointed at anything — can still get entity-substituted from history.

Concrete example. A new chat session starts. The user's first message is "Was für Tiere sind beschrieben?" ("What animals are described?"). One history turn exists from a previous session whose answer listed hedgehogs, cats, fish, and so on. (This is a realistic condition in any setup where history is stored externally — shared session stores, reconnecting clients, or persistent memory features — rather than being cleared when a browser tab closes.) The rewriter reads that turn, finds entity candidates, and produces:

What animals are Hedgehogs, Cats, Fish, Dogs, Lions, Apes, and Elephants?

The query is now semantically broken: instead of asking the corpus what animals it mentions, it asks whether a specific hardcoded list of animals are animals. The LLM answers as a taxonomy question and ignores the documents entirely.

Mitigations and their tradeoffs

| Approach | Prevents over-triggering? | Downsides |
|---|---|---|
| `use_chat_context: false` | ✅ fully — no history, no rewrite | Loses all coreference resolution |
| `max_history_turns: 1` | ⚠️ partially — only if the bad turn falls outside the 1-turn window | If the prior answer is the immediately preceding turn, the material is still there |
| Jaccard overlap gate | ⚠️ partially — very similar queries (overlap ≥ threshold) are skipped | False-skips valid coreference on translated queries: German "welches sind Säugetiere?" → "What are mammals?" after the translation model drops the determiner → overlap ≈ 0 → the gate fires and skips a query that genuinely needs rewriting |
| Explicit follow-up phrasing | ✅ user-side — "which of those are mammals?" always has a clear referent | Requires user discipline |

The Jaccard gate was the original gating mechanism in this codebase and was removed precisely because of the translated-query false-skip problem. It solves the over-triggering case but breaks the core use case the rewriter exists for. There is no free lunch here; the right setting depends on whether your users mix sessions or always start fresh.
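The false-skip in the Jaccard row is easy to verify numerically: after translation, a German query and its English equivalent share no tokens, so token-level Jaccard overlap drops to zero and the gate misfires.

```python
# Worked example of the false-skip: token-level Jaccard overlap between a
# German query and its English translation is zero, so an overlap gate
# wrongly concludes "unrelated query, skip the rewrite".

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

german = "welches sind säugetiere?"
english = "what are mammals?"
overlap = jaccard(german, english)  # 0.0: no shared tokens after translation
```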


The full pipeline

If you don’t read flowcharts fluently, you can skip this and continue — the conclusions don’t depend on it.

[Figure: Multilingual RAG pipeline flowchart]

The three blue nodes are translation steps. The rewrite node (green) sits between them — it is what makes pass 2 necessary.


Putting it together

In the worst case, a single multi-turn query involves:

  • one user-query translation (Pass 1)
  • an LLM rewrite using stored history
  • an optional re-translation of the rewritten query (Pass 2)
  • the final LLM answer, returned in the user's original language

Careful detection and caching keep the total overhead manageable. When the query and the rewritten query are both already English, both translation calls short-circuit immediately.
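A sketch of what that caching can look like, assuming a memoized translation wrapper. `cached_translate` and `slow_mt_call` are illustrative names, not the repository's API; `functools.lru_cache` is one simple way to do it.

```python
# Memoized translation: repeated queries hit the cache, and already-English
# input returns immediately without touching the MT model at all.

from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_translate(text: str, source: str, target: str = "en") -> str:
    if source == target:
        return text  # short-circuit: no MT call for already-English input
    return slow_mt_call(text, source, target)  # placeholder for the real MT model
```

A real system might key the cache on (text, source, target) explicitly and bound it by bytes rather than entries; the short-circuit branch is the part that matters here.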


The retrieval–generation asymmetry

The two translation passes described here keep the query in a language the vector store and reranker can match against. They say nothing about the documents.

A corpus document in a language different from English (say, a PDF written entirely in German) will still be retrieved correctly — the embedding model handles multilingual similarity — but the generation LLM may silently skip or ignore that chunk when composing its answer in English. The retrieval log will show the chunk selected; the answer will omit the information.

This is visible in practice: with all other test documents in English and one German document (Pferde.pdf), the query "what animals are described?" retrieves three Pferde.pdf chunks above threshold, yet the answer lists every other animal and omits horses entirely.

The root cause is the asymmetry between:

  • Retrieval — handled by embedding and reranking models that were trained multilingually and score cross-lingual similarity well
  • Generation — handled by the chat LLM, which is given chunks in whatever language they were indexed in and is expected to synthesize them into a coherent answer

The two fixes are:

| Approach | Effect | Cost |
|---|---|---|
| Translate documents to English at ingestion time | Retrieval and generation both work | Permanent content transformation; original language lost in storage |
| Add an explicit instruction to the system prompt to handle multilingual chunks | LLM is reminded to use all chunks regardless of language | Costs tokens; not guaranteed; LLM may still miss content |
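One possible phrasing of that prompt-level instruction, as an illustration rather than the repository's actual system prompt:

```python
# Hypothetical system-prompt fragment for the multilingual-chunk mitigation.
# The wording is an assumption; tune it against your own generation model.

MULTILINGUAL_CHUNK_INSTRUCTION = (
    "Some retrieved context chunks may be written in a language other than "
    "English. Treat them as equally authoritative: translate their content "
    "internally and include their facts in your answer. Never skip a chunk "
    "because of its language."
)
```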

The translation passes in this article do not solve this — they fix the query path. The document path is a separate problem.


Takeaways

  • Detect first; translate only when needed — language detection on every turn, not once per session
  • Do not assume rewriting preserves language — rewritten queries must be treated as raw input
  • The rewriter's context is mixed-language by nature — users write in one language, answers reference entities from another
  • A second translate pass is cheap — it is a no-op for English output and a correctness fix for mixed output

Pronoun and coreference rewriting is essential for multi-turn chat — without it, follow-up questions like "are they mammals?" lose their referent the moment they are translated. The two translation passes described above are what keep that rewrite step working once users stop typing in English.


Source code

The complete implementation is available here:

github.com/HarinezumIgel/RAG-LCC

If you find the approach useful and think it could help other people building multilingual RAG systems, leaving a star on the repository can make it easier for them to discover it.
