WonderLab

RAG Series (18): Conversational RAG — The Pronoun Problem in Multi-Turn Dialogue

The Hidden Assumption in Single-Turn RAG

Every article in this series so far has worked with one type of question: a standalone, self-contained query that the system can retrieve against and answer in a single shot.

Real conversations don't work like that.

After asking "What is RAGAS?", a user naturally continues:

Turn 1: What is RAGAS?
Turn 2: What are its four core metrics?
Turn 3: Which one is hardest to improve, and why?

Turn 1 is fine. "Its" in Turn 2 refers to RAGAS. "Which one" in Turn 3 refers to the four metrics mentioned in Turn 2. To a human, the referent is obvious. To a retrieval system, "what are its four core metrics?" is a query with no subject — the vector search will find documents semantically similar to "its four metrics," which could be anything.

This is single-turn RAG's hidden assumption: every question is independent and complete. The moment follow-up questions appear, this assumption breaks.


History-Aware Retriever: Rewrite Before You Retrieve

The fix is straightforward: before retrieval, use one LLM call to combine the current question with the conversation history and rewrite it into a standalone, self-contained question. Then use the rewritten question for retrieval.

Turn 1: What is RAGAS?                → retrieve directly (no history)
Turn 2: What are its four metrics?
        ↓ combine with Turn 1 history
        "What are the four core metrics in the RAGAS framework?"
        ↓ retrieve using rewritten question
Turn 3: Which one is hardest to improve?
        ↓ combine with Turn 1+2 history
        "Among RAGAS's four metrics, which is hardest to improve, and why?"
        ↓ retrieve using rewritten question

LangChain provides create_history_aware_retriever for this pattern, but to guard against verbose LLM output overrunning the embedding model's 512-token input limit, this implementation builds the chain manually and adds a truncation step:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableBranch, RunnableLambda

# CONTEXTUALIZE_PROMPT is defined in the next section; llm and retriever are
# assumed to be constructed earlier in the pipeline.

def _extract_standalone_question(text: str) -> str:
    """Keep only the first line — guards against verbose LLM output
    exceeding the embedding model's 512-token input limit."""
    lines = [l.strip() for l in text.strip().split("\n") if l.strip()]
    question = lines[0] if lines else text
    return question[:400]  # hard cap

_contextualize_chain = (
    CONTEXTUALIZE_PROMPT
    | llm
    | StrOutputParser()
    | RunnableLambda(_extract_standalone_question)
)

# No history → retrieve directly; history present → rewrite first
history_aware_retriever = RunnableBranch(
    (
        lambda x: not x.get("chat_history"),
        (lambda x: x["input"]) | retriever,
    ),
    _contextualize_chain | retriever,
)
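
To make the branch behavior concrete, here is a minimal smoke test, assuming the retriever and chain above are in scope; the message contents are illustrative, not from the article's dataset:

from langchain_core.messages import AIMessage, HumanMessage

# Turn 1: no history → the raw input is passed straight to the retriever
docs_turn1 = history_aware_retriever.invoke(
    {"input": "What is RAGAS?", "chat_history": []}
)

# Turn 2: history present → the question is rewritten first, then retrieved
docs_turn2 = history_aware_retriever.invoke({
    "input": "What are its four core metrics?",
    "chat_history": [
        HumanMessage(content="What is RAGAS?"),
        AIMessage(content="RAGAS is an evaluation framework for RAG systems."),
    ],
})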

The Architecture

The Contextualize Prompt

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

CONTEXTUALIZE_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Given the conversation history and the latest question, rewrite the "
     "question as a standalone, self-contained question.\n"
     "Requirements:\n"
     "- Replace all pronouns (it, this, these, which one, etc.) with specific nouns\n"
     "- Fill in any omitted subjects or objects\n"
     "- Output only the rewritten question, no explanation\n"
     "If the question is already complete and standalone, return it unchanged."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

Three design decisions worth noting:

  1. "Output only the question" — without this explicit constraint, the LLM explains its reasoning, producing output that far exceeds what an embedding model can handle
  2. History placed between the system and human messages: MessagesPlaceholder("chat_history") expands to the full message list at that position (rendered in the sketch after this list)
  3. Unchanged passthrough condition — Turn 1 or semantically complete questions don't need rewriting; give the LLM an exit
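
A quick way to see decision 2 in action is to render the prompt by hand; the history messages below are illustrative:

from langchain_core.messages import AIMessage, HumanMessage

# MessagesPlaceholder expands in place between the system and human messages
messages = CONTEXTUALIZE_PROMPT.invoke({
    "chat_history": [
        HumanMessage(content="What is RAGAS?"),
        AIMessage(content="RAGAS is an evaluation framework for RAG systems."),
    ],
    "input": "What are its four core metrics?",
}).to_messages()
# Resulting order: system instructions, the two history messages,
# then the new human question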

The Full ConvRAG Chain

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

# Step 1: History-aware retrieval
history_aware_retriever = ...   # see above

# Step 2: Generate answer using retrieved docs + conversation history
ANSWER_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "You are a RAG technology expert. Answer based on the reference material.\n"
     "Reference material:\n{context}"),
    MessagesPlaceholder("chat_history"),   # history also informs generation
    ("human", "{input}"),
])
qa_chain  = create_stuff_documents_chain(llm, ANSWER_PROMPT)
rag_chain = create_retrieval_chain(history_aware_retriever, qa_chain)

# Step 3: Session-based history management
store: dict[str, ChatMessageHistory] = {}

def get_session_history(session_id: str) -> ChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

conv_rag = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

Each session_id maps to an isolated conversation history. RunnableWithMessageHistory automatically injects history before each invoke and appends the new Q&A pair afterward.
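
A minimal two-turn invocation, assuming the chain above; "demo" is an arbitrary session id:

config = {"configurable": {"session_id": "demo"}}

r1 = conv_rag.invoke({"input": "What is RAGAS?"}, config=config)
r2 = conv_rag.invoke({"input": "What are its four core metrics?"}, config=config)

# create_retrieval_chain returns a dict with "input", "context", and "answer";
# the second call sees Turn 1 via the injected chat_history
print(r2["answer"])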


Question Rewriting Results

Three test conversations, showing the rewriting output for each Turn 2 follow-up:

[RAGAS follow-up]
  Original:  What are its four core metrics?
  Rewritten: What are the four core metrics in the RAGAS framework?

[Vector DB follow-up]
  Original:  Which one is best for production?
  Rewritten: Among the common vector databases (Chroma, Pinecone, Milvus, Qdrant),
             which is most suitable for a production environment?

[Advanced RAG follow-up]
  Original:  What about Graph RAG and Agentic RAG?
  Rewritten: What problems do Graph RAG and Agentic RAG each solve?

"Its" becomes "in the RAGAS framework." "Which one" expands to a full list of the databases mentioned in Turn 1. Without this disambiguation, these questions produce garbage retrieval results. With it, they retrieve the right documents.


Retrieval Comparison: The Real Story in Turn 2

Retrieving with "What are its four core metrics?" directly versus retrieving with "What are the four core metrics in the RAGAS framework?":

Baseline retrieved (raw: "What are its four core metrics?"):
  doc1: RAG core workflow: Retrieval → Augmentation → Generation.
        RAG was introduced by Meta AI in 2020...
  doc2: Document chunking strategies affect RAG retrieval quality:
        fixed-size chunking (chunk_size=512-1024) works for general cases...

ConvRAG retrieved (rewritten: "What are the four core metrics in RAGAS?"):
  doc1: RAGAS is an evaluation framework designed specifically for RAG systems,
        introduced by Es et al. in 2023.
        The four core metrics: 1. context_recall... 2. context_precision...
  doc2: Embedding models convert text to vectors, setting the quality ceiling
        for semantic retrieval...

Baseline retrieves the RAG introduction and chunking strategies — both about RAG, but neither contains the RAGAS metrics. ConvRAG retrieves the RAGAS document directly. The gap is qualitative, not marginal.
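
The comparison can be reproduced in a few lines, assuming the same retriever is in scope; the queries are the ones shown above:

raw_query = "What are its four core metrics?"
rewritten_query = "What are the four core metrics in the RAGAS framework?"

# Retrieve with both queries and print the top hit for each
for label, query in [("Baseline", raw_query), ("ConvRAG", rewritten_query)]:
    docs = retriever.invoke(query)
    print(f"{label} top hit: {docs[0].page_content[:80]}")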


RAGAS Metrics: An Interesting Reversal

======================================================================
  RAGAS Metrics Comparison (Baseline vs Conversational RAG)
======================================================================

  Metric               Baseline       ConvRAG        Delta
  ──────────────────────────────────────────────────────────────
  context_recall          0.667          0.400      ↓-0.267  ◀
  context_precision       0.880          0.870      →-0.010
  faithfulness            1.000          1.000      →+0.000
  answer_relevancy        0.432          0.430      →-0.002
======================================================================
  Note: Evaluated on the final turn (Turn 3) of each 3-turn conversation

ConvRAG's context_recall is 0.267 lower than Baseline. That's counterintuitive — why would "better retrieval" produce less relevant context?

The answer is in what RAGAS actually evaluated.

The evaluation ran on Turn 3 of each conversation:

  • "Which metric is hardest to improve, and why?"
  • "If my team is just starting with RAG, which database should we choose?"
  • "What is the evolutionary relationship between these four techniques?"

These Turn 3 questions are semantically complete on their own. Even without conversation history, retrieving directly on these questions finds the right documents. The Baseline does exactly that — and it works.

ConvRAG takes Turn 3 questions and rewrites them incorporating the accumulated history. "What is the evolutionary relationship between these four techniques" might become "What is the evolutionary relationship between Self-RAG, CRAG, Graph RAG, and Agentic RAG" — semantically richer, but the changed phrasing may cause the retrieval to land on slightly different documents, reducing context_recall.

RAGAS failed to capture Conversational RAG's core value.

The value is in Turn 2 — pronoun disambiguation turning a failed retrieval into a correct one. RAGAS evaluated Turn 3, where the questions happened to work without history. The experiment design favored the Baseline scenario, obscuring ConvRAG's genuine contribution.

This is a recurring theme in this series: metrics measure what they measure. Always ask — what scenario did the metric actually test? What did it miss?


When to Use Conversational RAG

  Scenario                                       Baseline RAG                      Conversational RAG
  Every question is standalone                   ✅ Direct retrieval, low cost      ⚠️ Rewriting adds latency and cost
  Follow-ups with pronouns ("it", "which one")   ❌ Retrieval fails                 ✅ Disambiguation → correct retrieval
  Follow-ups with omitted subjects               ❌ Retrieval fails                 ✅ Subject restored → correct retrieval
  Multi-turn deep exploration of a topic         ⚠️ No context accumulation         ✅ Coherent, history-informed answers

Memory management trade-offs: this implementation keeps the full conversation history. It's accurate but token cost grows with each turn. Common production alternatives:

  • Sliding window: keep only the last N turns
  • Summary memory: compress older turns into a summary via LLM, keep the most recent 1–2 turns in full detail

The choice depends on conversation length and how far back the relevant context might reach.
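
As an illustration of the sliding-window option, here is a hedged sketch that trims the stored history to the last few turns; WINDOW_TURNS and get_windowed_history are illustrative names, not part of the article's implementation:

WINDOW_TURNS = 3  # keep only the last 3 question/answer pairs

def get_windowed_history(session_id: str) -> ChatMessageHistory:
    history = store.setdefault(session_id, ChatMessageHistory())
    # Two messages (human + AI) per turn; drop anything older than the window
    if len(history.messages) > 2 * WINDOW_TURNS:
        history.messages = history.messages[-2 * WINDOW_TURNS:]
    return history

# Drop-in replacement when wrapping the chain:
# conv_rag = RunnableWithMessageHistory(rag_chain, get_windowed_history, ...)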


Full Code

Complete code is open-sourced at:

https://github.com/chendongqi/llm-in-action/tree/main/18-conversational-rag

Key file:

  • conversational_rag.py — full implementation: two pipelines, question rewriting demo, RAGAS evaluation

How to run:

git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/18-conversational-rag
cp .env.example .env
pip install -r requirements.txt
python conversational_rag.py

Summary

This article implemented Conversational RAG. Key findings:

  1. Pronoun disambiguation is the core problem — "what are its four metrics?" retrieves completely irrelevant documents; the Turn 2 retrieval comparison makes this gap unmistakable
  2. Question rewriting works well — GLM-4-flash accurately rewrites "what are its four metrics?" to "what are the four core metrics of the RAGAS framework?"; disambiguation quality is solid
  3. RAGAS showed a reversal — ConvRAG's context_recall was lower (0.400 vs 0.667), because the Turn 3 test questions were semantically complete on their own; direct retrieval happened to work fine for those specific questions
  4. Metrics and scenario value diverged most sharply here — the value of Conversational RAG lies in the "pronoun follow-up fails" scenario, which RAGAS didn't test; the numbers don't reflect the actual benefit

Across this series: Self-RAG asked "should we retrieve?", CRAG asked "is what we retrieved good enough?", Graph RAG handled relational reasoning, Agentic RAG unified them into a decision loop, and Conversational RAG now handles the temporal dimension — making each question aware of what came before. Each one expands the range of scenarios the system handles correctly.

