Sharath Kurup

Posted on May 18

Engineering RAG Systems That Actually Work: Conversational Retrieval, Page Awareness & Debugging (Part 5)

#rag #ai #python #tutorial

From a working prototype to something that actually behaves like a real system.

Introduction

In the previous parts, we built a fairly solid RAG pipeline:

Vector search using FAISS
Better chunking strategies
Context optimization
HyDE for improving retrieval quality

At that point, the system worked. It could answer questions from a document.

But it still didn’t feel right.

If you’ve used systems like ChatGPT, you know the difference immediately. The answers are not just correct — they are contextual, conversational, and consistent across turns.

Our RAG system wasn’t there yet.

What was still missing?

A few practical issues started showing up:

Follow-up questions like “explain more” didn’t work well
Queries without enough context failed silently
Retrieval sometimes returned semantically similar but irrelevant chunks
There was no clear way to debug why something went wrong

This part focuses on solving exactly those problems.

Goal for this part

Move from:

Query → Retrieve → Answer

To something closer to:

Understand → Retrieve → Validate → Answer → Debug

1. Conversational Retrieval

A basic RAG pipeline treats every query independently. That works for demos, but not for real usage.

Users don’t ask perfectly structured questions every time. They ask things like:

“Explain this”
“Continue”
“What does that mean?”

Without context, these queries are meaningless.

Making queries context-aware

Instead of using only the current query, we include recent conversation history.

You’re already doing this in two places.

HyDE (used with context)

def generate_hypothetical_answer(query, chat_history):
    recent_user = [m["content"] for m in chat_history[-4:] if m["role"] == "user"]
    history_text = " | ".join(recent_user[-2:])

    prompt = (f"Write a 2-sentence technical summary answering: {query}\n"
              f"Recent user context: {history_text}")

    response = ollama.generate(model=HYDE_MODEL, prompt=prompt, stream=False)
    return response['response']

This makes the generated “hypothetical answer” more aligned with the conversation, not just the current query.

Passing conversation into the final answer

history_text = "\n".join([
    f"{history['role']}: {history['content']}"
    for history in chat_history[-5:]
])

This allows the model to generate responses that are consistent across turns.

Result

Queries become less brittle
Retrieval improves for vague inputs
Responses feel connected instead of isolated

Conversational Retrieval Flow
(User query + history → query enhancement → retrieval → answer)

2. Page-Aware Search

One thing becomes very obvious once we start testing real documents:

Not every query should be handled using semantic search.

The problem

If a user asks:

“What is written on page 5?”

A pure vector search might return:

Content from page 3 that looks semantically similar
Completely miss the actual page

This is technically “correct” from a similarity perspective, but practically wrong.

The fix: introduce a deterministic path

Instead of forcing everything through FAISS, we add a second mode.

Detecting page intent

page_match = re.search(r"page\s+(\d+)", query.lower())
target_page = int(page_match.group(1)) if page_match else None

Restricting search space

allowed_pages = [target_page - 1, target_page, target_page + 1]

candidates = [
    item for item in text_metadata
    if item["page"] in allowed_pages
]

Why this works well

removes ambiguity
reduce noise in retrieval
align results with user intent

It’s a simple idea, but it has a big impact on reliability.

Final behavior

Page-specific queries → deterministic filtering
General queries → semantic search (FAISS + rerank)

Page-aware vs Semantic Flow

3. Handling Follow-Up Queries

This is where the system starts to feel more natural.

The problem

A query like:

Explain more

has no standalone meaning.

Without additional context, retrieval either fails or returns random results.

Detecting vague queries

vague_followups = re.compile(
    r"^(explain(\s+\w+){0,3}|elaborate|tell me more|go on|continue|...)",
    re.IGNORECASE
)

Injecting context from previous answers

last_pages = get_last_referenced_pages(st.session_state.messages)

if last_pages:
    page_inject = last_pages[len(last_pages) // 2]
    query = f"{query} page {page_inject}"

What this achieves

A vague query like:

Explain more

gets transformed into something like:

Explain more page 5

Now retrieval has something concrete to work with.

Result

Follow-up queries become usable
Conversations feel continuous
Retrieval remains grounded in the document

Follow-up Handling Flow

4. Debugging the RAG Pipeline

This is probably the most important upgrade in this part.

Most RAG systems fail not because they are fundamentally broken, but because we can’t see what’s happening inside them.

The usual situation

We get a wrong answer and we don’t know:

Was the query bad?
Was retrieval wrong?
Did reranking fail?
Did the model hallucinate?

Everything looks like a black box.

Adding visibility

So, we introduce a debug layer that exposes each stage.

1. Query transformation

debug_data["query"] = query
debug_data["search_query"] = search_query

This lets us compare:

Original query
Final query used for retrieval (after HyDE / modifications)

Here’s what this looks like in practice:

2. FAISS retrieval results

distances, indices = index.search(query_vector.reshape(1, -1), k=FAISS_SEARCH_K)

Here’s what this looks like in practice:

3. Reranking results

results = ranker.rerank(RerankRequest(query=query, passages=rerank_items))

Here’s what this looks like in practice:

4. FAISS vs rerank comparison

This is where things get interesting.

We can now see cases where:

FAISS ranks something high
Rerank pushes it down

Or vice versa.

Here’s what this looks like in practice:

5. Score distribution

ax.plot(faiss_scores, label="FAISS")
ax.plot(rerank_scores, label="Rerank")

Here’s what this looks like in practice:

Why this matters

Now when something goes wrong, we can actually trace it:

If FAISS results are bad → embedding / chunking issue
If FAISS is good but rerank is bad → rerank model issue
If both are good → prompt or answer generation issue

This turns debugging into a structured process instead of trial and error.

Final Architecture

Conclusion

At the beginning of this series, we had a simple goal: understand how RAG works by building it ourselves.

Over time, that evolved into something more practical.

We now have a system that:

Handles conversational queries
Deals with vague follow-ups
Respects document structure
Provides visibility into retrieval

More importantly, It’s no longer a black box — it’s a system you can reason about.

What changed

We moved from a basic pipeline:

Search → Answer

to something closer to:

Understand → Retrieve → Validate → Explain → Debug

Where this leaves us

This is no longer just a learning project. It’s a foundation we can build on.

What’s next

In the next series, we’ll move beyond retrieval and look at something more powerful:

Tool-using AI systems
Offline LLM capabilities
Building agents that can take actions

Code

Main branch:
https://github.com/SharathKurup/chatPDF

If you’ve followed along from Part 1 to here, you’ve already done the hard part — building the system from scratch and understanding how it behaves.

From here, it’s no longer about understanding RAG — it’s about building on top of it.

DEV Community

Engineering RAG Systems That Actually Work: Conversational Retrieval, Page Awareness & Debugging (Part 5)

Introduction

What was still missing?

Goal for this part

1. Conversational Retrieval

Making queries context-aware

HyDE (used with context)

Passing conversation into the final answer

Result

2. Page-Aware Search

The problem

The fix: introduce a deterministic path

Detecting page intent

Restricting search space

Why this works well

Final behavior

3. Handling Follow-Up Queries

The problem

Detecting vague queries

Injecting context from previous answers

What this achieves

Result

4. Debugging the RAG Pipeline

The usual situation

Adding visibility

1. Query transformation

2. FAISS retrieval results

3. Reranking results

4. FAISS vs rerank comparison

5. Score distribution

Why this matters

Final Architecture

Conclusion

What changed

Where this leaves us

What’s next

Code

Top comments (0)