DEV Community: Sharath Kurup

Engineering RAG Systems That Actually Work: Conversational Retrieval, Page Awareness & Debugging (Part 5)

Sharath Kurup — Mon, 18 May 2026 01:16:00 +0000

From a working prototype to something that actually behaves like a real system.

Introduction

In the previous parts, we built a fairly solid RAG pipeline:

Vector search using FAISS
Better chunking strategies
Context optimization
HyDE for improving retrieval quality

At that point, the system worked. It could answer questions from a document.

But it still didn’t feel right.

If you’ve used systems like ChatGPT, you know the difference immediately. The answers are not just correct — they are contextual, conversational, and consistent across turns.

Our RAG system wasn’t there yet.

What was still missing?

A few practical issues started showing up:

Follow-up questions like “explain more” didn’t work well
Queries without enough context failed silently
Retrieval sometimes returned semantically similar but irrelevant chunks
There was no clear way to debug why something went wrong

This part focuses on solving exactly those problems.

Goal for this part

Move from:

Query → Retrieve → Answer

To something closer to:

Understand → Retrieve → Validate → Answer → Debug

1. Conversational Retrieval

A basic RAG pipeline treats every query independently. That works for demos, but not for real usage.

Users don’t ask perfectly structured questions every time. They ask things like:

“Explain this”
“Continue”
“What does that mean?”

Without context, these queries are meaningless.

Making queries context-aware

Instead of using only the current query, we include recent conversation history.

You’re already doing this in two places.

HyDE (used with context)

def generate_hypothetical_answer(query, chat_history):
    recent_user = [m["content"] for m in chat_history[-4:] if m["role"] == "user"]
    history_text = " | ".join(recent_user[-2:])

    prompt = (f"Write a 2-sentence technical summary answering: {query}\n"
              f"Recent user context: {history_text}")

    response = ollama.generate(model=HYDE_MODEL, prompt=prompt, stream=False)
    return response['response']

This makes the generated “hypothetical answer” more aligned with the conversation, not just the current query.

Passing conversation into the final answer

history_text = "\n".join([
    f"{history['role']}: {history['content']}"
    for history in chat_history[-5:]
])

This allows the model to generate responses that are consistent across turns.

Result

Queries become less brittle
Retrieval improves for vague inputs
Responses feel connected instead of isolated

Conversational Retrieval Flow
(User query + history → query enhancement → retrieval → answer)

2. Page-Aware Search

One thing becomes very obvious once we start testing real documents:

Not every query should be handled using semantic search.

The problem

If a user asks:

“What is written on page 5?”

A pure vector search might return:

Content from page 3 that looks semantically similar
Completely miss the actual page

This is technically “correct” from a similarity perspective, but practically wrong.

The fix: introduce a deterministic path

Instead of forcing everything through FAISS, we add a second mode.

Detecting page intent

page_match = re.search(r"page\s+(\d+)", query.lower())
target_page = int(page_match.group(1)) if page_match else None

Restricting search space

allowed_pages = [target_page - 1, target_page, target_page + 1]

candidates = [
    item for item in text_metadata
    if item["page"] in allowed_pages
]

Why this works well

removes ambiguity
reduce noise in retrieval
align results with user intent

It’s a simple idea, but it has a big impact on reliability.

Final behavior

Page-specific queries → deterministic filtering
General queries → semantic search (FAISS + rerank)

Page-aware vs Semantic Flow

3. Handling Follow-Up Queries

This is where the system starts to feel more natural.

The problem

A query like:

Explain more

has no standalone meaning.

Without additional context, retrieval either fails or returns random results.

Detecting vague queries

vague_followups = re.compile(
    r"^(explain(\s+\w+){0,3}|elaborate|tell me more|go on|continue|...)",
    re.IGNORECASE
)

Injecting context from previous answers

last_pages = get_last_referenced_pages(st.session_state.messages)

if last_pages:
    page_inject = last_pages[len(last_pages) // 2]
    query = f"{query} page {page_inject}"

What this achieves

A vague query like:

Explain more

gets transformed into something like:

Explain more page 5

Now retrieval has something concrete to work with.

Result

Follow-up queries become usable
Conversations feel continuous
Retrieval remains grounded in the document

Follow-up Handling Flow

4. Debugging the RAG Pipeline

This is probably the most important upgrade in this part.

Most RAG systems fail not because they are fundamentally broken, but because we can’t see what’s happening inside them.

The usual situation

We get a wrong answer and we don’t know:

Was the query bad?
Was retrieval wrong?
Did reranking fail?
Did the model hallucinate?

Everything looks like a black box.

Adding visibility

So, we introduce a debug layer that exposes each stage.