DEV Community

Cover image for Engineering RAG Systems That Actually Work: Conversational Retrieval, Page Awareness & Debugging (Part 5)
Sharath Kurup
Sharath Kurup

Posted on

Engineering RAG Systems That Actually Work: Conversational Retrieval, Page Awareness & Debugging (Part 5)

Final Rag Architecture

From a working prototype to something that actually behaves like a real system.


Introduction

In the previous parts, we built a fairly solid RAG pipeline:

  • Vector search using FAISS
  • Better chunking strategies
  • Context optimization
  • HyDE for improving retrieval quality

At that point, the system worked. It could answer questions from a document.

But it still didn’t feel right.

If you’ve used systems like ChatGPT, you know the difference immediately. The answers are not just correct — they are contextual, conversational, and consistent across turns.

Our RAG system wasn’t there yet.


What was still missing?

A few practical issues started showing up:

  • Follow-up questions like “explain more” didn’t work well
  • Queries without enough context failed silently
  • Retrieval sometimes returned semantically similar but irrelevant chunks
  • There was no clear way to debug why something went wrong

This part focuses on solving exactly those problems.


Goal for this part

Move from:

Query → Retrieve → Answer
Enter fullscreen mode Exit fullscreen mode

To something closer to:

Understand → Retrieve → Validate → Answer → Debug
Enter fullscreen mode Exit fullscreen mode

1. Conversational Retrieval

A basic RAG pipeline treats every query independently. That works for demos, but not for real usage.

Users don’t ask perfectly structured questions every time. They ask things like:

  • “Explain this”
  • “Continue”
  • “What does that mean?”

Without context, these queries are meaningless.


Making queries context-aware

Instead of using only the current query, we include recent conversation history.

You’re already doing this in two places.


HyDE (used with context)

def generate_hypothetical_answer(query, chat_history):
    recent_user = [m["content"] for m in chat_history[-4:] if m["role"] == "user"]
    history_text = " | ".join(recent_user[-2:])

    prompt = (f"Write a 2-sentence technical summary answering: {query}\n"
              f"Recent user context: {history_text}")

    response = ollama.generate(model=HYDE_MODEL, prompt=prompt, stream=False)
    return response['response']
Enter fullscreen mode Exit fullscreen mode

This makes the generated “hypothetical answer” more aligned with the conversation, not just the current query.


Passing conversation into the final answer

history_text = "\n".join([
    f"{history['role']}: {history['content']}"
    for history in chat_history[-5:]
])
Enter fullscreen mode Exit fullscreen mode

This allows the model to generate responses that are consistent across turns.


Result

  • Queries become less brittle
  • Retrieval improves for vague inputs
  • Responses feel connected instead of isolated

Conversational Retrieval Flow
(User query + history → query enhancement → retrieval → answer)

Conversational Retrieval Flow

2. Page-Aware Search

One thing becomes very obvious once we start testing real documents:

Not every query should be handled using semantic search.


The problem

If a user asks:

“What is written on page 5?”

A pure vector search might return:

  • Content from page 3 that looks semantically similar
  • Completely miss the actual page

This is technically “correct” from a similarity perspective, but practically wrong.


The fix: introduce a deterministic path

Instead of forcing everything through FAISS, we add a second mode.


Detecting page intent

page_match = re.search(r"page\s+(\d+)", query.lower())
target_page = int(page_match.group(1)) if page_match else None
Enter fullscreen mode Exit fullscreen mode

Restricting search space

allowed_pages = [target_page - 1, target_page, target_page + 1]

candidates = [
    item for item in text_metadata
    if item["page"] in allowed_pages
]
Enter fullscreen mode Exit fullscreen mode

Why this works well

  • removes ambiguity
  • reduce noise in retrieval
  • align results with user intent

It’s a simple idea, but it has a big impact on reliability.


Final behavior

  • Page-specific queries → deterministic filtering
  • General queries → semantic search (FAISS + rerank)

Page-aware vs Semantic Flow

Page-aware vs Semantic Flow

3. Handling Follow-Up Queries

This is where the system starts to feel more natural.


The problem

A query like:

Explain more
Enter fullscreen mode Exit fullscreen mode

has no standalone meaning.

Without additional context, retrieval either fails or returns random results.


Detecting vague queries

vague_followups = re.compile(
    r"^(explain(\s+\w+){0,3}|elaborate|tell me more|go on|continue|...)",
    re.IGNORECASE
)
Enter fullscreen mode Exit fullscreen mode

Injecting context from previous answers

last_pages = get_last_referenced_pages(st.session_state.messages)

if last_pages:
    page_inject = last_pages[len(last_pages) // 2]
    query = f"{query} page {page_inject}"
Enter fullscreen mode Exit fullscreen mode

What this achieves

A vague query like:

Explain more
Enter fullscreen mode Exit fullscreen mode

gets transformed into something like:

Explain more page 5
Enter fullscreen mode Exit fullscreen mode

Now retrieval has something concrete to work with.


Result

  • Follow-up queries become usable
  • Conversations feel continuous
  • Retrieval remains grounded in the document

Follow-up Handling Flow

Follow-up Handling Flow

4. Debugging the RAG Pipeline

This is probably the most important upgrade in this part.

Most RAG systems fail not because they are fundamentally broken, but because we can’t see what’s happening inside them.


The usual situation

We get a wrong answer and we don’t know:

  • Was the query bad?
  • Was retrieval wrong?
  • Did reranking fail?
  • Did the model hallucinate?

Everything looks like a black box.


Adding visibility

So, we introduce a debug layer that exposes each stage.


1. Query transformation

debug_data["query"] = query
debug_data["search_query"] = search_query
Enter fullscreen mode Exit fullscreen mode

This lets us compare:

  • Original query
  • Final query used for retrieval (after HyDE / modifications)

Here’s what this looks like in practice:

Query Analysis Panel

2. FAISS retrieval results

distances, indices = index.search(query_vector.reshape(1, -1), k=FAISS_SEARCH_K)
Enter fullscreen mode Exit fullscreen mode

Here’s what this looks like in practice:

FAISS Results

3. Reranking results

results = ranker.rerank(RerankRequest(query=query, passages=rerank_items))
Enter fullscreen mode Exit fullscreen mode

Here’s what this looks like in practice:

Re-Rank Results

4. FAISS vs rerank comparison

This is where things get interesting.

We can now see cases where:

  • FAISS ranks something high
  • Rerank pushes it down

Or vice versa.


Here’s what this looks like in practice:

FAISS vs Re-Rank Comparison

5. Score distribution

ax.plot(faiss_scores, label="FAISS")
ax.plot(rerank_scores, label="Rerank")
Enter fullscreen mode Exit fullscreen mode

Here’s what this looks like in practice:

Score Distribution Graph

Why this matters

Now when something goes wrong, we can actually trace it:

  • If FAISS results are bad → embedding / chunking issue
  • If FAISS is good but rerank is bad → rerank model issue
  • If both are good → prompt or answer generation issue

This turns debugging into a structured process instead of trial and error.


Final Architecture

Final Architecture


Conclusion

At the beginning of this series, we had a simple goal: understand how RAG works by building it ourselves.

Over time, that evolved into something more practical.

We now have a system that:

  • Handles conversational queries
  • Deals with vague follow-ups
  • Respects document structure
  • Provides visibility into retrieval

More importantly, It’s no longer a black box — it’s a system you can reason about.


What changed

We moved from a basic pipeline:

Search → Answer
Enter fullscreen mode Exit fullscreen mode

to something closer to:

Understand → Retrieve → Validate → Explain → Debug
Enter fullscreen mode Exit fullscreen mode

Where this leaves us

This is no longer just a learning project. It’s a foundation we can build on.


What’s next

In the next series, we’ll move beyond retrieval and look at something more powerful:

  • Tool-using AI systems
  • Offline LLM capabilities
  • Building agents that can take actions

Code

Main branch:
https://github.com/SharathKurup/chatPDF


If you’ve followed along from Part 1 to here, you’ve already done the hard part — building the system from scratch and understanding how it behaves.

From here, it’s no longer about understanding RAG — it’s about building on top of it.

Top comments (0)