Sharath Kurup
Understanding RAG by Building a ChatPDF App: Smarter Chunking & Context Optimization (Part 3)

[Image: RAG pipeline]


In Part 1, we made it work.
In Part 2, we made it fast.
In Part 3, things got… interesting 😅


📌 Recap from Part 1 & 2

In the previous parts:

👉 Part 1

  • Built a basic RAG pipeline using NumPy
  • Understood embeddings + similarity search

👉 Part 2

  • Switched to FAISS for fast retrieval ⚡
  • Added persistence + re-ranking

At this point, everything looked solid.


😅 But Something Still Felt Off

I started testing with real questions…

query = "What is FAISS indexing?"

And sometimes the answer would:

  • Talk about embeddings instead
  • Miss key details
  • Or feel… slightly off

🤔 The weird part?

The answer was actually present in the document.

But we weren’t retrieving the right chunk.


🧠 The Real Problem Was Not Search

FAISS was doing its job.

The issue was earlier in the pipeline:

We were feeding it bad chunks.


🔍 Let’s Look at the Old Chunking Logic

def generate_chunks(text, page_num):
    chunks = []
    i = 0
    while i < len(text):
        end = min(i + CHUNK_SIZE, len(text))
        chunk = text[i:end]

        # Avoid cutting a word in half: back up to the last space
        if end < len(text):
            last_space = chunk.rfind(" ")
            if last_space != -1:
                end = i + last_space
                chunk = text[i:end]

        chunks.append({"text": chunk, "page": page_num})
        if end >= len(text):
            break  # done; stepping back by OVERLAP_SIZE here would loop forever
        i = end - OVERLAP_SIZE

    return chunks

🧠 What This Was Doing

  • Split text using fixed size
  • Avoid breaking words
  • Add overlap

Looks reasonable… right?


🚨 Where It Breaks

Let’s take a simple example:

Original:
"FAISS is a library for efficient similarity search. It is widely used in RAG systems."

Now imagine this gets chunked like:

Chunk 1:
"FAISS is a library for efficient similarity"

Chunk 2:
"search. It is widely used in RAG systems"

💥 What just happened?

  • The sentence got split
  • Meaning got split
  • Embeddings lost context

Embeddings don’t understand fragments.
They understand complete ideas.
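To see the failure concretely, here is a tiny standalone sketch (the sample text and `CHUNK_SIZE` are illustrative, not from the app) of what a purely size-based split does:

```python
# Standalone demo: naive fixed-size chunking cuts right through
# the middle of a word and a sentence.
text = ("FAISS is a library for efficient similarity search. "
        "It is widely used in RAG systems.")

CHUNK_SIZE = 45  # deliberately small so the break is easy to see

chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
for c in chunks:
    print(repr(c))
# The word "search" ends up split across the two chunks
```

Neither chunk carries the complete idea, so neither embeds well.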


🔍 Let’s Visualize This

Let’s visualize what was actually happening 👇

[Image: fixed-size vs recursive chunking]


👉 Notice how sentences are broken across chunks —
this is exactly what degrades retrieval quality.


💡 The Shift in Thinking

Instead of:

“Split text by size”

We need:

“Split text by meaning”


🚀 Step 1: Recursive Chunking (Respect Structure)


✅ New Approach

def generate_chunks_recursive(text, page_num, chunk_size, overlap_size):
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunk_slice = text[start:end]

        # Try separators from most to least meaningful
        for separator in ["\n\n", "\n", ". ", " "]:
            last_break = chunk_slice.rfind(separator)
            if last_break != -1:
                if separator == ". ":
                    last_break += 1  # keep the period with its sentence
                break
        else:
            last_break = len(chunk_slice)  # no separator found: hard cut

        if last_break <= 0:
            last_break = len(chunk_slice)  # guard against empty chunks

        actual_end = start + last_break
        final_chunk = text[start:actual_end].strip()
        chunks.append({"text": final_chunk, "page": page_num})

        if actual_end >= len(text):
            break
        start = max(actual_end - overlap_size, start + 1)  # always move forward

    return chunks

🧠 What Changed Here?

Instead of blindly splitting, we now:

for separator in ["\n\n", "\n", ". ", " "]:

We try:

  1. Paragraph
  2. Line
  3. Sentence
  4. Word

👉 This is a priority-based splitting strategy


💡 Why This Works Better

  • Paragraphs stay intact
  • Sentences stay intact
  • Meaning stays intact
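The same priority idea can be shown in a compact, self-contained sketch (the helper name and sample text are mine, not from the app): inside each window, prefer a paragraph break, then a newline, then a sentence end, then a space.

```python
# Priority-based splitting: take the best available break point in each window.
def split_on_best_separator(text, chunk_size):
    chunks, start = [], 0
    while start < len(text):
        window = text[start:start + chunk_size]
        end = start + len(window)
        if end < len(text):  # only look for a break if we are not at the end
            for sep in ["\n\n", "\n", ". ", " "]:
                pos = window.rfind(sep)
                if pos > 0:
                    # keep the separator with the left chunk (strip cleans it up)
                    end = start + pos + (len(sep) if sep != " " else 0)
                    break
        chunks.append(text[start:end].strip())
        start = end
    return chunks

doc = "FAISS is fast. It powers retrieval.\n\nChunking decides what it sees."
print(split_on_best_separator(doc, 40))
# → ['FAISS is fast. It powers retrieval.', 'Chunking decides what it sees.']
```

Note how the paragraph break wins even though a sentence break and plenty of spaces are also available inside the window.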

✅ Micro Summary

  • What changed: Structure-aware chunking
  • Why it matters: Better embeddings → better retrieval

🔁 Step 2: Overlap Still Matters

We still keep overlap:

[Chunk 1]  "RAG systems work by retrieving relevant context"
[Chunk 2]         "retrieving relevant context from documents"

🧠 Why This Is Important

  • Prevents context gaps
  • Keeps continuity between chunks
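A minimal sketch of the mechanic (the CHUNK/OVERLAP constants here are illustrative): stepping the window forward by `CHUNK - OVERLAP` means every pair of consecutive chunks shares a strip of text.

```python
# Overlapping windows: each step forward re-covers the last OVERLAP characters.
text = "RAG systems work by retrieving relevant context from documents"
CHUNK, OVERLAP = 40, 15

chunks, start = [], 0
while start < len(text):
    chunks.append(text[start:start + CHUNK])
    start += CHUNK - OVERLAP  # advance, but keep a shared strip with the next chunk

print(chunks)
```

If an idea straddles a chunk boundary, the overlap gives at least one chunk a complete copy of it.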

🔥 Step 3: Storing Full Context (Big Upgrade)

def generate_advanced_chunks(page_content, page_num):
    search_chunks = generate_chunks_recursive(...)

    for chunk in search_chunks:
        chunk["text"] = f"[Page {page_num}] {chunk['text']}"
        chunk["full_context"] = page_content

    return search_chunks

🧠 Why This Matters

Earlier:

👉 We only stored chunk text

Now:

👉 We also store the entire page


💡 What This Enables

  • Better answer generation
  • Flexibility for later processing
  • Smarter context selection
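Here is a self-contained illustration of the upgrade (the page text and chunks are made up for the example): each searchable chunk carries a page tag plus a pointer back to the full page it came from.

```python
# Each small, embeddable chunk keeps a reference to its whole page.
page_num = 3
page_content = "FAISS is a library for similarity search. It is used in RAG systems."

search_chunks = [
    {"text": "FAISS is a library for similarity search.", "page": page_num},
    {"text": "It is used in RAG systems.", "page": page_num},
]

for chunk in search_chunks:
    chunk["text"] = f"[Page {page_num}] {chunk['text']}"  # tag for traceability
    chunk["full_context"] = page_content                  # keep the whole page

print(search_chunks[0]["text"])
```

Retrieval still matches against the small chunk, but answer generation can reach for the full page when it needs more context.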

🚨 New Problem Introduced

Now that we store full pages…

We started sending too much data to the LLM.


❌ Problem

  • Large token usage
  • Slower responses
  • Irrelevant information

🚀 Step 4: Context Compression

import re

def compress_context(query, full_text):
    sentences = re.split(r'(?<=[.!?]) +', full_text)
    query_words = set(query.lower().split())

    scored_sentences = []
    for s in sentences:
        score = sum(1 for word in s.lower().split() if word in query_words)
        scored_sentences.append((score, s))

    # MAX_SENTENCES caps how much context we send to the LLM
    top_sentences = sorted(scored_sentences, key=lambda x: x[0], reverse=True)[:MAX_SENTENCES]
    return " ".join([s for _, s in top_sentences])

🧠 What’s Happening Here?

  1. Break into sentences
  2. Score relevance using query
  3. Keep only top sentences
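A runnable, self-contained version of the same idea (the `MAX_SENTENCES` value and the sample page are illustrative) shows the effect: the off-topic sentence gets dropped.

```python
import re

MAX_SENTENCES = 2  # illustrative cap on how much context survives

def compress_context(query, full_text):
    # 1. Break into sentences, 2. score by query-word overlap, 3. keep the top few
    sentences = re.split(r'(?<=[.!?]) +', full_text)
    query_words = set(query.lower().split())
    scored = [(sum(w in query_words for w in s.lower().split()), s) for s in sentences]
    top = sorted(scored, key=lambda x: x[0], reverse=True)[:MAX_SENTENCES]
    return " ".join(s for _, s in top)

page = ("FAISS is a library for similarity search. "
        "It was created at Meta. "
        "FAISS indexing makes nearest-neighbour search fast.")
print(compress_context("What is FAISS indexing?", page))
```

It is a crude lexical score, not semantic relevance, but it is cheap and already trims most of the noise.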

🔍 Visualize It

Before:
Full page → 1000+ tokens ❌

After:
Relevant sentences → smaller context ✅

💡 Why This Is Powerful

  • Faster responses
  • Better relevance
  • Lower token usage

✅ Micro Summary

  • What changed: Context filtering
  • Why it matters: Less noise → better answers

🔍 Retrieval Still Uses FAISS + Re-ranking

distances, indices = index.search(query_vector.reshape(1, -1), k=10)  # FAISS: top-10 candidates
results = ranker.rerank(rerank_request)  # re-ranker: reorder candidates by relevance

🧠 Flow

  1. FAISS → fast retrieval
  2. Re-ranker → improves relevance
  3. Top results → passed to LLM
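To make the flow concrete without FAISS or a re-ranker installed, here is a NumPy stand-in (random vectors and a distance-based "rerank" are placeholders for the real components, so only the shape of the pipeline matches):

```python
import numpy as np

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(100, 8)).astype("float32")  # pretend chunk embeddings
query_vector = doc_vectors[42] + 0.01                       # a query very close to chunk 42

# Step 1: fast retrieval — top-10 candidates by L2 distance (what index.search returns)
dists = np.linalg.norm(doc_vectors - query_vector, axis=1)
top10 = np.argsort(dists)[:10]

# Step 2: re-ranking stand-in — re-score candidates, keep the best 3 for the LLM
rerank_scores = -dists[top10]  # a real re-ranker would use a cross-encoder here
top3 = top10[np.argsort(rerank_scores)[::-1][:3]]

print(top3)  # chunk 42 comes out first
```

The cheap stage narrows 100 vectors to 10 candidates; the expensive stage only has to rank those 10.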

💬 Smarter Answer Generation

compressed_text = compress_context(query, res['full_context'])

👉 Instead of raw chunks, we now send:

Focused, relevant context


🔁 Final System

[Image: final system architecture]


🧠 What Changed Overall

| Feature  | Part 1 | Part 2 | Part 3         |
| -------- | ------ | ------ | -------------- |
| Search   | NumPy  | FAISS  | FAISS + Rerank |
| Chunking | Basic  | Basic  | Recursive 🧠   |
| Context  | Raw    | Raw    | Compressed 🔥  |
| Accuracy | Low    | Medium | High           |

🧠 Final Thought

This is where things clicked for me.

I kept thinking better models would fix my system…

But the real issue was:

Bad context in → bad answers out


Most people focus on:

  • Models ❌
  • Vector DB ❌

But the real gains came from:

👉 Chunking
👉 Context handling


🔜 What’s Next?

Now things get even more interesting.

In Part 4:

👉 We’ll move beyond basic retrieval and make the system smarter

  • Token-aware chunking
  • Better query understanding
  • More intelligent retrieval

💬 Let’s Connect

If you're building something similar or experimenting with local LLMs, I’d love to hear your thoughts 👇

