⚡ Understanding RAG by Building a ChatPDF App: Better Chunking & Smarter Context
In Part 1, we made it work.
In Part 2, we made it fast.
In Part 3, things got… interesting 😅
📌 Recap from Part 1 & 2
In the previous parts:
👉 Part 1
- Built a basic RAG pipeline using NumPy
- Understood embeddings + similarity search
👉 Part 2
- Switched to FAISS for fast retrieval ⚡
- Added persistence + re-ranking
At this point, everything looked solid.
😅 But Something Still Felt Off
I started testing with real questions…
query = "What is FAISS indexing?"
And sometimes the answer would:
- Talk about embeddings instead
- Miss key details
- Or feel… slightly off
🤔 The weird part?
The answer was actually present in the document.
But we weren’t retrieving the right chunk.
🧠 The Real Problem Was Not Search
FAISS was doing its job.
The issue was earlier in the pipeline:
We were feeding it bad chunks.
🔍 Let’s Look at the Old Chunking Logic
def generate_chunks(text, page_num):
    chunks = []
    i = 0
    while i < len(text):
        end = min(i + CHUNK_SIZE, len(text))
        chunk = text[i:end]
        # avoid cutting a word in half
        if end < len(text):
            last_space = chunk.rfind(" ")
            if last_space != -1:
                end = i + last_space
                chunk = text[i:end]
        chunks.append({"text": chunk, "page": page_num})
        if end >= len(text):
            break  # done; without this, the overlap step would loop forever
        i = end - OVERLAP_SIZE
    return chunks
🧠 What This Was Doing
- Split text using fixed size
- Avoid breaking words
- Add overlap
Looks reasonable… right?
🚨 Where It Breaks
Let’s take a simple example:
Original:
"FAISS is a library for efficient similarity search. It is widely used in RAG systems."
Now imagine this gets chunked like:
Chunk 1:
"FAISS is a library for efficient similarity"
Chunk 2:
"search. It is widely used in RAG systems"
💥 What just happened?
- The sentence got split
- Meaning got split
- Embeddings lost context
Embeddings don’t understand fragments
They understand complete ideas
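Here is a minimal, self-contained sketch of that failure mode (the `CHUNK_SIZE` value is arbitrary, chosen to land mid-word; overlap is omitted for clarity):

```python
# Naive fixed-size chunking in miniature: slice every CHUNK_SIZE characters.
# Watch the word "search" get cut in half at the chunk boundary.
text = ("FAISS is a library for efficient similarity search. "
        "It is widely used in RAG systems.")
CHUNK_SIZE = 45
chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
for c in chunks:
    print(repr(c))
```

Neither chunk contains the complete sentence, so neither embedding captures the complete idea.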
🔍 Let’s Visualize This
Let’s visualize what was actually happening 👇
👉 Notice how sentences are broken across chunks —
this is exactly what degrades retrieval quality.
💡 The Shift in Thinking
Instead of:
“Split text by size”
We need:
“Split text by meaning”
🚀 Step 1: Recursive Chunking (Respect Structure)
✅ New Approach
def generate_chunks_recursive(text, page_num, chunk_size, overlap_size):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        if end >= len(text):
            # last piece: take everything that remains
            chunks.append({"text": text[start:].strip(), "page": page_num})
            break
        chunk_slice = text[start:end]
        # try separators from most to least structural
        for separator in ["\n\n", "\n", ". ", " "]:
            last_break = chunk_slice.rfind(separator)
            if last_break != -1:
                if separator == ". ":
                    last_break += 1  # keep the period with its sentence
                break
        else:
            last_break = chunk_size  # no separator found: hard cut
        actual_end = start + last_break
        final_chunk = text[start:actual_end].strip()
        chunks.append({"text": final_chunk, "page": page_num})
        start = max(actual_end - overlap_size, start + 1)  # always move forward
    return chunks
🧠 What Changed Here?
Instead of blindly splitting, we now:
for separator in ["\n\n", "\n", ". ", " "]:
We try:
- Paragraph
- Line
- Sentence
- Word
👉 This is a priority-based splitting strategy
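A stripped-down sketch of that priority search (`find_break` is a hypothetical helper, not the article's function):

```python
# Try separators from most structural (paragraph) to least (word) and cut
# at the last occurrence of the first separator that appears at all.
def find_break(chunk_slice, separators=("\n\n", "\n", ". ", " ")):
    for sep in separators:
        pos = chunk_slice.rfind(sep)
        if pos != -1:
            # keep the period attached to the sentence it ends
            return pos + 1 if sep == ". " else pos
    return len(chunk_slice)  # no separator at all: hard cut

print(find_break("First paragraph.\n\nSecond paragraph continues"))  # cuts at the blank line
print(find_break("One sentence. Another sentence here"))             # cuts after the period
```

Because `"\n\n"` is checked first, a paragraph break always wins over a sentence break, which wins over a plain space.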
💡 Why This Works Better
- Paragraphs stay intact
- Sentences stay intact
- Meaning stays intact
✅ Micro Summary
- What changed: Structure-aware chunking
- Why it matters: Better embeddings → better retrieval
🔁 Step 2: Overlap Still Matters
We still keep overlap:
[Chunk 1] "RAG systems work by retrieving relevant context"
[Chunk 2] "retrieving relevant context from documents"
🧠 Why This Is Important
- Prevents context gaps
- Keeps continuity between chunks
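A tiny sketch of sliding windows with overlap (`windows` is a hypothetical helper; the numbers are picked just for the demo):

```python
# Consecutive chunks advance by (chunk_size - overlap_size), so each chunk
# shares its last overlap_size characters with the start of the next one.
def windows(text, chunk_size, overlap_size):
    out, start = [], 0
    while start < len(text):
        out.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap_size
    return out

parts = windows("retrieving relevant context from documents", 25, 10)
print(parts)  # each chunk's tail reappears at the head of the next
```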
🔥 Step 3: Storing Full Context (Big Upgrade)
def generate_advanced_chunks(page_content, page_num):
    search_chunks = generate_chunks_recursive(...)
    for chunk in search_chunks:
        # tag each chunk with its page and keep the whole page alongside it
        chunk["text"] = f"[Page {page_num}] {chunk['text']}"
        chunk["full_context"] = page_content
    return search_chunks
🧠 Why This Matters
Earlier:
👉 We only stored chunk text
Now:
👉 We also store the entire page
💡 What This Enables
- Better answer generation
- Flexibility for later processing
- Smarter context selection
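Here is what one stored chunk looks like after this upgrade, sketched end to end (`split_sentences` is a hypothetical stand-in for the recursive chunker):

```python
# Each search chunk carries a page tag in its text plus a back-pointer to
# the entire page it came from.
def split_sentences(page_content):
    return [s.strip() for s in page_content.split(". ") if s.strip()]

def attach_full_context(page_content, page_num):
    chunks = [{"text": s, "page": page_num} for s in split_sentences(page_content)]
    for chunk in chunks:
        chunk["text"] = f"[Page {page_num}] {chunk['text']}"
        chunk["full_context"] = page_content  # the whole page travels along
    return chunks

page = "FAISS is fast. Chunking still matters."
chunks = attach_full_context(page, 3)
print(chunks[0]["text"])
```

Retrieval still matches against the small chunk, but answer generation can now reach back to the full page.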
🚨 New Problem Introduced
Now that we store full pages…
We started sending too much data to the LLM.
❌ Problem
- Large token usage
- Slower responses
- Irrelevant information
🚀 Step 4: Context Compression
import re

MAX_SENTENCES = 5  # tune to your token budget

def compress_context(query, full_text):
    # split on sentence-ending punctuation followed by a space
    sentences = re.split(r'(?<=[.!?]) +', full_text)
    query_words = set(query.lower().split())
    scored_sentences = []
    for s in sentences:
        # crude relevance: how many query words appear in this sentence
        score = sum(1 for word in s.lower().split() if word in query_words)
        scored_sentences.append((score, s))
    top_sentences = sorted(scored_sentences, key=lambda x: x[0], reverse=True)[:MAX_SENTENCES]
    return " ".join(s for _, s in top_sentences)
🧠 What’s Happening Here?
- Break into sentences
- Score relevance using query
- Keep only top sentences
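Put together, the idea looks like this end to end (a self-contained restatement; `max_sentences` plays the role of `MAX_SENTENCES`):

```python
import re

def compress(query, full_text, max_sentences=2):
    # split on sentence-ending punctuation followed by a space
    sentences = re.split(r'(?<=[.!?]) +', full_text)
    query_words = set(query.lower().split())
    # score = number of query words that appear verbatim in the sentence
    scored = [(sum(w in query_words for w in s.lower().split()), s) for s in sentences]
    scored.sort(key=lambda t: t[0], reverse=True)  # stable sort keeps ties in document order
    return " ".join(s for _, s in scored[:max_sentences])

text = ("FAISS builds an index over vectors. "
        "The weather was nice that day. "
        "FAISS indexing supports several algorithms.")
print(compress("what is FAISS indexing", text))
```

The off-topic weather sentence scores zero and is dropped; the two FAISS sentences survive.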
🔍 Visualize It
Before:
Full page → 1000+ tokens ❌
After:
Relevant sentences → smaller context ✅
💡 Why This Is Powerful
- Faster responses
- Better relevance
- Lower token usage
✅ Micro Summary
- What changed: Context filtering
- Why it matters: Less noise → better answers
🔍 Retrieval Still Uses FAISS + Re-ranking
distances, indices = index.search(query_vector.reshape(1,-1), k=10)
results = ranker.rerank(rerank_request)
🧠 Flow
- FAISS → fast retrieval
- Re-ranker → improves relevance
- Top results → passed to LLM
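The flow in miniature, with a brute-force stand-in for the FAISS call so the sketch has no external dependencies (real code would call `index.search` and the re-ranker):

```python
# Brute-force nearest neighbours by squared L2 distance, the same metric a
# flat FAISS index (IndexFlatL2) uses. A re-ranker would then reorder these
# top-k hits before the winners are passed to the LLM.
def top_k(index_vectors, query_vec, k):
    scored = [
        (sum((a - b) ** 2 for a, b in zip(vec, query_vec)), i)
        for i, vec in enumerate(index_vectors)
    ]
    scored.sort()
    return scored[:k]

vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
hits = top_k(vectors, [1.0, 0.0], k=2)
print(hits)  # (distance, index) pairs, closest first
```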
💬 Smarter Answer Generation
compressed_text = compress_context(query, res['full_context'])
👉 Instead of raw chunks, we now send:
Focused, relevant context
🔁 Final System
🧠 What Changed Overall
| Feature | Part 1 | Part 2 | Part 3 |
|---|---|---|---|
| Search | NumPy | FAISS | FAISS + Rerank |
| Chunking | Basic | Basic | Recursive 🧠 |
| Context | Raw | Raw | Compressed 🔥 |
| Accuracy | Low | Medium | High |
🧠 Final Thought
This is where things clicked for me.
I kept thinking better models would fix my system…
But the real issue was:
Bad context in → bad answers out
Most people focus on:
- Models ❌
- Vector DB ❌
But the real gains came from:
👉 Chunking
👉 Context handling
🔜 What’s Next?
Now things get even more interesting.
In Part 4:
👉 We’ll move beyond basic retrieval and make the system smarter
- Token-aware chunking
- Better query understanding
- More intelligent retrieval
💬 Let’s Connect
If you're building something similar or experimenting with local LLMs, I’d love to hear your thoughts 👇


