DEV Community

Sharath Kurup

Understanding RAG by Building a ChatPDF App: From NumPy to FAISS (Part 2)

⚡ From NumPy to FAISS: Making ChatPDF Fast & Scalable – Part 2

In Part 1, we made it work.
In Part 2, we make it usable 🚀


📌 Recap from Part 1

In Part 1, we built a ChatPDF app using:

  • PDF → Text → Chunks
  • Embeddings using Ollama
  • Similarity search using NumPy
  • LLM to generate answers

It worked well for small PDFs and helped us understand RAG from first principles.

But once I started testing with slightly larger PDFs…


😅 The Problem Started Showing Up

The issue was not correctness — it was performance.

Let’s revisit what we were doing during search.


❌ NumPy Search (What we had before)

```python
similarities = np.dot(vector_db, query_vector)
top_indices = np.argsort(similarities)[-TOP_K:][::-1]
```

🧠 What’s actually happening here?

Every time you ask a question:

  1. Compute similarity with every chunk
  2. Store all similarity scores
  3. Sort the entire list
  4. Pick top K

🚨 Why this becomes a problem

  • Time complexity → O(n) dot products plus an O(n log n) sort per query
  • More chunks = slower search
  • Entire dataset scanned every time
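To see the full cost, the brute-force search above can be run end to end on toy data (the sizes and `TOP_K` here are made up for illustration):

```python
import numpy as np

# Toy data: 5 chunk embeddings of dimension 4 (real embeddings are much larger)
vector_db = np.random.rand(5, 4).astype(np.float32)
query_vector = np.random.rand(4).astype(np.float32)
TOP_K = 2

# Brute force: score EVERY chunk, sort ALL scores, then slice the top K
similarities = np.dot(vector_db, query_vector)   # O(n) dot products
top_indices = np.argsort(similarities)[-TOP_K:][::-1]

print(top_indices)  # indices of the 2 most similar chunks, best first
```

Every line of this runs again for every question you ask — that is the part FAISS replaces.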

To make this visible, I added timing:

```python
start_time = time.perf_counter()
# similarity logic
end_time = time.perf_counter()
execution_time = end_time - start_time
print(f"Total time with numpy: {execution_time:.4f}s")
```

And as the number of chunks increased…
⏳ the delay became noticeable.


💡 So What’s the Solution?

Instead of:

“Search through everything every time”

We need:

“A system that knows where to look.”


🔍 Let’s Visualize the Problem (This is the key moment)

👉 This is where the real difference becomes obvious:

*(Diagram: Brute Force vs Index Search)*

💡 What this diagram shows:

  • NumPy → scans every single chunk
  • FAISS → directly jumps to the most relevant results

This is the exact shift from:

brute force → intelligent retrieval


🚀 Introducing FAISS

FAISS (Facebook AI Similarity Search) is built for:

  • Fast vector similarity search
  • Efficient indexing
  • Handling large datasets

The key idea:

👉 Build an index once → search efficiently many times


🔄 Step 1: Moving from Raw Vectors → FAISS Index


❌ Before (NumPy mindset)

We stored vectors like this:

```python
vector_db = np.array(vectors, dtype=np.float32)
```

That’s it.

No structure. No optimization. Just raw data.


✅ After (FAISS approach)

```python
vector_np = np.array(vectors).astype('float32')

dimension = vector_np.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(vector_np)
```

🧠 Let’s understand this properly


1️⃣ Converting to float32

```python
vector_np = np.array(vectors).astype('float32')
```

FAISS requires vectors in float32.

Even if your embeddings are already floats, doing this ensures:

  • Compatibility
  • No runtime surprises

2️⃣ Getting the dimension

```python
dimension = vector_np.shape[1]
```

Each embedding looks like:

```
[0.12, -0.45, 0.88, ...]
```

The number of elements = dimension

FAISS needs this to build the index correctly.


3️⃣ Creating the index

```python
index = faiss.IndexFlatIP(dimension)
```

  • IndexFlatIP → Inner Product search
  • Since embeddings are normalized → 👉 Inner Product ≈ Cosine Similarity

So we are essentially saying:

“Store these vectors and allow fast similarity-based search.”


4️⃣ Adding vectors to FAISS

```python
index.add(vector_np)
```

This step:

  • Loads all embeddings into FAISS
  • Builds the internal structure

👉 From here, we stop thinking in terms of arrays and start thinking in terms of an index


🎯 Big Concept Shift

| NumPy | FAISS |
| --- | --- |
| Raw vectors | Indexed vectors |
| Manual search | Optimized search |
| Full scan | Smart retrieval |

🔍 Step 2: Searching with FAISS


❌ Before (NumPy)

```python
similarities = np.dot(vector_db, query_vector)
top_indices = np.argsort(similarities)[-TOP_K:][::-1]
```

✅ After (FAISS)

```python
distances, indices = index.search(query_vector.reshape(1, -1), k=TOP_K)
```

🧠 Let’s break this down


1️⃣ Why reshape?

```python
query_vector.reshape(1, -1)
```

FAISS expects:

```
[number_of_queries, dimension]
```

Even a single query must be shaped like:

```
[[embedding]]
```

2️⃣ What does search() do?

```python
distances, indices = index.search(...)
```

FAISS:

  • Finds nearest vectors
  • Sorts internally
  • Returns top K

3️⃣ Mapping results back

```python
[text_metadata[i] for i in indices[0]]
```

We use indices to fetch:

  • Actual text chunks
  • Page numbers

💡 Why this is powerful

Instead of:

  • Writing similarity logic ❌
  • Writing sorting logic ❌

You now:

👉 Call one optimized function


💾 Step 3: Avoid Recomputing Everything


🚨 Problem in Part 1

Every run:

  • Read PDF
  • Chunk text
  • Generate embeddings
  • Build vectors

✅ Solution: Save the Index

```python
faiss.write_index(index, "db/index.faiss")

with open("db/metadata.pkl", "wb") as f:
    pickle.dump(data, f)
```

🧠 What are we saving?

  • FAISS index → vector structure
  • Metadata → chunk + page info
  • PDF hash → detect changes

🔁 Loading instead of recomputing

```python
index = faiss.read_index("db/index.faiss")
```

Now:

  • ⚡ Faster startup
  • ❌ No repeated embedding calls

🔐 Step 4: Detecting PDF Changes


```python
def calculate_pdf_hash():
    sha256_hash = hashlib.sha256()
    sha256_hash.update(open(PDF_PATH, "rb").read())  # PDF_PATH: the app's PDF file
    return sha256_hash.hexdigest()
```

🧠 Why this matters

If the PDF changes:

  • Old embeddings become invalid

So we:

  • Generate hash
  • Compare with stored hash
  • Rebuild only if needed

👉 Small addition, big impact.
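A path-parameterized sketch of the whole check, exercised on a throwaway file (the chunked read and the `path` argument are my additions; in the app the stored hash would come from `metadata.pkl`):

```python
import hashlib
import os
import tempfile

def calculate_pdf_hash(path):
    """Hash the raw PDF bytes so any edit changes the digest."""
    sha256_hash = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(8192), b""):  # stream large files
            sha256_hash.update(block)
    return sha256_hash.hexdigest()

# Simulate the rebuild check with a throwaway file
pdf_path = os.path.join(tempfile.mkdtemp(), "doc.pdf")
with open(pdf_path, "wb") as f:
    f.write(b"%PDF-1.4 fake content")

stored_hash = calculate_pdf_hash(pdf_path)      # what metadata.pkl would hold
needs_rebuild = calculate_pdf_hash(pdf_path) != stored_hash
print(needs_rebuild)  # → False: file unchanged, reuse the saved index
```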


🔥 Step 5: Improving Retrieval with Re-ranking


Even FAISS isn’t perfect.

So we add another layer:

```python
results = ranker.rerank(rerank_request)
```

🧠 What’s happening here?

  1. FAISS retrieves top 10 chunks
  2. Re-ranker evaluates relevance
  3. Returns best TOP_K
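The three steps above form the retrieve-then-rerank pattern. A minimal sketch of the control flow, using a hypothetical keyword-overlap scorer in place of the real re-ranking model:

```python
# Stand-in re-ranker: in the real app, `ranker` is a re-ranking model;
# a hypothetical keyword-overlap score illustrates the control flow only.
def rerank(query, passages, top_k):
    q_words = set(query.lower().split())

    def score(passage):
        return len(q_words & set(passage.lower().split()))

    return sorted(passages, key=score, reverse=True)[:top_k]

faiss_top10 = [  # pretend these are the candidate chunks FAISS returned
    "FAISS retrieves candidate chunks quickly.",
    "The weather is nice today.",
    "Re-ranking orders chunks by true relevance.",
]
best = rerank("how does re-ranking order chunks", faiss_top10, top_k=2)
print(best[0])  # → "Re-ranking orders chunks by true relevance."
```

The point is the shape of the pipeline: retrieve a generous candidate set cheaply, then spend a more expensive scorer only on those candidates.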

📊 Debug visibility

```python
print("--- Re-ranker Scores ---")
```

Helps you:

  • Understand ranking
  • Debug results

💬 Step 6: Streaming Responses (UX Upgrade)


```python
for chunk in generate_answer(user_query, context_llm):
    print(chunk['response'], end='', flush=True)
```

🧠 Why this matters

  • Feels real-time
  • Improves perceived speed
  • Better experience
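You can exercise the streaming loop with a fake generator (the dict shape with a `"response"` key mirrors the snippet above; the real `generate_answer` wraps the LLM client):

```python
# Stand-in for generate_answer(): a generator that yields response pieces
# as they arrive, mimicking the streaming dicts the LLM client returns
def generate_answer(user_query, context_llm):
    for token in ["FAISS ", "makes ", "retrieval ", "fast."]:
        yield {"response": token}

answer = ""
for chunk in generate_answer("demo question", "demo context"):
    print(chunk["response"], end="", flush=True)  # tokens appear as they stream
    answer += chunk["response"]
print()  # final newline after the streamed answer
```

Because the loop consumes a generator, the first token can hit the terminal before the model has finished the sentence — that is the entire perceived-speed win.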

🔁 Final System (Let’s Visualize It)

👉 This is what your ChatPDF system looks like now:

*(Diagram: RAG Pipeline)*


🧠 What this diagram represents

  • Query → converted into embedding
  • FAISS → retrieves relevant chunks
  • Re-ranker → improves quality
  • LLM → generates final answer

👉 This is a complete RAG pipeline


🚀 What We Achieved

| Feature | Part 1 (NumPy) | Part 2 (FAISS) |
| --- | --- | --- |
| Search | Brute force | Indexed ⚡ |
| Speed | Slow | Fast |
| Persistence | None | Saved index |
| Accuracy | Basic | Improved |
| UX | Basic | Streaming |

🧠 Final Thoughts

This is where things became real.

From “I understand RAG”
to
“I can build something scalable”


If you’re learning RAG:

  • Start with NumPy ✅
  • Move to FAISS ✅

That transition is where the real understanding happens.


📂 Project Repo

👉 https://github.com/SharathKurup/chatPDF/tree/faiss_indexing


🔜 What’s Next?

In Part 3:

👉 We’ll build a Streamlit UI
👉 Turn this into a proper app


💬 Let’s Connect

If you're building something similar or exploring local LLMs, I’d love to hear your thoughts 👇
