Building a Chat with PDF App (From Scratch Using NumPy) - Part 1
Turning a simple PDF into a conversational AI system using local LLMs
Introduction
Have you ever wanted to chat with your PDF documents the way you chat with ChatGPT?
In this series, I'll walk you through building a ChatPDF application from scratch, starting from the absolute basics and gradually improving it into a production-ready system.
In this first part, we'll build a naive RAG (Retrieval-Augmented Generation) system using only NumPy: no FAISS, no vector databases, just pure fundamentals.
What We'll Build
By the end of this article, you'll have a system that:
- Reads a PDF
- Splits it into meaningful chunks
- Converts text into embeddings using a local model
- Searches for relevant content using vector similarity
- Generates answers using an LLM
Tech Stack
- pdfplumber: extract text from PDFs
- numpy: perform vector similarity search
- ollama: run local embedding and LLM models
How It Works (High Level)
Our pipeline looks like this:
PDF → Text → Chunks → Embeddings → Similarity Search → LLM → Answer
Step 1: Reading the PDF
We start by extracting text page by page:

```python
import pdfplumber

PDF_PATH = "document.pdf"  # path to your PDF

def readpdf():
    all_texts = []
    with pdfplumber.open(PDF_PATH) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            if not text.strip():
                continue  # skip pages with no extractable text
            all_texts.append((i + 1, text))  # store (page_number, text)
    return all_texts
```
What's happening?
- Reads each page
- Skips empty pages
- Stores (page_number, text) tuples
Step 2: Chunking the Text
Large blocks of text don't work well for embeddings or LLMs, so we split them:

```python
CHUNK_SIZE = 500    # characters per chunk
OVERLAP_SIZE = 50   # characters shared between consecutive chunks

def generate_chunks(text, page_num):
    chunks = []
    i = 0
    while i < len(text):
        end = min(i + CHUNK_SIZE, len(text))
        chunk = text[i:end]
        if end < len(text):
            # Avoid cutting a word in half: back up to the last space
            last_space = chunk.rfind(" ")
            if last_space != -1:
                end = i + last_space
                chunk = text[i:end]
        chunks.append({"text": chunk.strip(), "page": page_num})
        if end >= len(text):
            break  # final chunk reached; stepping back would loop forever
        i = end - OVERLAP_SIZE
    return chunks
```
Why overlap?
- Prevents context loss between chunks
- Helps the LLM understand continuity
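To see the overlap concretely, here is a simplified, self-contained sketch of the chunking idea (the word-boundary logic is dropped, and the toy `CHUNK_SIZE`/`OVERLAP_SIZE` values are illustrative, not the real config):

```python
CHUNK_SIZE = 20
OVERLAP_SIZE = 5

def generate_chunks(text, page_num):
    chunks = []
    i = 0
    while i < len(text):
        end = min(i + CHUNK_SIZE, len(text))
        chunks.append({"text": text[i:end], "page": page_num})
        if end == len(text):
            break  # final chunk reached; stepping back would loop forever
        i = end - OVERLAP_SIZE  # step back so neighbours share 5 characters
    return chunks

text = "".join(str(n % 10) for n in range(50))
chunks = generate_chunks(text, page_num=1)
# The tail of each chunk reappears at the head of the next one.
```

The tail of chunk 0 (`text[15:20]`) is exactly the head of chunk 1, so a sentence that straddles a boundary is fully visible in at least one chunk.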
Step 3: Generating Embeddings
We convert text into vectors using Ollama:

```python
import ollama

EMBED_MODEL = "nomic-embed-text"  # any local embedding model works
BATCH_SIZE = 32

def generate_embeddings_batch(texts):
    all_embeddings = []
    for i in range(0, len(texts), BATCH_SIZE):
        batch_texts = texts[i:i + BATCH_SIZE]
        response = ollama.embed(model=EMBED_MODEL, input=batch_texts)
        all_embeddings.extend(response["embeddings"])
    return all_embeddings
```
Why batching?
- Faster processing
- More efficient use of resources
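One detail worth noting: Python slicing already handles the final partial batch, so the loop above needs no special case. A tiny sketch with toy values:

```python
BATCH_SIZE = 3
texts = [f"chunk {n}" for n in range(8)]  # 8 items, not a multiple of 3

batches = [texts[i:i + BATCH_SIZE] for i in range(0, len(texts), BATCH_SIZE)]
# Slicing past the end of a list simply returns the shorter tail, so the
# last batch has 2 items and nothing is dropped or padded.
```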
Step 4: Similarity Search (Core Logic)
Here's where NumPy shines:

```python
similarities = np.dot(vector_db, query_vector)
top_indices = np.argsort(similarities)[-TOP_K:][::-1]
```

What's happening?
- We compute dot-product similarity between the query and every chunk
- Higher score = more relevant chunk
- We select the top K results

This is essentially a manual vector database built with NumPy.
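Here is a self-contained run of that search on toy data (the random vectors and the model-free setup are purely for illustration):

```python
import numpy as np

# Toy "vector database": five chunk embeddings of dimension four, unit-
# normalized so the dot product equals cosine similarity.
rng = np.random.default_rng(0)
vector_db = rng.normal(size=(5, 4))
vector_db /= np.linalg.norm(vector_db, axis=1, keepdims=True)

# A query identical to chunk 2 should rank chunk 2 first.
query_vector = vector_db[2]

TOP_K = 3
similarities = np.dot(vector_db, query_vector)         # one score per chunk
top_indices = np.argsort(similarities)[-TOP_K:][::-1]  # best scores first
```

`np.argsort` sorts ascending, so we take the last `TOP_K` indices and reverse them to get the highest-scoring chunks first.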
Step 5: Generate an Answer Using the LLM
We pass the retrieved chunks as context:

```python
THINKING_MODEL = "llama3"  # any local chat model works

def generate_answer(query, chunks):
    context_chunks = "\n\n".join(chunks)
    prompt = f"""
Context:
{context_chunks}

Question:
{query}

Answer:
"""
    response = ollama.generate(model=THINKING_MODEL, prompt=prompt)
    return response["response"]
```
Key Idea
We're doing RAG (Retrieval-Augmented Generation):
- Retrieval → find the relevant chunks
- Generation → the LLM produces the response
Step 6: Interactive Chat Loop

```python
def chat_pdf(vector_db, text_metadata):
    while True:
        user_query = input("You - ")
        if user_query.strip().lower() in {"exit", "quit"}:
            break  # give the user a way out of the loop
        results = search(user_query, vector_db, text_metadata)
        context_llm = [res["text"] for res in results]
        response = generate_answer(user_query, context_llm)
        print(response)
```

Now you can literally:

You - What is the main topic?
AI - ...
Bonus: Embedding Normalization Check

```python
norms = np.linalg.norm(embeddings_array, axis=1)
```

Why does this matter?
- If the vectors are unit-normalized, the dot product equals cosine similarity
- This improves consistency in search results
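If the norms are not all close to 1.0, you can normalize the array yourself. A minimal sketch with random stand-in embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
embeddings_array = rng.normal(size=(4, 8))  # stand-in for real model output

norms = np.linalg.norm(embeddings_array, axis=1)  # one L2 norm per vector

# Dividing each row by its norm makes every vector unit length, so a plain
# dot product afterwards is exactly cosine similarity.
normalized = embeddings_array / norms[:, np.newaxis]
```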
Limitations of This Approach
This implementation is intentionally simple, and that comes with trade-offs:
1. Slow search for large PDFs
- NumPy scans every vector
- No indexing, so search is O(n)
2. Not scalable
- Works fine for small documents
- Breaks down with large PDFs or multiple documents
3. No persistent storage
- Embeddings are regenerated on every run
- No caching or database
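Fixing the storage limitation doesn't require a database: NumPy can save the vectors and JSON can hold the metadata. A hedged sketch (the paths and fake vectors below are illustrative only):

```python
import json
import os
import tempfile

import numpy as np

cache_dir = tempfile.mkdtemp()
emb_path = os.path.join(cache_dir, "embeddings.npy")
meta_path = os.path.join(cache_dir, "metadata.json")

embeddings = np.arange(12, dtype=np.float32).reshape(3, 4)  # fake vectors
metadata = [{"text": f"chunk {n}", "page": 1} for n in range(3)]

if not os.path.exists(emb_path):       # only embed on the first run
    np.save(emb_path, embeddings)
    with open(meta_path, "w") as f:
        json.dump(metadata, f)

loaded_db = np.load(emb_path)          # later runs load from disk
with open(meta_path) as f:
    loaded_meta = json.load(f)
```

With this in place, re-running the app skips the expensive embedding step whenever the cache files already exist.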
4. Limited retrieval quality
- Pure similarity search
- No reranking, filtering, or hybrid search
5. Context limitation
- Only the TOP_K chunks are used
- May miss important information
What You Learned
- How RAG works under the hood
- How embeddings enable semantic search
- How to build vector search with NumPy
- How LLMs use retrieved context to answer questions
What's Next?
In Part 2, we'll upgrade this system by replacing the NumPy search with FAISS (Facebook AI Similarity Search).
This will give us:
- Faster retrieval
- Better scalability
- Efficient indexing
Project Repo
GitHub: https://github.com/SharathKurup/chatPDF/tree/numpy_vector
Final Thoughts
This is the most important step in understanding RAG systems:
Before using fancy tools like FAISS or vector DBs, you should understand what's happening underneath.
Once you get this, everything else becomes easy.
If you're building something similar or experimenting with local LLMs, I'd love to hear your thoughts.
Stay tuned for Part 2!