
Sharath Kurup

Understanding RAG by Building a ChatPDF App with NumPy (Part 1)

🧠 Building a Chat with PDF App (From Scratch using NumPy) – Part 1

Turning a simple PDF into a conversational AI system using local LLMs πŸš€


πŸ“Œ Introduction

Have you ever wanted to chat with your PDF documents like you chat with ChatGPT?

In this series, I’ll walk you through building a ChatPDF application from scratch, starting from the absolute basics and gradually improving it into a production-ready system.

πŸ‘‰ In this first part, we’ll build a naive RAG (Retrieval-Augmented Generation) system using only NumPy β€” no FAISS, no vector databases, just pure fundamentals.


🎯 What We’ll Build

By the end of this article, you'll have a system that:

  • 📄 Reads a PDF
  • ✂️ Splits it into meaningful chunks
  • 🔢 Converts text into embeddings using a local model
  • 🔍 Searches for relevant content using vector similarity
  • 💬 Generates answers using an LLM

βš™οΈ Tech Stack

  • pdfplumber β†’ Extract text from PDFs
  • numpy β†’ Perform vector similarity search
  • ollama β†’ Run local embedding + LLM models

🧩 How It Works (High Level)

Our pipeline looks like this:

PDF β†’ Text β†’ Chunks β†’ Embeddings β†’ Similarity Search β†’ LLM β†’ Answer

πŸ“₯ Step 1: Reading the PDF

We start by extracting text page by page:

def readpdf():
    all_texts = []
    with pdfplumber.open(PDF_PATH) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            if not text.strip():
                continue
            all_texts.append((i + 1, text))
    return all_texts

🧠 What’s happening?

  • Reads each page
  • Skips empty pages
  • Stores (page_number, text)

βœ‚οΈ Step 2: Chunking the Text

Long passages don't embed well and can exceed the LLM's context window, so we split the text into overlapping chunks:

def generate_chunks(text, page_num):
    chunks = []
    i = 0
    while i < len(text):
        end = min(i + CHUNK_SIZE, len(text))
        chunk = text[i:end]

        # Avoid cutting a word in half: back up to the last space
        if end < len(text):
            last_space = chunk.rfind(" ")
            if last_space != -1:
                end = i + last_space
                chunk = text[i:end]

        chunks.append({"text": chunk.strip(), "page": page_num})

        if end >= len(text):
            break
        # Step forward, keeping an overlap; max() guards against an infinite loop
        i = max(end - OVERLAP_SIZE, i + 1)

    return chunks

🧠 Why overlap?

  • Prevents context loss between chunks
  • Helps LLM understand continuity
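To see the overlap concretely, here's a stripped-down version of the chunking loop on a tiny string (CHUNK_SIZE and OVERLAP_SIZE are shrunk for illustration, and word-boundary handling is omitted):

```python
CHUNK_SIZE = 10
OVERLAP_SIZE = 3

def overlapping_chunks(text):
    # Step forward by CHUNK_SIZE - OVERLAP_SIZE so each chunk
    # repeats the last OVERLAP_SIZE characters of the previous one
    chunks, i = [], 0
    while i < len(text):
        chunks.append(text[i:i + CHUNK_SIZE])
        if i + CHUNK_SIZE >= len(text):
            break
        i += CHUNK_SIZE - OVERLAP_SIZE
    return chunks

print(overlapping_chunks("abcdefghijklmnopqrst"))
# ['abcdefghij', 'hijklmnopq', 'opqrst']
```

Notice how each chunk starts with the last three characters of the previous one — that shared slice is what preserves continuity across chunk boundaries.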

πŸ”’ Step 3: Generating Embeddings

We convert text into vectors using Ollama:

def generate_embeddings_batch(texts):
    all_embeddings = []
    for i in range(0, len(texts), BATCH_SIZE):
        batch_texts = texts[i:i+BATCH_SIZE]
        response = ollama.embed(model=EMBED_MODEL, input=batch_texts)
        all_embeddings.extend(response["embeddings"])
    return all_embeddings

🧠 Why batching?

  • Faster processing
  • Efficient use of resources
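The slicing pattern in the loop above is easy to see on a toy list (BATCH_SIZE and the chunk strings here are just for illustration):

```python
BATCH_SIZE = 3
texts = [f"chunk {i}" for i in range(7)]

# Same range/slice pattern as generate_embeddings_batch:
# step through the list BATCH_SIZE items at a time
batches = [texts[i:i + BATCH_SIZE] for i in range(0, len(texts), BATCH_SIZE)]

print([len(b) for b in batches])  # [3, 3, 1]
```

The final batch is simply shorter — Python slicing never goes out of bounds, so no special-casing is needed.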

πŸ“ Step 4: Similarity Search (Core Logic)

Here’s where NumPy shines:

similarities = np.dot(vector_db, query_vector)
top_indices = np.argsort(similarities)[-TOP_K:][::-1]

🧠 What’s happening?

  • We compute dot product similarity
  • Higher score = more relevant chunk
  • Select top K results

πŸ‘‰ This is essentially a manual vector database using NumPy


πŸ’¬ Step 5: Generate Answer using LLM

We pass retrieved chunks as context:

def generate_answer(query, chunks):
    context_chunks = "\n\n".join(chunks)
    prompt = f"""
Context:
{context_chunks}

Question:
{query}

Answer:
"""
    response = ollama.generate(model=THINKING_MODEL, prompt=prompt)
    return response["response"]

🧠 Key Idea

We’re doing RAG (Retrieval-Augmented Generation):

  • Retrieval β†’ relevant chunks
  • Generation β†’ LLM response

πŸ” Step 6: Interactive Chat Loop

def chat_pdf(vector_db, text_metadata):
    while True:
        user_query = input("You - ")
        if user_query.strip().lower() in ("exit", "quit"):
            break

        results = search(user_query, vector_db, text_metadata)

        context_llm = [res["text"] for res in results]
        response = generate_answer(user_query, context_llm)

        print(response)

Now you can chat with your PDF:

You - What is the main topic?
AI  - ...

πŸ” Bonus: Embedding Normalization Check

norms = np.linalg.norm(embeddings_array, axis=1)

🧠 Why does this matter?

  • If the vectors are unit-normalized → dot product equals cosine similarity
  • Keeps similarity scores consistent and comparable across chunks
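A quick sketch of normalizing embeddings so dot products become cosine similarities (the embedding values are made up):

```python
import numpy as np

# Toy embeddings: two vectors of different lengths
embeddings_array = np.array([[3.0, 4.0], [1.0, 0.0]])

# L2 norm of each row, then divide each row by its own norm
norms = np.linalg.norm(embeddings_array, axis=1)
normalized = embeddings_array / norms[:, np.newaxis]

print(np.linalg.norm(normalized, axis=1))  # [1. 1.] — every row is unit length
```

Once every vector has length 1, the dot product of any two of them is exactly their cosine similarity, so the NumPy search from Step 4 ranks by angle rather than by magnitude.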

🚨 Limitations of This Approach

This implementation is intentionally simple β€” and that comes with trade-offs:

⚠️ 1. Slow Search for Large PDFs

  • NumPy scans every vector
  • No indexing β†’ O(n) search

⚠️ 2. Not Scalable

  • Works fine for small docs
  • Breaks down with:

    • Large PDFs
    • Multiple documents

⚠️ 3. No Persistent Storage

  • Embeddings are generated every run
  • No caching or database

⚠️ 4. Limited Retrieval Quality

  • Pure similarity search
  • No reranking, filtering, or hybrid search

⚠️ 5. Context Limitation

  • Only TOP_K chunks used
  • May miss important information

🧠 What You Learned

  • How RAG works under the hood
  • How embeddings enable semantic search
  • How to build a vector search using NumPy
  • How LLMs use context to answer questions

πŸ”œ What’s Next?

In Part 2, we’ll upgrade this system by replacing NumPy search with:

➑️ FAISS (Facebook AI Similarity Search)

This will give us:

  • ⚑ Faster retrieval
  • πŸ“ˆ Better scalability
  • 🧠 Efficient indexing

πŸ“‚ Project Repo

πŸ‘‰ GitHub: https://github.com/SharathKurup/chatPDF/tree/numpy_vector


πŸ’¬ Final Thoughts

This is the most important step in understanding RAG systems:

Before using fancy tools like FAISS or vector DBs,
you should understand what’s happening underneath.

Once you get this, everything else becomes easy.


If you're building something similar or experimenting with local LLMs, I’d love to hear your thoughts πŸ‘‡


Stay tuned for Part 2 πŸš€
