Building a Chat with PDF App (From Scratch Using NumPy) - Part 1
Turning a simple PDF into a conversational AI system using local LLMs
Introduction
Have you ever wanted to chat with your PDF documents the way you chat with ChatGPT?
In this series, I'll walk you through building a ChatPDF application from scratch, starting from the absolute basics and gradually improving it into a production-ready system.
In this first part, we'll build a naive RAG (Retrieval-Augmented Generation) system using only NumPy: no FAISS, no vector databases, just pure fundamentals.
What We'll Build
By the end of this article, you'll have a system that:
- Reads a PDF
- Splits it into meaningful chunks
- Converts text into embeddings using a local model
- Searches for relevant content using vector similarity
- Generates answers using an LLM
Tech Stack
- pdfplumber: extract text from PDFs
- numpy: perform vector similarity search
- ollama: run local embedding and LLM models
How It Works (High Level)
Our pipeline looks like this:
PDF → Text → Chunks → Embeddings → Similarity Search → LLM → Answer
Step 1: Reading the PDF
We start by extracting text page by page:

```python
import pdfplumber

PDF_PATH = "document.pdf"  # path to your PDF

def readpdf():
    all_texts = []
    with pdfplumber.open(PDF_PATH) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            if not text.strip():
                continue  # skip pages with no extractable text
            all_texts.append((i + 1, text))  # store (page_number, text)
    return all_texts
```
What's happening?
- Reads each page
- Skips empty pages
- Stores (page_number, text) tuples
Step 2: Chunking the Text
Large blocks of text don't work well for embeddings or LLMs, so we split them:

```python
CHUNK_SIZE = 500    # characters per chunk
OVERLAP_SIZE = 50   # characters shared between consecutive chunks

def generate_chunks(text, page_num):
    chunks = []
    i = 0
    while i < len(text):
        end = min(i + CHUNK_SIZE, len(text))
        chunk = text[i:end]
        if end < len(text):
            # Avoid cutting a word in half: back up to the last space
            last_space = chunk.rfind(" ")
            if last_space != -1:
                end = i + last_space
                chunk = text[i:end]
        chunks.append({"text": chunk.strip(), "page": page_num})
        if end >= len(text):
            break  # final chunk reached; stepping back would loop forever
        i = end - OVERLAP_SIZE
    return chunks
```
Why overlap?
- Prevents context loss between chunks
- Helps the LLM understand continuity
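To see the overlap concretely, here is a simplified, self-contained sketch of the chunking idea (the word-boundary logic is dropped, and the toy `CHUNK_SIZE`/`OVERLAP_SIZE` values are illustrative, not the real config):

```python
CHUNK_SIZE = 20
OVERLAP_SIZE = 5

def generate_chunks(text, page_num):
    chunks = []
    i = 0
    while i < len(text):
        end = min(i + CHUNK_SIZE, len(text))
        chunks.append({"text": text[i:end], "page": page_num})
        if end == len(text):
            break  # final chunk reached; stepping back would loop forever
        i = end - OVERLAP_SIZE  # step back so neighbours share 5 characters
    return chunks

text = "".join(str(n % 10) for n in range(50))
chunks = generate_chunks(text, page_num=1)
# The tail of each chunk reappears at the head of the next one.
```

The tail of chunk 0 (`text[15:20]`) is exactly the head of chunk 1, so a sentence that straddles a boundary is fully visible in at least one chunk.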
Step 3: Generating Embeddings
We convert text into vectors using Ollama:

```python
import ollama

EMBED_MODEL = "nomic-embed-text"  # any local embedding model works
BATCH_SIZE = 32

def generate_embeddings_batch(texts):
    all_embeddings = []
    for i in range(0, len(texts), BATCH_SIZE):
        batch_texts = texts[i:i + BATCH_SIZE]
        response = ollama.embed(model=EMBED_MODEL, input=batch_texts)
        all_embeddings.extend(response["embeddings"])
    return all_embeddings
```
Why batching?
- Faster processing
- More efficient use of resources
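One detail worth noting: Python slicing already handles the final partial batch, so the loop above needs no special case. A tiny sketch with toy values:

```python
BATCH_SIZE = 3
texts = [f"chunk {n}" for n in range(8)]  # 8 items, not a multiple of 3

batches = [texts[i:i + BATCH_SIZE] for i in range(0, len(texts), BATCH_SIZE)]
# Slicing past the end of a list simply returns the shorter tail, so the
# last batch has 2 items and nothing is dropped or padded.
```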
Step 4: Similarity Search (Core Logic)
Here's where NumPy shines:

```python
similarities = np.dot(vector_db, query_vector)
top_indices = np.argsort(similarities)[-TOP_K:][::-1]
```

What's happening?
- We compute dot-product similarity between the query and every chunk
- Higher score = more relevant chunk
- We select the top K results

This is essentially a manual vector database built with NumPy.
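Here is a self-contained run of that search on toy data (the random vectors and the model-free setup are purely for illustration):

```python
import numpy as np

# Toy "vector database": five chunk embeddings of dimension four, unit-
# normalized so the dot product equals cosine similarity.
rng = np.random.default_rng(0)
vector_db = rng.normal(size=(5, 4))
vector_db /= np.linalg.norm(vector_db, axis=1, keepdims=True)

# A query identical to chunk 2 should rank chunk 2 first.
query_vector = vector_db[2]

TOP_K = 3
similarities = np.dot(vector_db, query_vector)         # one score per chunk
top_indices = np.argsort(similarities)[-TOP_K:][::-1]  # best scores first
```

`np.argsort` sorts ascending, so we take the last `TOP_K` indices and reverse them to get the highest-scoring chunks first.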
Step 5: Generate an Answer Using the LLM
We pass the retrieved chunks as context:

```python
THINKING_MODEL = "llama3"  # any local chat model works

def generate_answer(query, chunks):
    context_chunks = "\n\n".join(chunks)
    prompt = f"""
Context:
{context_chunks}

Question:
{query}

Answer:
"""
    response = ollama.generate(model=THINKING_MODEL, prompt=prompt)
    return response["response"]
```
Key Idea
We're doing RAG (Retrieval-Augmented Generation):
- Retrieval → find the relevant chunks
- Generation → the LLM produces the response
Step 6: Interactive Chat Loop

```python
def chat_pdf(vector_db, text_metadata):
    while True:
        user_query = input("You - ")
        if user_query.strip().lower() in {"exit", "quit"}:
            break  # give the user a way out of the loop
        results = search(user_query, vector_db, text_metadata)
        context_llm = [res["text"] for res in results]
        response = generate_answer(user_query, context_llm)
        print(response)
```

Now you can literally:

You - What is the main topic?
AI - ...
Bonus: Embedding Normalization Check

```python
norms = np.linalg.norm(embeddings_array, axis=1)
```

Why does this matter?
- If the vectors are unit-normalized, the dot product equals cosine similarity
- This improves consistency in search results
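If the norms are not all close to 1.0, you can normalize the array yourself. A minimal sketch with random stand-in embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)
embeddings_array = rng.normal(size=(4, 8))  # stand-in for real model output

norms = np.linalg.norm(embeddings_array, axis=1)  # one L2 norm per vector

# Dividing each row by its norm makes every vector unit length, so a plain
# dot product afterwards is exactly cosine similarity.
normalized = embeddings_array / norms[:, np.newaxis]
```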
Limitations of This Approach
This implementation is intentionally simple, and that comes with trade-offs:
1. Slow search for large PDFs
- NumPy scans every vector
- No indexing, so search is O(n)
2. Not scalable
- Works fine for small documents
- Breaks down with large PDFs or multiple documents
3. No persistent storage
- Embeddings are regenerated on every run
- No caching or database
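Fixing the storage limitation doesn't require a database: NumPy can save the vectors and JSON can hold the metadata. A hedged sketch (the paths and fake vectors below are illustrative only):

```python
import json
import os
import tempfile

import numpy as np

cache_dir = tempfile.mkdtemp()
emb_path = os.path.join(cache_dir, "embeddings.npy")
meta_path = os.path.join(cache_dir, "metadata.json")

embeddings = np.arange(12, dtype=np.float32).reshape(3, 4)  # fake vectors
metadata = [{"text": f"chunk {n}", "page": 1} for n in range(3)]

if not os.path.exists(emb_path):       # only embed on the first run
    np.save(emb_path, embeddings)
    with open(meta_path, "w") as f:
        json.dump(metadata, f)

loaded_db = np.load(emb_path)          # later runs load from disk
with open(meta_path) as f:
    loaded_meta = json.load(f)
```

With this in place, re-running the app skips the expensive embedding step whenever the cache files already exist.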
4. Limited retrieval quality
- Pure similarity search
- No reranking, filtering, or hybrid search
5. Context limitation
- Only the TOP_K chunks are used
- May miss important information
What You Learned
- How RAG works under the hood
- How embeddings enable semantic search
- How to build vector search with NumPy
- How LLMs use retrieved context to answer questions
What's Next?
In Part 2, we'll upgrade this system by replacing the NumPy search with FAISS (Facebook AI Similarity Search).
This will give us:
- Faster retrieval
- Better scalability
- Efficient indexing
Project Repo
GitHub: https://github.com/SharathKurup/chatPDF/tree/numpy_vector
Final Thoughts
This is the most important step in understanding RAG systems:
Before using fancy tools like FAISS or vector DBs, you should understand what's happening underneath.
Once you get this, everything else becomes easy.
If you're building something similar or experimenting with local LLMs, I'd love to hear your thoughts.
Stay tuned for Part 2!