
Salma Aga Shaik

How I Learned AI by Building an Offline PDF Chatbot with Local LLMs

Hey everyone! I’m Shaik Salma Aga, and I love learning by building. Instead of just reading theory, I built something that helped me understand AI practically and also prepare better for my interviews.

Let me walk you through what I built, how it works, and how you can try it too.


Project Goal: Learning AI by Building, Not Just Reading

I didn’t just want to use AI tools. I wanted to build one from scratch and see what happens under the hood.

I was exploring concepts like embeddings, vector search, and local LLMs, but the theory alone wasn’t sticking. So I built this project, an Offline PDF Analyzer, to learn how documents are split, embedded, and searched, and how local models generate smart responses.

This project became my practical journey into AI, and now it helps others too, especially those preparing for interviews.


What This Project Does

  • Upload one or more PDFs through a simple web UI.
  • Ask your questions in simple English.
  • The system reads and understands the content, then gives you a relevant answer from the document.
  • Everything runs locally: no internet or API keys needed.
  • It can also count how many questions are in the PDF, which is useful for exam prep.

Example:

Upload a PDF on Machine Learning and ask: “What is the difference between supervised and unsupervised learning?”
You instantly get a clear, to-the-point answer pulled directly from the relevant section of the document.


How It Works

Below is the complete flow of how the Offline PDF Analyzer works behind the scenes (a minimal code sketch follows the list):

  1. PDF Upload: The user uploads one or more PDF files through the UI.
  2. Text Extraction: The app reads all pages using PyMuPDF and extracts clean text.
  3. Chunking: Long text is split into overlapping chunks using RecursiveCharacterTextSplitter to preserve context.
  4. Embeddings: Each chunk is converted into a vector (a list of numbers) using OllamaEmbeddings.
  5. FAISS Vector Search: When a question is asked, it is embedded the same way and the most similar chunks are retrieved with a fast cosine-similarity search.
  6. Answer Generation: The selected chunks are passed to a local LLM (like phi, mistral, or llama2) to generate the final answer.
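
Here’s a minimal sketch of that flow in Python. The file name, chunk sizes, and model choice are illustrative assumptions, and LangChain import paths vary slightly between versions:

```python
import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import FAISS

# 1-2. Extract text from every page of the uploaded PDF.
pdf = fitz.open("ml_notes.pdf")  # hypothetical file name
text = "".join(page.get_text() for page in pdf)

# 3. Split into overlapping chunks so context survives the cut points.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = splitter.create_documents([text])

# 4. Embed each chunk locally and index the vectors in FAISS.
index = FAISS.from_documents(docs, OllamaEmbeddings(model="phi"))

# 5. Embed the question the same way and fetch the most similar chunks.
question = "What is the difference between supervised and unsupervised learning?"
hits = index.similarity_search(question, k=3)

# 6. Let a local LLM answer from the retrieved context only.
context = "\n\n".join(doc.page_content for doc in hits)
llm = Ollama(model="phi")
print(llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}"))
```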

Choose Your Local AI Model

You can select models like phi, mistral, or llama2, all running locally on your laptop using Ollama for fast and efficient results.
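
In code, swapping models is just a string change. A hypothetical Streamlit snippet (each model must be pulled once with Ollama beforehand):

```python
import streamlit as st
from langchain_community.llms import Ollama

# Any model already pulled via Ollama (e.g. `ollama pull phi`) can be selected.
model_name = st.selectbox("Choose a local model", ["phi", "mistral", "llama2"])
llm = Ollama(model=model_name)
```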


System Design Diagram: How PDF Analyzer Works

(Diagram: PDF upload → text extraction → chunking → embeddings → FAISS retrieval → local LLM answer)


Tech Stack I Used

  • Streamlit: For building a user-friendly frontend with just a few lines of Python.
  • PyMuPDF (fitz): To extract text from all pages of uploaded PDFs.
  • LangChain: To handle end-to-end chaining from query to retrieval to LLM response.
  • RecursiveCharacterTextSplitter: Breaks the text into chunks with overlaps, so context is preserved.
  • Ollama: Runs local LLMs (phi, mistral, llama2) directly on your machine without internet.
  • FAISS: A super-fast vector search library to retrieve relevant chunks.
  • Python: For the backend logic, caching, state management, and pre-processing.

Challenges I Faced & How I Solved Them

1. Wrong Answers from Wrong Sections

In the beginning, it showed answers from the wrong part of the PDF, which didn’t match the question and made things confusing.
Fix: I adjusted the chunk overlap size and tagged each chunk with better metadata, such as page numbers and source file names.
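
One way to attach that metadata is to split the PDF page by page, so every chunk knows its file name and page number. A sketch with illustrative names:

```python
import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = []
pdf = fitz.open("ml_notes.pdf")  # hypothetical file name
for page_num, page in enumerate(pdf, start=1):
    # Splitting per page lets each chunk carry its source file and page number.
    docs += splitter.create_documents(
        [page.get_text()],
        metadatas=[{"source": "ml_notes.pdf", "page": page_num}],
    )
```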

2. Answers Coming from the Previous PDF

Even after uploading a different PDF, it still showed answers from the old one.
Fix: I added file hashing to detect newly uploaded PDFs. If the incoming file is different from the previous one, the system discards the old data and processes the new file from scratch.
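
A minimal sketch of that check, assuming Streamlit’s file uploader and a session-state key I’ve named pdf_hash:

```python
import hashlib

import streamlit as st

def file_hash(uploaded_file) -> str:
    # Hash the raw bytes, so a new upload is detected even if the name matches.
    return hashlib.md5(uploaded_file.getvalue()).hexdigest()

uploaded = st.file_uploader("Upload a PDF", type="pdf")
if uploaded is not None:
    h = file_hash(uploaded)
    if st.session_state.get("pdf_hash") != h:
        # New file detected: reset and process it from scratch.
        st.session_state["pdf_hash"] = h
        st.session_state.pop("index", None)  # discard the old vector data
        # ...re-extract, re-chunk, and re-embed the new file here
```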

3. Short Queries Gave Confusing Answers

If I typed "types?" or "examples?", the app didn’t understand what I meant.
Fix: I made a way to automatically turn short questions into full ones. For example, if someone types "types?", it changes to "What are the different types mentioned in the document?" so the model understands better.
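
A simplified version of that rewriting logic; the rewrite table and the two-word threshold are illustrative choices, not the exact rules from my app:

```python
SHORT_QUERY_REWRITES = {
    "types?": "What are the different types mentioned in the document?",
    "examples?": "What examples are given in the document?",
}

def expand_query(query: str) -> str:
    q = query.strip().lower()
    if len(q.split()) <= 2:  # treat one- or two-word queries as "short"
        default = f"What does the document say about {q.rstrip('?')}?"
        return SHORT_QUERY_REWRITES.get(q, default)
    return query

print(expand_query("types?"))  # -> What are the different types mentioned in the document?
```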

4. No Info on Where the Answer Came From

I wasn’t sure if the answer was right because it didn’t show where in the PDF it found the info.
Fix: Now it shows the PDF name and page number where the answer came from, and you can click to see more details if you want.
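
A sketch of how those sources can be rendered, assuming each retrieved chunk carries the source and page metadata added earlier:

```python
import streamlit as st

def show_answer_with_sources(answer, hits):
    # `answer` is the LLM response; `hits` are the retrieved chunks.
    st.write(answer)
    for doc in hits:
        src = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        with st.expander(f"Source: {src}, page {page}"):
            st.write(doc.page_content)  # click to see the supporting text
```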


Techniques I Used

  • @st.cache_data: To avoid reloading the same PDF again and again (see the sketch after this list).
  • File Hashing: So that the app resets only when a new PDF is uploaded.
  • Session State: Used in Streamlit to store user-uploaded files and questions.
  • Regex Matching: To support question formats like “How many questions are in this PDF?”
  • Prompt Templates: Help the model understand and answer better when the user's question is short or unclear.
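
Two of these in one minimal sketch: @st.cache_data wrapping text extraction, and a regex-based question counter (the numbering pattern is a heuristic I picked for illustration):

```python
import re

import fitz  # PyMuPDF
import streamlit as st

@st.cache_data
def extract_text(pdf_bytes: bytes) -> str:
    # Cached: re-running the app won't re-parse an already-seen PDF.
    with fitz.open(stream=pdf_bytes, filetype="pdf") as pdf:
        return "".join(page.get_text() for page in pdf)

QUESTION_LINE = re.compile(r"^\s*(?:Q?\d+[.)]|Question\s+\d+)", re.IGNORECASE | re.MULTILINE)

def count_questions(text: str) -> int:
    # Rough count of numbered questions like "1.", "Q2)", or "Question 3".
    return len(QUESTION_LINE.findall(text))
```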

Any Frontend?

Yes! I made a clean, user-friendly interface using Streamlit that makes it easy to upload PDFs and get answers quickly (see the skeleton sketched after this list).

  • Choose your preferred LLM (phi / mistral / llama2)
  • Upload one or more PDFs
  • Ask your question
  • See the answer + source (page number + filename)
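
A minimal skeleton of that flow, with illustrative widget labels:

```python
import streamlit as st

st.title("Offline PDF Analyzer")
model = st.selectbox("LLM", ["phi", "mistral", "llama2"])
files = st.file_uploader("Upload PDFs", type="pdf", accept_multiple_files=True)
question = st.text_input("Ask a question about your PDFs")

if files and question:
    # Hand off to the pipeline sketched earlier, then render answer + sources.
    st.info(f"Would query {model} over {len(files)} file(s): {question}")
```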

No delays, no registration: everything happens on your own system.


What’s Next?

Here’s what I want to add next:

  • PDF Summarizer: Get a quick summary of the whole PDF.
  • Export Chat History: Save your Q&A for later.
  • Find All Questions: List all questions found inside the PDF.

Tech Terms

  • Chunking: Breaking a big document into small, readable parts.
  • Embeddings: Turning text into numbers so that the model understands meaning.
  • FAISS: Finds the best match for your question from the chunks.
  • Local LLMs: Small AI models running on your laptop (no internet needed).
  • LangChain: Connects everything (PDFs, questions, answers) in one neat pipeline.

Interview Questions You Can Expect

  1. How does chunk overlap affect retrieval quality?
  2. What’s the role of FAISS in a RAG pipeline?
  3. Why are prompt templates useful in real-world applications?
  4. How do you make vector indexes update-safe when files change?
  5. What are the trade-offs of using local LLMs vs cloud APIs?

Final Thoughts

This project started as a way to learn AI deeply by building something useful. It taught me how to use embeddings, vector search, local LLMs, and chaining tools, all while helping me with interview prep.

If you want to learn by doing: start small, build real, and break things.

Let’s keep learning. Let’s keep building.

✍️ Shaik Salma Aga

🔗 GitHub Repo
