A while back, I wrote a beginner-to-expert guide on Retrieval-Augmented Generation (RAG). That article was all theory: how RAG works, the difference between sparse and dense embeddings, and why it’s powerful.
This time, I wanted to get my hands dirty. I wanted to build something real.
So I built a working RAG chatbot. Completely offline. Locally.
Let me walk you through the full journey: what I built, how it works, and what went wrong (and how I fixed it).
Why I Ran It Locally
This wasn’t about saving money or staying private. It was about learning: raw, hands-on, and deep.
I didn’t want to just connect APIs and feel like a builder. I wanted to:
• Understand how text becomes vectors
• Debug retrieval when it breaks
• Run a model myself and see how it responds
I wanted to learn the hard way, and running everything locally was the best way to make sure I did.
My Project: PDF Q&A Chatbot (All Offline)
I had one clear goal: ask questions about a PDF and get meaningful answers, with no internet required.
I used a document called Evolution_of_AI.pdf and asked questions like:
"What are the phases in AI development?"
The chatbot searched the PDF, found the right section, fed it to a local LLM, and gave me a perfect answer.
All offline.
System Design Diagram: How Offline RAG Chatbot Works
Here’s the process (a code sketch follows the list):
- User sends a question to the RAG chatbot.
- The chatbot uses PyPDFLoader to load the PDF.
- It splits the text using RecursiveCharacterTextSplitter.
- Text chunks are converted to vectors via HuggingFace Embeddings.
- Vectors are stored and retrieved using FAISS.
- The top relevant chunks are passed to a local LLM via Ollama.
- The final answer is shown to the user.
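Putting those steps together, here’s a minimal sketch of the whole pipeline. The import paths assume the langchain-community package layout, and the chunk sizes, the all-MiniLM-L6-v2 embedding model, and k=3 retrieval are my illustrative choices, not necessarily what the repo uses:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

# 1-2. Load the PDF and split it into overlapping chunks
docs = PyPDFLoader("Evolution_of_AI.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# 3-4. Embed the chunks and store the vectors in FAISS
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)

# 5-6. Wire the retriever and the local LLM (served by Ollama) into a QA chain
llm = Ollama(model="phi")
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
)

# 7. Ask a question, fully offline
result = qa.invoke({"query": "What are the phases in AI development?"})
print(result["result"])
```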
Tech Stack I Used
PyPDFLoader: Used for extracting raw text from the PDF so the bot can "read" it.
RecursiveCharacterTextSplitter: It ensures that even long paragraphs are broken into manageable, overlapping pieces that preserve meaning.
HuggingFaceEmbeddings: Converts those text chunks into number lists (vectors) that reflect context, not just words.
FAISS: A lightning-fast search tool that finds which vectors (chunks) are closest to the question vector.
Ollama: Runs lightweight models like phi on your machine, no cloud needed.
LangChain: The backbone. It handles all connections from question to document to model and back.
The Hidden Struggles and My Fixes
Empty Answers or Garbage Output
My initial PDF had just one sentence, which wasn’t enough for meaningful retrieval.
Fix: I created a structured PDF (Evolution_of_AI.pdf) with real content.
Wrong Chunks Being Retrieved
I asked about AI phases but got results about NLP techniques.
Fix: Added more chunk overlap, changed the embedding model, and tagged the chunks with extra metadata (see the sketch below).
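Here’s roughly what that fix looked like; the overlap value and the topic tag are illustrative, not the exact ones I used:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# More overlap means a question is less likely to fall between two chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=150)
chunks = splitter.split_documents(docs)

# Tag chunks so a wrong retrieval can be traced back to its source
for i, chunk in enumerate(chunks):
    chunk.metadata["chunk_id"] = i
    chunk.metadata["topic"] = "ai_phases"  # hypothetical label
```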
Deprecation Warnings in LangChain
The .run() method stopped working.
Fix: Switched to the .invoke() method per the latest LangChain docs.
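In code, the change is small. A sketch assuming the RetrievalQA chain from earlier is named qa:

```python
# Before (deprecated in recent LangChain releases):
# answer = qa.run("What are the phases in AI development?")

# After: .invoke() takes a dict and returns a dict
result = qa.invoke({"query": "What are the phases in AI development?"})
print(result["result"])
```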
Ollama Crashes with Heavy Models
Running models like Mistral overloaded my RAM.
Fix: Downgraded to phi, a lighter model that worked well locally.
No Change After Updating PDF
I changed the PDF but still got answers from the old one.
Fix: I cleared the FAISS index and re-embedded everything.
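Here’s roughly what that reset looks like, assuming the index was saved to a local folder (the folder name is hypothetical):

```python
import shutil

# Delete the stale on-disk index so nothing old gets loaded
shutil.rmtree("faiss_index", ignore_errors=True)

# Re-embed the chunks from the updated PDF and persist a fresh index
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("faiss_index")
```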
Short or Vague Queries Confused the Bot
“Phases?” returned irrelevant content.
Fix: I used prompt templates to expand such queries into full sentences automatically.
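Roughly how that expansion works; the template wording here is my own, not necessarily what the repo uses:

```python
from langchain.prompts import PromptTemplate

# Turn a terse query into a full question before retrieval
expand = PromptTemplate.from_template(
    "Rewrite this short query as a complete question about the document: {query}"
)
full_query = llm.invoke(expand.format(query="Phases?"))
result = qa.invoke({"query": full_query})
print(result["result"])
```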
Technical Bits Explained Simply
Chunking
Breaks large documents into overlapping sections so important parts aren’t lost during processing.
Embeddings
Turns sentences into numbers that represent meaning. That way, "vacation" and "holiday" look nearly the same to the machine.
Cosine Similarity
A math trick to check how similar two vectors (questions and chunks) are. Smaller angle = better match.
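Here’s the idea in a few lines of NumPy. The vectors are toy 3-dimensional values; real embeddings have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

question_vec = np.array([0.9, 0.1, 0.3])
chunk_vec    = np.array([0.8, 0.2, 0.4])
print(cosine_similarity(question_vec, chunk_vec))  # ~0.98: small angle, strong match
```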
FAISS
A library that finds which chunks are most similar to the question, extremely fast.
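Under the hood it looks something like this; LangChain’s FAISS wrapper hides these calls, and the vectors below are random placeholders:

```python
import faiss
import numpy as np

dim = 384  # e.g. the output size of all-MiniLM-L6-v2
index = faiss.IndexFlatL2(dim)  # exact nearest-neighbour search on L2 distance

# Pretend these are 100 embedded chunks
index.add(np.random.rand(100, dim).astype("float32"))

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 3)  # the 3 closest chunks
print(ids[0])  # row indices of the best-matching chunks
```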
LangChain
LangChain simplifies the complex plumbing. It:
• Takes your question
• Converts it to a vector
• Finds the most relevant document chunks via FAISS
• Sends it all to the LLM
• Collects and returns the final answer
All without you needing to manually stitch the logic together.
Evaluation Techniques I Used
- Manually compared answers with the PDF (see the harness sketch after this list)
- Asked intentionally vague or tricky questions
- Checked that the answers didn’t hallucinate
- Made sure important info wasn’t skipped (avoided the “lost in the middle” issue)
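A small harness that automates the repetitive part of this; the question list is illustrative, not my exact test set:

```python
# Run a batch of probe questions through the chain and eyeball the answers
test_questions = [
    "What are the phases in AI development?",
    "Phases?",                       # intentionally vague
    "Who invented the telephone?",   # not in the PDF: should not hallucinate
]

for q in test_questions:
    result = qa.invoke({"query": q})
    print(f"Q: {q}\nA: {result['result']}\n")  # compare against the PDF by hand
```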
Any Frontend?
Not yet, but I’m planning:
- A Streamlit-based UI for chatting with the bot
- A FastAPI backend to make it modular
- A desktop wrapper so anyone can use it easily
What’s Next?
- Multi-PDF support
- Chunk summaries for quick previews
- Using ragas for automated evaluation
- Feedback-based learning loop
Interview Questions
- How does chunk overlap affect retrieval quality in RAG systems?
- What are the benefits of local embeddings over API-based ones?
- How do you debug wrong or missing retrievals in vector search?
- What’s the trade-off between dense and sparse embeddings?
- How do you handle stale or outdated indexes in a vector DB like FAISS?
Final Thoughts
Building this RAG chatbot wasn’t just about code; it was about transforming theory into practice. Every bug I fixed and every wrong answer I debugged helped me grow.
If you’ve read about RAG and want to really learn it, build something.
Let’s keep learning, building, and breaking things together.
Shaik Salma Aga
🔗 GitHub: https://github.com/ShaikSalmaAga/rag-chatbot