Lingdas1

Posted on May 23 • Originally published at github.com

Local RAG: Chat With Your Documents (Open Source, Private)

#rag #llm #opensource #tutorial

Upload PDFs, code, research papers, or entire books — then ask your local LLM questions about them. No data ever leaves your machine.

What Is RAG? (Plain English)

RAG (Retrieval-Augmented Generation) means your LLM can look up information from your own documents before answering.

Think of it like this:

Normal LLM: Has a great memory, but only knows what it learned during training
RAG: The LLM gets a "cheat sheet" — your documents — that it can read before answering

💡 Analogy: Without RAG, the LLM is like a student taking a closed-book exam. With RAG, they get an open-book exam — and you get to write the book.
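To make the "look up, then answer" idea concrete, here is a minimal sketch of the retrieval half. It is an illustration under loud assumptions: real RAG tools rank passages with learned embedding vectors, while this toy version uses plain word counts and cosine similarity so it runs with nothing but the standard library. The function names and the sample passages are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q_vec = Counter(tokenize(question))
    scored = [(cosine_similarity(q_vec, Counter(tokenize(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


# Hypothetical passages standing in for chunks of your own documents.
passages = [
    "The contract may be terminated with 30 days written notice.",
    "Payment is due within 14 days of the invoice date.",
    "The office kitchen is cleaned every Friday.",
]
top = retrieve("How can the contract be terminated?", passages, k=1)
print(top)
```

The passage that shares the most wording with the question ranks first. An embedding model does the same ranking job but can also match passages that use different words for the same idea.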

Real-World Uses

Use Case	What You Upload	What You Can Ask
Research	PDF papers, articles	"What were the key findings in this study?"
Studying	Textbooks, lecture notes	"Explain chapter 7 in simpler terms"
Work	Company docs, reports	"What's our Q3 strategy?"
Legal	Contracts, agreements	"What are the termination clauses?"
Coding	Codebase, documentation	"How does the auth module work?"
Personal	Journals, notes, books	"What did I write about in March?"

Option A: Built-in RAG in Open WebUI (Simplest)

If you already have Open WebUI installed, RAG is built-in.

How to Use It

Open http://localhost:3000 in your browser
Click the paperclip icon next to the chat input
Upload a PDF, .txt, .docx, or .md file
Wait for the "embedding" process to finish (usually 10-30 seconds)
Ask questions about the document

That's it. No configuration needed.

Pro Tips

Multiple documents: You can upload several files at once. Open WebUI indexes them all.
Model choice: Use qwen3.6:27b or deepseek-r1:14b for best RAG quality — they have larger context windows.
Document size: Open WebUI handles documents up to hundreds of pages. For very large documents, consider splitting them into smaller files (for example, one per chapter) before uploading.

Option B: AnythingLLM (More Powerful)

AnythingLLM is a dedicated RAG application with more features than Open WebUI's built-in system.

Installation

With Docker (Recommended):

docker run -d \
  -p 3001:3001 \
  -v anythingllm:/app/server/storage \
  -e STORAGE_DIR=/app/server/storage \
  --name anythingllm \
  --restart always \
  mintplexlabs/anythingllm:latest

Then open http://localhost:3001.

Without Docker:

Download from anythingllm.com and run the installer for your OS.

Configuration

Open AnythingLLM at http://localhost:3001
Create an admin account (local only — no data leaves your machine)
Go to Settings → LLM Provider
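Select Ollama from the dropdown (if AnythingLLM runs in Docker, the Ollama URL is typically http://host.docker.internal:11434 rather than localhost)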
Choose your model (e.g., qwen2.5:7b or deepseek-r1:14b)
Click Save

Now set up embeddings:

Go to Settings → Embedding Provider
Select Ollama
Choose an embedding model that is available in your Ollama install (a dedicated embedding model such as nomic-embed-text is a small download, a few hundred MB)
Click Save
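If a model does not show up in the dropdowns, it can help to check what your Ollama server actually has installed. The sketch below assumes Ollama is running on its default address (http://localhost:11434) and uses its /api/tags endpoint, which lists local models. The helper names are hypothetical, and the model names in the usage note are only examples.

```python
import json
import urllib.request


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from the JSON returned by Ollama's /api/tags endpoint."""
    return [model["name"] for model in payload.get("models", [])]


def list_ollama_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return parse_model_names(json.load(response))


def normalize(name: str) -> str:
    """Add the implicit ':latest' tag when a model name has no tag."""
    return name if ":" in name else f"{name}:latest"


def missing_models(wanted: list[str], available: list[str]) -> list[str]:
    """Return the wanted models that are not in the available list."""
    available_set = {normalize(name) for name in available}
    return [name for name in wanted if normalize(name) not in available_set]
```

With Ollama running, `missing_models(["qwen2.5:7b", "nomic-embed-text"], list_ollama_models())` returns the names you still need to fetch with `ollama pull`.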

Uploading Documents

Click "New Workspace" and give it a name (e.g., "Research Papers")
Click the upload icon (or drag and drop files)
Supported formats: PDF, DOCX, TXT, MD, CSV, JSON, code files
Click "Save and Embed"
Wait for indexing (progress shows in the UI)

Chatting With Your Documents

Once embedded, just type your question:

"What are the three main conclusions from these papers?"

AnythingLLM searches your documents for relevant passages and feeds them to the LLM along with your question. The answer is then grounded in those passages and comes with sources, rather than relying only on what the model memorized during training.

🔥 Pro tip: AnythingLLM shows you which document each answer came from. Hover over the citation to see the exact source passage.
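The "feed passages to the LLM along with your question" step is mostly string assembly. Here is a minimal sketch of what such a prompt could look like, with numbered passages so the model can cite them. This is not AnythingLLM's actual prompt template; the wording, the function name, and the sample passages are all assumptions for illustration.

```python
def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Combine retrieved (source, text) passages and the question into one prompt."""
    context_lines = [
        f"[{i}] ({source}) {text}"
        for i, (source, text) in enumerate(passages, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered context below. "
        "Cite the numbers of the passages you used. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved passages: (file name, passage text).
retrieved = [
    ("paper_a.pdf", "Sample size was 120 participants."),
    ("paper_b.pdf", "The effect persisted after six months."),
]
prompt = build_prompt("How many participants were studied?", retrieved)
print(prompt)
```

Keeping the source name next to each passage is what makes citations possible: the application can map a cited number back to the file it came from.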

Option C: Manual RAG with LangChain (For Developers)

For maximum control, build RAG with Python and LangChain. This is particularly useful if you want to automate document processing.

Setup

pip install langchain langchain-community langchain-text-splitters langchain-ollama chromadb

Basic RAG Script

from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
# RetrievalQA is a legacy chain; on newer LangChain releases this import may
# need to come from the separate langchain-classic package instead.
from langchain.chains import RetrievalQA

# 1. Load your documents
loader = DirectoryLoader("./my-docs/", glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")

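# 3. Create embeddings and vector store
# Use a dedicated embedding model rather than the chat model
# (assumes you have run: ollama pull nomic-embed-text)
embeddings = OllamaEmbeddings(model="nomic-embed-text")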
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 4. Create RAG chain
llm = ChatOllama(model="qwen2.5:7b", temperature=0.3)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

# 5. Ask questions
while True:
    question = input("\nAsk a question (or 'quit'): ")
    if question.lower() == 'quit':
        break
    answer = qa_chain.invoke({"query": question})
    print(f"\nAnswer: {answer['result']}")

Run It

# Put your documents in a folder called "my-docs/"
mkdir -p my-docs
# Copy your .txt files there (the script above only loads *.txt;
# PDFs need a PDF loader instead of TextLoader)

# Run the script
python rag.py
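Before running the script, it can be worth checking which files the loader's glob pattern will actually pick up. The sketch below is a hypothetical helper (not part of LangChain) that applies the same "**/*.txt" pattern with pathlib and also counts every file extension in the folder, so you can spot PDFs or other files that would be silently skipped. It should match the loader's selection closely, though the loader may apply extra rules such as skipping hidden files.

```python
from collections import Counter
from pathlib import Path


def preview_docs(folder: str, pattern: str = "**/*.txt") -> tuple[list[Path], Counter]:
    """Return files matching the pattern and a count of all file extensions found."""
    root = Path(folder)
    matched = sorted(p for p in root.glob(pattern) if p.is_file())
    all_extensions = Counter(
        p.suffix.lower() or "(none)" for p in root.rglob("*") if p.is_file()
    )
    return matched, all_extensions
```

For example, `matched, extensions = preview_docs("./my-docs/")` gives you the list of files that match, and `extensions` shows at a glance whether the folder contains file types the pattern ignores.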

Choosing the Right RAG Setup

Factor	Open WebUI RAG	AnythingLLM	LangChain
Setup time	1 click	5 minutes	30 minutes
Features	Basic	Advanced	Full control
Document types	PDF, DOCX, TXT, MD	PDF, DOCX, TXT, MD, CSV, code	Anything with a loader
Multi-document	✅	✅	✅
Citations	❌	✅	✅ (manual)
Customization	Low	Medium	High
Best for	Quick personal use	Serious knowledge work	Automation & production

My recommendation:

Start with Open WebUI's built-in RAG (fastest)
Move to AnythingLLM when you need citations and multiple workspaces
Use LangChain when you need to automate document processing

Best Practices for Better RAG Results

1. Use the Right Model

RAG works best with models that have large context windows:

Model	Context	Why It's Good for RAG
Qwen 3.6:27B	262K	Can process entire chapters at once
Qwen 2.5:14B	128K	Excellent balance of quality and speed
DeepSeek-R1:14B	128K	Best for reasoning about documents
DeepSeek-R1:32B	128K	Best overall RAG quality
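A quick back-of-the-envelope check shows whether your retrieved chunks will fit in a given context window. The sketch below assumes roughly 4 characters per token, which is only a rule of thumb for English text, and reserves some tokens for the question and the answer. Both numbers and the function names are assumptions you should adjust. Also note that the window configured in your runtime may be smaller than the model maximum listed in the table.

```python
def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; 4 characters per token is an assumed average for English."""
    return int(num_chars / chars_per_token) + 1


def fits_in_context(
    k: int,
    chunk_size: int,
    context_tokens: int,
    reserved_tokens: int = 1024,
) -> bool:
    """Check whether k chunks of chunk_size characters fit, leaving room for question and answer."""
    needed = estimate_tokens(k * chunk_size) + reserved_tokens
    return needed <= context_tokens


# Three 1000-character chunks against a small 2048-token window.
print(fits_in_context(k=3, chunk_size=1000, context_tokens=2048))
# Ten 4000-character chunks against the same kind of small window.
print(fits_in_context(k=10, chunk_size=4000, context_tokens=4096))
```

The takeaway is that small chunks with a low k fit almost anywhere, while long-context analysis with many large chunks needs both a large-context model and a runtime configured to use that context.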

2. Write Good Questions

❌ Bad Question	✅ Good Question
"Tell me about it"	"Summarize the methodology used in section 3"
"What's in this?"	"What are the three main arguments presented in chapter 2?"
"Is this useful?"	"What evidence does the author provide for their claim on page 15?"

3. Optimize Chunk Size

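The chunk size sets how much text goes into each retrieved passage. The LLM sees the top k chunks per question, so roughly k × chunk_size characters of your documents at once: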

Chunk Size	Best For
500 chars	Short lookup questions ("What is X?")
1000 chars	General Q&A 🟢 Default
2000 chars	Summarization tasks
4000+ chars	Long-context analysis (Qwen 3.6 recommended)
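To see what chunk_size and chunk_overlap actually do, here is a simplified fixed-window chunker. It is a sketch, not how RecursiveCharacterTextSplitter works internally: that splitter tries to break at paragraph and sentence boundaries first, while this version just slides a window over the characters. The function name is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Compare how many chunks a 2500-character text produces at different sizes.
sample = "x" * 2500
for size in (500, 1000, 2000):
    pieces = chunk_text(sample, chunk_size=size, chunk_overlap=size // 5)
    print(size, len(pieces))
```

Smaller chunks mean more, narrower passages to search through, which suits lookup questions. Larger chunks keep more surrounding context together, which suits summarization. The overlap reduces the chance that a sentence is cut in half at a chunk boundary with no chunk containing it whole.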

Common Pitfalls

Problem	Cause	Fix
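"I don't know" to document questions	No relevant chunks retrieved (documents not embedded, or the question's wording doesn't match the text)	Re-save documents in workspace so they are re-embedded, then ask again using terms from the document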
Wrong answers despite having docs	Chunk size too small	Increase chunk_size to 2000+
Very slow document processing	Large files on CPU	Be patient — first embed takes longest
"Model not responding"	Context overflow	Use a model with larger context (Qwen 3.6)
Can't upload PDFs	PDF is scanned/image-based	Use OCR first (tools like marker-pdf)

Next Steps

Set up Open WebUI first (it includes RAG out of the box) → Open WebUI Guide
Try it with Chinese models → Qwen 3.6 is excellent for RAG due to its 262K context
Combine RAG with Function Calling → Chapter 06: Function Calling
Deploy in production → Chapter 05: Production

Part of the Local LLM Guide — the definitive resource for running AI on your own hardware.
