This article was originally published on runaihome.com
RAG (Retrieval-Augmented Generation) is the difference between asking your LLM questions about its training data and asking it questions about your documents. With a local setup, every PDF, contract, manual, or research paper you upload stays on your hardware — no API call to OpenAI, no document upload to a third-party server, no data retention policy to read.
Three paths are covered here: no-code with Open WebUI, no-code with AnythingLLM, and a Python pipeline for developers who want full control. All three run entirely offline once set up.
How RAG actually works
When a cloud service answers questions about a document, one of two things happens: either the whole document is stuffed into the context window (expensive, limited by window size), or it's chunked, embedded into a vector database, and the most relevant chunks are retrieved on each query. The second approach is RAG.
Local RAG runs every component on your machine:
- Ingest: Your document gets split into fixed-size chunks (typically 512 tokens)
- Embed: Each chunk is converted to a vector by an embedding model
- Store: Vectors are saved to a local vector database
- Retrieve: Your question is embedded and matched against the stored vectors, and the top-k most similar chunks are selected
- Answer: The LLM answers using only the retrieved context injected into its prompt
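The retrieve step is the least intuitive of the five, so here is a minimal sketch of it in plain Python. It assumes cosine similarity as the distance measure and uses hypothetical 3-dimensional vectors in place of real embeddings; an actual vector database does the same ranking with indexes built for speed.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical toy vectors; real embedding models emit hundreds of dimensions.
store = [
    ("Warranty lasts 24 months.", [0.9, 0.1, 0.0]),
    ("Reset the router by holding the button.", [0.0, 0.8, 0.6]),
    ("Warranty excludes water damage.", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # stands in for an embedded warranty question
results = top_k(query_vec, store, k=2)
```

With these toy vectors, the two warranty chunks rank above the router chunk, and those two are what would be injected into the prompt.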
The privacy implication: your documents never leave your machine. Embedding happens locally, retrieval happens locally, and the LLM runs locally. Compare this against what common tools actually phone home — the gap is significant.
Pick your embedding model first
The embedding model determines retrieval quality and is the first decision to make. These three run via Ollama, and the first one, nomic-embed-text, runs fine on CPU with no GPU required:
| Model | Size | Params | Context | MTEB Score | Best for |
|---|---|---|---|---|---|
| nomic-embed-text v1.5 | 274 MB | 137M | 8,192 tokens | 62.39 | General use, CPU-only machines |
| mxbai-embed-large | 670 MB | 334M | 512 tokens | 64.68 | Higher accuracy, short document chunks |
| snowflake-arctic-embed2 | ~600 MB | 303M | 8,192 tokens | Competitive MTEB-R | Multilingual documents |
For context on those MTEB scores: nomic-embed-text at 62.39 matches OpenAI's text-embedding-3-small (62.3). mxbai-embed-large at 64.68 matches OpenAI's text-embedding-3-large (64.6). Both run locally at zero marginal cost.
The mxbai-embed-large caveat: its 512-token context window means any chunk longer than roughly 380 words gets truncated. If your documents have dense, long paragraphs, nomic-embed-text's 8,192-token context handles them cleanly. mxbai-embed-large wins on accuracy for short, well-segmented content.
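The "roughly 380 words" figure can be reproduced with a quick estimate. The sketch below assumes a rule of thumb of about 0.75 English words per token; that ratio is an assumption, and real tokenizers vary by language and content, so treat the result as a rough screen rather than an exact limit.

```python
WORDS_PER_TOKEN = 0.75  # assumed rule of thumb for English; real tokenizers vary

def max_words(context_tokens, words_per_token=WORDS_PER_TOKEN):
    """Approximate word capacity of an embedding model's context window."""
    return int(context_tokens * words_per_token)

def risks_truncation(chunk_text, context_tokens, words_per_token=WORDS_PER_TOKEN):
    """True if the chunk probably exceeds the model's context window."""
    return len(chunk_text.split()) > max_words(context_tokens, words_per_token)

short_limit = max_words(512)   # mxbai-embed-large context
long_limit = max_words(8192)   # nomic-embed-text context

sample_chunk = "word " * 400   # a 400-word chunk
too_long_for_mxbai = risks_truncation(sample_chunk, 512)
too_long_for_nomic = risks_truncation(sample_chunk, 8192)
```

Under this assumption a 400-word chunk is flagged for the 512-token model but not for the 8,192-token one.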
Pull whichever you're starting with:
ollama pull nomic-embed-text
# or for higher accuracy:
ollama pull mxbai-embed-large
Embedding models are separate from chat models in Ollama — you need both pulled before any RAG pipeline works.
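To confirm the embedding model responds, you can call Ollama's HTTP API directly. This sketch uses only the standard library and assumes Ollama is listening on its default address, localhost:11434, and that the `/api/embed` endpoint is available in your Ollama version; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"  # assumed default local address

def build_embed_request(text, model="nomic-embed-text"):
    """Build the POST request that asks Ollama to embed one string."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(text, model="nomic-embed-text"):
    """Send the request; needs a running Ollama server with the model pulled."""
    with urllib.request.urlopen(build_embed_request(text, model)) as resp:
        return json.load(resp)["embeddings"][0]

# Building the request does not touch the network; calling embed() does.
request = build_embed_request("What is the warranty period?")
```

Calling `embed("some text")` should return a list of floats. If it fails with a connection error, Ollama is not running; a model-not-found error means the embedding model has not been pulled yet.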
Path 1: Open WebUI — zero config, browser-based
If you already have Open WebUI running with Ollama, RAG is a few settings changes away. If you haven't set it up yet, the full setup walkthrough is at /blog/open-webui-multi-user-auth-family-setup-2026/.
Step 1 — Configure the embedding model:
Admin Panel → Settings → Documents:
- Embedding Model Engine: Ollama
- Embedding Model: nomic-embed-text
- Chunk Size: 512
- Chunk Overlap: 64
- Hybrid Search: toggle on (this blends vector similarity with keyword matching, improving recall for specific terms like product names or version numbers)
- Save
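To see why hybrid search helps with exact terms, here is a toy illustration of blending a vector score with a keyword score. It is not Open WebUI's actual implementation; the scoring functions, the 50/50 weight, and the vector scores are all hypothetical values chosen for the example.

```python
def keyword_score(query, text):
    """Fraction of query terms that appear verbatim in the chunk."""
    terms = query.lower().split()
    words = set(text.lower().split())
    if not terms:
        return 0.0
    return sum(1 for term in terms if term in words) / len(terms)

def hybrid_score(vector_score, query, text, alpha=0.5):
    """Weighted blend: alpha for vector similarity, the rest for keywords."""
    return alpha * vector_score + (1 - alpha) * keyword_score(query, text)

query = "firmware v2.4.1"
# (chunk text, hypothetical vector similarity to the query)
candidates = [
    ("Update notes for firmware v2.4.1 are listed here", 0.60),
    ("General guidance on updating device software", 0.70),
]

vector_only_best = max(candidates, key=lambda c: c[1])[0]
hybrid_best = max(candidates, key=lambda c: hybrid_score(c[1], query, c[0]))[0]
```

In this example vector similarity alone prefers the generic chunk, while the blended score promotes the chunk that contains the exact version string.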
Step 2 — Fix Ollama's default context length (critical):
Ollama defaults to a 2,048-token context window, which silently drops retrieved chunks that fall outside it. For RAG to work well, you need at least 8,192.
Admin Panel → Models → select your chat model → Advanced Parameters → set num_ctx to 8192. For long documents with many retrieved chunks, push this to 16384.
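A rough budget shows why the context setting matters. The sketch below assumes 512-token chunks and reserves 1,024 tokens for the system prompt, the question, and the answer; the reserve is an illustrative assumption, not an Open WebUI or Ollama value.

```python
def chunks_that_fit(num_ctx, chunk_tokens=512, reserved_tokens=1024):
    """Whole chunks that fit once room is reserved for prompt and answer."""
    available = max(num_ctx - reserved_tokens, 0)
    return available // chunk_tokens

# Compare the context sizes discussed above.
budget = {n: chunks_that_fit(n) for n in (2048, 8192, 16384)}
```

Under these assumptions a 2,048-token window holds 2 chunks, 8,192 holds 14, and 16,384 holds 30.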
Step 3 — Create a knowledge base:
Workspace → Knowledge → + New Knowledge → give it a name (e.g., "Product Manuals") → upload files. Open WebUI processes documents asynchronously; wait for the spinner to clear before querying.
Supported formats as of 2026: PDF, DOCX, TXT, Markdown, CSV. Complex DOCX formatting (tracked changes, nested tables) can lose fidelity — plain text and PDF are the most reliable.
Step 4 — Use it in chat:
In a new chat session, type # and the knowledge collection name appears as an autocomplete option. Select it to attach to the session. Every query now retrieves from your indexed documents before the LLM responds.
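Conceptually, attaching a knowledge base means the retrieved chunks are placed into the prompt ahead of your question. The template below is a hypothetical sketch of that idea; Open WebUI uses its own configurable RAG template, and the wording here is only an assumption.

```python
def build_prompt(question, retrieved_chunks):
    """Assemble a prompt that restricts the model to the retrieved context."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "How long is the warranty?",
    ["Warranty lasts 24 months.", "Warranty excludes water damage."],
)
```

Numbering the chunks gives the model a simple way to cite which passage an answer came from.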
One limitation to know: if you change your chunk size or embedding model after documents are already indexed, existing documents in knowledge bases retain their original chunking. New uploads use the updated settings. You'd need to delete and re-upload existing files to re-index them with new settings.
Path 2: AnythingLLM — desktop app, no terminal
AnythingLLM is a desktop application built specifically for document chat. It bundles its own vector database (LanceDB), chunking logic, and a GUI for every step — no Docker, no terminal, drag-and-drop documents. As of May 2026 it has 53,000+ GitHub stars and is actively maintained.
The app itself needs roughly 2 GB RAM. Running a local LLM alongside it requires whatever your chosen Ollama model needs separately.
Install and connect to Ollama:
Download from useanything.com — the installer is around 500 MB. On first launch:
- Settings → LLM Preference → Ollama. The app auto-detects localhost:11434
- Select your chat model (Qwen2.5 7B for a balance of speed and quality; Llama 3.2 3B if you're on a low-VRAM machine)
- Settings → Embedding Preference → Ollama → select nomic-embed-text
- Save and close settings
Create a workspace and upload documents:
Workspaces are the unit of organization — a project folder, chat history, and document collection in one. Click + New Workspace, name it, then drag and drop PDFs into the document panel. AnythingLLM chunks and embeds automatically. When the spinner clears, the documents are queryable.
The workspace isolation model is better than Open WebUI for multi-project use: documents in Workspace A are invisible to Workspace B. If you're running separate projects — client work, personal research, a codebase's documentation — this prevents cross-contamination in retrieval.
The trade-off: AnythingLLM's default chunk size is on the larger side. For documents where you're looking up specific numbers or dates, reducing the chunk size in Settings → Embedder → Chunk Configuration improves precision at the cost of needing more retrieved chunks to cover the same context.
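The chunk-size trade-off is easy to see with a small sliding-window chunker. This is a sketch, not AnythingLLM's chunking logic: it counts words as a stand-in for tokens, and the sizes are illustrative.

```python
def chunk_words(words, chunk_size, overlap):
    """Split a word list into overlapping windows of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the document
    return chunks

# A hypothetical 2,000-word document.
words = [f"w{i}" for i in range(2000)]
large = chunk_words(words, chunk_size=512, overlap=64)
small = chunk_words(words, chunk_size=128, overlap=16)
```

The same document yields 5 large chunks or 18 small ones. The small chunks pinpoint a specific number or date more precisely, but the retriever has to return several of them to supply as much surrounding context as one large chunk.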
Path 3: Python with LangChain + Ollama
This path is for developers building applications or needing full control over the pipeline: custom preprocessing, re-ranking, hybrid retrieval, or integration into existing code.
Install dependencies:
pip install langchain langchain-ollama langchain-community faiss-cpu pypdf
Build the pipeline:
python
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
# Load all PDFs from a directory
loader = PyPDFDirectoryLoader("./docs/")
documents = loader.load()
# Chunk: 512 characters with 64 characters of overlap
# (RecursiveCharacterTextSplitter counts characters by default, not tokens)
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(documents)
# Embed locally with Ollama — 768-dimensional vectors
embeddings = OllamaEmbeddings(model="