
Jovan Chan

Posted on • Originally published at runaihome.com

Local RAG in 2026: Build a Private Document AI That Never Leaves Your Machine

This article was originally published on runaihome.com

RAG (Retrieval-Augmented Generation) is the difference between asking your LLM questions about its training data and asking it questions about your documents. With a local setup, every PDF, contract, manual, or research paper you upload stays on your hardware — no API call to OpenAI, no document upload to a third-party server, no data retention policy to read.

This guide covers three paths: no-code with Open WebUI, no-code with AnythingLLM, and a Python pipeline for developers who want full control. All three run entirely offline once set up.

How RAG actually works

When a cloud service answers questions about a document, one of two things happens: either the whole document is stuffed into the context window (expensive, limited by window size), or it's chunked, embedded into a vector database, and the most relevant chunks are retrieved on each query. The second approach is RAG.

Local RAG runs every component on your machine:

  1. Ingest: Your document gets split into fixed-size chunks (typically 512 tokens)
  2. Embed: Each chunk is converted to a vector by an embedding model
  3. Store: Vectors are saved to a local vector database
  4. Retrieve: Your question is embedded and matched against the stored vectors, and the top-k most similar chunks are selected
  5. Answer: The LLM answers using only the retrieved context injected into its prompt
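
The five steps above can be sketched end to end in plain Python. This is a toy illustration, not what Open WebUI or AnythingLLM run internally: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, chunks are measured in words rather than tokens, the "store" is an in-memory list, and the last step only builds the prompt that would be sent to a local LLM.

```python
import math
from collections import Counter


def chunk_words(text, size=50, overlap=10):
    """Step 1 (ingest): split text into overlapping fixed-size word windows."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks


def toy_embed(text):
    """Step 2 (embed): hypothetical stand-in for a real embedding model."""
    return Counter(w.strip(".,?!").lower() for w in text.split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Step 3 (store): keep (vector, chunk) pairs in memory."""
    return [(toy_embed(c), c) for c in chunks]


def retrieve(index, question, k=2):
    """Step 4 (retrieve): embed the question, return the k most similar chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


def build_prompt(question, context_chunks):
    """Step 5 (answer): inject the retrieved context into the LLM prompt."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


docs = [
    "The warranty covers parts and labor for two years.",
    "Reset the router by holding the button for ten seconds.",
    "Invoices are due within thirty days of delivery.",
]
index = build_index(docs)
top = retrieve(index, "How long is the warranty?", k=1)
prompt = build_prompt("How long is the warranty?", top)
```

In a real pipeline, `toy_embed` would be replaced by a call to a local embedding model and the list by a vector database; the control flow stays the same.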

The privacy implication: your documents never leave your machine. Embedding happens locally, retrieval happens locally, and the LLM runs locally. Compare this against what common tools actually phone home — the gap is significant.

Pick your embedding model first

The embedding model determines retrieval quality and is the first decision to make. These three run via Ollama, and the top option runs fine on CPU with no GPU required:

Model | Size | Params | Context | MTEB Score | Best for
nomic-embed-text v1.5 | 274 MB | 137M | 8,192 tokens | 62.39 | General use, CPU-only machines
mxbai-embed-large | 670 MB | 334M | 512 tokens | 64.68 | Higher accuracy, short document chunks
snowflake-arctic-embed2 | ~600 MB | 303M | 8,192 tokens | Competitive MTEB-R | Multilingual documents

For context on those MTEB scores: nomic-embed-text at 62.39 matches OpenAI's text-embedding-3-small (62.3). mxbai-embed-large at 64.68 matches OpenAI's text-embedding-3-large (64.6). Both run locally at zero marginal cost.

The mxbai-embed-large caveat: its 512-token context window means any chunk longer than roughly 380 words gets truncated. If your documents have dense, long paragraphs, nomic-embed-text's 8,192-token context handles them cleanly. mxbai-embed-large wins on accuracy for short, well-segmented content.
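
A rough pre-flight check can flag chunks that are likely to be truncated. The sketch below assumes about 0.75 words per token, the ratio implied by the 512-token, roughly 380-word figure above. Real tokenizers vary with language and content, so treat the result as an estimate. `estimate_tokens` and `will_truncate` are hypothetical helper names.

```python
import math

WORDS_PER_TOKEN = 0.75  # assumption: rough English average, not a tokenizer measurement


def estimate_tokens(text):
    """Estimate the token count from the whitespace-separated word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)


def will_truncate(text, max_tokens=512):
    """True if the chunk probably exceeds the embedding model's context window."""
    return estimate_tokens(text) > max_tokens


short_chunk = "word " * 300  # 300 words, about 400 estimated tokens
long_chunk = "word " * 500   # 500 words, about 667 estimated tokens
```

Under this estimate, `will_truncate(long_chunk)` is true for a 512-token model, while `will_truncate(long_chunk, max_tokens=8192)` is false.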

Pull whichever you're starting with:

ollama pull nomic-embed-text
# or for higher accuracy:
ollama pull mxbai-embed-large

Embedding models are separate from chat models in Ollama — you need both pulled before any RAG pipeline works.
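
To confirm the embedding model is reachable outside any UI, a request can be sent to Ollama's HTTP API directly. This sketch assumes Ollama is listening on its default address, localhost:11434, and exposes the `/api/embed` endpoint, which accepts a `model` name and an `input` list and returns an `embeddings` array. If the call fails, check the API documentation for your installed version. Only the standard library is used.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"  # assumption: default Ollama address


def build_embed_request(texts, model="nomic-embed-text"):
    """Build the POST request without sending it."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, model="nomic-embed-text"):
    """Send the request to a running Ollama server; return one vector per text."""
    with urllib.request.urlopen(build_embed_request(texts, model)) as response:
        return json.loads(response.read())["embeddings"]


# Building the request needs no server; calling embed() does.
request = build_embed_request(["What does the warranty cover?"])
```

With Ollama running, `len(embed(["hello"])[0])` reports the vector dimension of the model you pulled.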

Path 1: Open WebUI — zero config, browser-based

If you already have Open WebUI running with Ollama, RAG is a few settings changes away. If you haven't set it up yet, the full setup walkthrough is at /blog/open-webui-multi-user-auth-family-setup-2026/.

Step 1 — Configure the embedding model:

Admin Panel → Settings → Documents:

  • Embedding Model Engine: Ollama
  • Embedding Model: nomic-embed-text
  • Chunk Size: 512
  • Chunk Overlap: 64
  • Hybrid Search: toggle on (this blends vector similarity with keyword matching, improving recall for specific terms like product names or version numbers)
  • Save
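
The idea behind hybrid search can be illustrated with a weighted blend of two scores. This is a simplified sketch, not Open WebUI's actual implementation: the keyword score is plain term overlap, the `alpha` weight is a hypothetical parameter, and the vector scores are hard-coded stand-ins for similarities an embedding model would produce.

```python
def keyword_score(query, chunk):
    """Fraction of query terms that appear verbatim in the chunk."""
    terms = set(query.lower().split())
    words = set(chunk.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


def hybrid_score(vector_score, query, chunk, alpha=0.5):
    """Blend semantic similarity with exact keyword matching."""
    return alpha * vector_score + (1 - alpha) * keyword_score(query, chunk)


query = "firmware v2.4.1 changelog"
candidates = {
    # chunk text: stand-in vector similarity (illustrative values)
    "release notes for firmware v2.4.1 changelog and fixes": 0.60,
    "general overview of how device software updates work": 0.70,
}
ranked = sorted(
    candidates,
    key=lambda chunk: hybrid_score(candidates[chunk], query, chunk),
    reverse=True,
)
```

On vector similarity alone the general overview would rank first; the keyword component lifts the chunk that contains the exact version string.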

Step 2 — Fix Ollama's default context length (critical):

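Ollama's default context window is small (2,048 tokens in older releases, 4,096 in some later ones). A prompt that exceeds it is silently truncated, so retrieved chunks can be dropped before the model ever sees them. For RAG to work well, you need at least 8,192.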

Admin Panel → Models → select your chat model → Advanced Parameters → set num_ctx to 8192. For long documents with many retrieved chunks, push this to 16384.
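
The arithmetic behind these numbers is worth checking. The sketch below estimates how many retrieved chunks fit in a given num_ctx. The 1,024 tokens reserved for the system prompt, the question, and the generated answer are an illustrative assumption, not an Ollama or Open WebUI default.

```python
def max_chunks(num_ctx, chunk_tokens=512, reserved_tokens=1024):
    """Number of whole chunks that fit after reserving room for prompt and answer."""
    available = num_ctx - reserved_tokens
    return max(available // chunk_tokens, 0)


for ctx in (2048, 8192, 16384):
    print(ctx, max_chunks(ctx))
# prints:
# 2048 2
# 8192 14
# 16384 30
```

Under these assumptions a 2,048-token window has room for only two 512-token chunks, which explains why retrieval can look broken until num_ctx is raised.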

Step 3 — Create a knowledge base:

Workspace → Knowledge → + New Knowledge → give it a name (e.g., "Product Manuals") → upload files. Open WebUI processes documents asynchronously; wait for the spinner to clear before querying.

Supported formats as of 2026: PDF, DOCX, TXT, Markdown, CSV. Complex DOCX formatting (tracked changes, nested tables) can lose fidelity — plain text and PDF are the most reliable.

Step 4 — Use it in chat:

In a new chat session, type # and the knowledge collection name appears as an autocomplete option. Select it to attach to the session. Every query now retrieves from your indexed documents before the LLM responds.

One limitation to know: if you change your chunk size or embedding model after documents are already indexed, existing documents in knowledge bases retain their original chunking. New uploads use the updated settings. You'd need to delete and re-upload existing files to re-index them with new settings.

Path 2: AnythingLLM — desktop app, no terminal

AnythingLLM is a desktop application built specifically for document chat. It bundles its own vector database (LanceDB), chunking logic, and a GUI for every step — no Docker, no terminal, drag-and-drop documents. As of May 2026 it has 53,000+ GitHub stars and is actively maintained.

The app itself needs roughly 2 GB RAM. Running a local LLM alongside it requires whatever your chosen Ollama model needs separately.

Install and connect to Ollama:

Download from useanything.com — the installer is around 500 MB. On first launch:

  1. Settings → LLM Preference → Ollama. The app auto-detects localhost:11434
  2. Select your chat model (Qwen2.5 7B for a balance of speed and quality; Llama 3.2 3B if you're on a low-VRAM machine)
  3. Settings → Embedding Preference → Ollama → select nomic-embed-text
  4. Save and close settings

Create a workspace and upload documents:

Workspaces are the unit of organization — a project folder, chat history, and document collection in one. Click + New Workspace, name it, then drag and drop PDFs into the document panel. AnythingLLM chunks and embeds automatically. When the spinner clears, the documents are queryable.

The workspace isolation model is better than Open WebUI for multi-project use: documents in Workspace A are invisible to Workspace B. If you're running separate projects — client work, personal research, a codebase's documentation — this prevents cross-contamination in retrieval.
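
Conceptually, workspace isolation means every query is scoped to a single collection. The toy class below illustrates that idea with substring matching in place of vector search; it is not how AnythingLLM or LanceDB store data, and `WorkspaceStore` is a hypothetical name.

```python
from collections import defaultdict


class WorkspaceStore:
    """Toy document store that keeps each workspace's chunks separate."""

    def __init__(self):
        self._chunks = defaultdict(list)

    def add(self, workspace, chunk):
        self._chunks[workspace].append(chunk)

    def search(self, workspace, term):
        """Search only the named workspace; other workspaces are never scanned."""
        chunks = self._chunks.get(workspace, [])
        return [c for c in chunks if term.lower() in c.lower()]


store = WorkspaceStore()
store.add("client-a", "Contract renewal date is 1 March.")
store.add("personal", "Renewal reminder for gym membership.")
hits = store.search("client-a", "renewal")
```

Both chunks mention a renewal, but a query against `client-a` returns only the contract chunk.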

The trade-off: AnythingLLM's default chunk size is on the larger side. For documents where you're looking up specific numbers or dates, reducing the chunk size in Settings → Embedder → Chunk Configuration improves precision at the cost of needing more retrieved chunks to cover the same context.
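
That cost can be put in numbers. Assuming consecutive chunks share a fixed overlap, the count needed to span a passage follows from the stride (chunk size minus overlap). The sizes below are illustrative, not AnythingLLM defaults, and `chunks_to_cover` is a hypothetical helper.

```python
import math


def chunks_to_cover(passage_tokens, chunk_size, overlap=0):
    """Minimum number of consecutive overlapping chunks that span the passage."""
    if passage_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((passage_tokens - chunk_size) / stride)


large = chunks_to_cover(2000, chunk_size=1000)              # 2 chunks
small = chunks_to_cover(2000, chunk_size=250)               # 8 chunks
small_overlap = chunks_to_cover(2000, chunk_size=250, overlap=50)  # 10 chunks
```

Quartering the chunk size roughly quadruples the number of chunks that must be retrieved to cover the same 2,000-token passage, and overlap pushes the count higher still.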

Path 3: Python with LangChain + Ollama

This path is for developers building applications or needing full control over the pipeline: custom preprocessing, re-ranking, hybrid retrieval, or integration into existing code.

Install dependencies:

pip install langchain langchain-ollama langchain-community faiss-cpu pypdf

Build the pipeline:


python
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# Load all PDFs from a directory
loader = PyPDFDirectoryLoader("./docs/")
documents = loader.load()

# Chunk: 512 characters with a 64-character overlap (this splitter counts characters by default, not tokens)
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(documents)

# Embed locally with Ollama — 768-dimensional vectors
embeddings = OllamaEmbeddings(model="
