Build a RAG System with Python and a Local LLM (No API Costs)
RAG (Retrieval-Augmented Generation) is one of the most in-demand LLM skills going into 2026. Every company wants to point an AI at their docs, their codebase, their knowledge base — and get useful answers back.
The typical stack involves OpenAI embeddings + GPT-4 + a vector DB. The typical bill involves a credit card.
Here's how to build the same thing entirely on local hardware: Python + Ollama + ChromaDB. No API keys. No per-token costs. Runs on a laptop or a home server.
What We're Building
A RAG pipeline that:
- Ingests documents (text files, markdown, source code; add a PDF-to-text step if you also need PDFs)
- Embeds them using a local model
- Stores vectors in ChromaDB (local, in-memory or persistent)
- Retrieves relevant chunks on query
- Generates an answer using a local LLM via Ollama
Total cloud cost: $0.
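Here's the shape of the finished thing. Once the LocalRAG class from Step 6 exists (saved as rag.py), day-to-day usage is a three-liner:

from rag import LocalRAG  # the class we build in Step 6

rag = LocalRAG(docs_dir="./my_docs")                 # one-time: ingest, embed, index
print(rag.query("What does the auth module do?"))    # then just ask questions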
Prerequisites
- Python 3.10+
- Ollama installed with at least one model pulled
- 8 GB RAM minimum for small generation models (16 GB or more to run the 14B model used below)
# Install dependencies
pip install chromadb ollama requests
# Pull models — one for embeddings, one for generation
ollama pull nomic-embed-text # Fast, purpose-built embedding model
ollama pull qwen2.5:14b # Generation model
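Before writing any pipeline code, it's worth a quick smoke test that the Ollama server is reachable and both models respond. This sketch hits the same two calls the pipeline relies on (expect a 768-dimensional vector from nomic-embed-text):

import ollama

# Embedding model: should return one dense vector
vec = ollama.embeddings(model="nomic-embed-text", prompt="hello world")["embedding"]
print(f"embedding dimension: {len(vec)}")   # 768 for nomic-embed-text

# Generation model: should return a short completion
reply = ollama.generate(model="qwen2.5:14b", prompt="Reply with the single word: ready")
print(reply["response"])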
Step 1: Document Ingestion
import os
import glob
from pathlib import Path
def load_documents(docs_dir: str) -> list[dict]:
"""
Load text documents from a directory.
Returns list of {content, source, chunk_id} dicts.
"""
documents = []
# Supported formats
patterns = ['**/*.txt', '**/*.md', '**/*.py', '**/*.rst']
for pattern in patterns:
for filepath in glob.glob(os.path.join(docs_dir, pattern), recursive=True):
try:
with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
content = f.read()
if len(content.strip()) < 50:
continue # Skip tiny files
# Chunk the document
chunks = chunk_text(content, chunk_size=500, overlap=50)
for i, chunk in enumerate(chunks):
documents.append({
'content': chunk,
'source': filepath,
'chunk_id': f"{Path(filepath).stem}_{i}"
})
except Exception as e:
print(f"[warn] Skipping {filepath}: {e}")
print(f"[ingest] Loaded {len(documents)} chunks from {docs_dir}")
return documents
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
"""Split text into overlapping chunks by word count."""
words = text.split()
chunks = []
i = 0
while i < len(words):
chunk = ' '.join(words[i:i + chunk_size])
chunks.append(chunk)
i += chunk_size - overlap # Slide with overlap
return chunks
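A quick run of the chunker on a synthetic document makes the overlap behaviour concrete (the numbers below come straight from this toy input, not from any real corpus):

doc = " ".join(f"w{i}" for i in range(1200))          # 1200-word dummy document
chunks = chunk_text(doc, chunk_size=500, overlap=50)

print(len(chunks))            # 3 chunks: words 0-499, 450-949, 900-1199
print(chunks[1].split()[0])   # "w450" -- each chunk starts 50 words before the previous one ended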
Step 2: Local Embeddings with Ollama
nomic-embed-text is a purpose-built embedding model — fast, small (a ~274 MB download), and genuinely good at semantic similarity.
import ollama
def embed_texts(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
"""
Generate embeddings for a list of texts using Ollama.
Returns list of embedding vectors.
"""
embeddings = []
for i, text in enumerate(texts):
if i % 50 == 0:
print(f"[embed] Processing chunk {i}/{len(texts)}...")
response = ollama.embeddings(model=model, prompt=text)
embeddings.append(response['embedding'])
return embeddings
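To confirm the embeddings actually capture meaning, compare a related and an unrelated pair of sentences. This is a minimal sketch with the cosine math written out in plain Python; the exact scores will vary, but the related pair should score clearly higher:

import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

related_a, related_b, unrelated = embed_texts([
    "How do I reset my password?",
    "Steps to recover account credentials",
    "The quarterly sales report is due Friday",
])
print(cosine(related_a, related_b))   # semantically related pair
print(cosine(related_a, unrelated))   # unrelated pair -- expect a lower score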
Step 3: Vector Storage with ChromaDB
import chromadb
def build_vector_store(
documents: list[dict],
embeddings: list[list[float]],
collection_name: str = "local_rag",
persist_dir: str = "./chroma_db"
) -> chromadb.Collection:
"""
Store document chunks and their embeddings in ChromaDB.
"""
client = chromadb.PersistentClient(path=persist_dir)
# Delete existing collection if rebuilding
try:
client.delete_collection(collection_name)
except Exception:
pass
collection = client.create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"} # Cosine similarity
)
# Batch insert
batch_size = 100
for i in range(0, len(documents), batch_size):
batch_docs = documents[i:i + batch_size]
batch_embeddings = embeddings[i:i + batch_size]
collection.add(
ids=[doc['chunk_id'] for doc in batch_docs],
embeddings=batch_embeddings,
documents=[doc['content'] for doc in batch_docs],
metadatas=[{'source': doc['source']} for doc in batch_docs]
)
print(f"[store] Indexed {len(documents)} chunks into ChromaDB")
return collection
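Because the client is persistent, there's no need to rebuild the index on every run. Once it has been built, later runs can simply reopen the stored collection (a small sketch using ChromaDB's get_collection, which raises if the collection doesn't exist yet):

def load_vector_store(
    collection_name: str = "local_rag",
    persist_dir: str = "./chroma_db"
) -> chromadb.Collection:
    """Reopen an existing ChromaDB collection without re-indexing."""
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_collection(collection_name)
    print(f"[store] Reopened '{collection_name}' with {collection.count()} chunks")
    return collection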
Step 4: Retrieval
def retrieve_context(
query: str,
collection: chromadb.Collection,
embed_model: str = "nomic-embed-text",
n_results: int = 5
) -> list[dict]:
"""
Find the most relevant document chunks for a query.
"""
# Embed the query using the same model
query_embedding = ollama.embeddings(model=embed_model, prompt=query)['embedding']
results = collection.query(
query_embeddings=[query_embedding],
n_results=n_results,
include=['documents', 'metadatas', 'distances']
)
context_chunks = []
for doc, meta, dist in zip(
results['documents'][0],
results['metadatas'][0],
results['distances'][0]
):
context_chunks.append({
'content': doc,
'source': meta.get('source', 'unknown'),
'relevance': round(1 - dist, 3) # Convert distance to similarity
})
return context_chunks
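An optional refinement: drop chunks whose similarity falls below a cutoff before they ever reach the prompt, so marginal matches don't dilute the answer. The 0.5 threshold here is an arbitrary starting point to tune against your own corpus:

def filter_by_relevance(chunks: list[dict], min_relevance: float = 0.5) -> list[dict]:
    """Keep only chunks above a similarity cutoff, falling back to the single best one."""
    kept = [c for c in chunks if c['relevance'] >= min_relevance]
    return kept or chunks[:1]

Call it on the output of retrieve_context before handing the chunks to the generation step.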
Step 5: Generation
import requests
import json
def generate_answer(
query: str,
context_chunks: list[dict],
model: str = "qwen2.5:14b",
ollama_url: str = "http://localhost:11434"
) -> str:
"""
Generate an answer using retrieved context and a local LLM.
"""
# Build context block
context_text = "\n\n---\n\n".join([
f"Source: {chunk['source']}\n{chunk['content']}"
for chunk in context_chunks
])
prompt = f"""You are a helpful assistant. Answer the question using ONLY the provided context.
If the answer isn't in the context, say so clearly. Do not make up information.
CONTEXT:
{context_text}
QUESTION: {query}
ANSWER:"""
response = requests.post(
f"{ollama_url}/api/generate",
json={
"model": model,
"prompt": prompt,
"stream": False,
"options": {"temperature": 0.1} # Low temp for factual Q&A
},
timeout=120
)
response.raise_for_status()
return response.json()['response'].strip()
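For interactive use you can stream tokens as they're generated instead of waiting for the full answer. With "stream": True, Ollama's /api/generate returns one JSON object per line until it sends "done": true. A sketch of a streaming variant (same model and URL arguments as above, but it takes the already-built prompt):

def generate_streaming(prompt: str, model: str = "qwen2.5:14b",
                       ollama_url: str = "http://localhost:11434") -> str:
    """Stream a completion from Ollama, printing tokens as they arrive."""
    pieces = []
    with requests.post(
        f"{ollama_url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": True,
              "options": {"temperature": 0.1}},
        stream=True,
        timeout=120,
    ) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            pieces.append(chunk.get("response", ""))
            if chunk.get("done"):
                break
    print()
    return "".join(pieces)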
Step 6: Putting It All Together
class LocalRAG:
"""Full local RAG pipeline — zero cloud dependencies."""
def __init__(
self,
docs_dir: str,
persist_dir: str = "./chroma_db",
embed_model: str = "nomic-embed-text",
gen_model: str = "qwen2.5:14b",
collection_name: str = "local_rag"
):
self.embed_model = embed_model
self.gen_model = gen_model
self.collection_name = collection_name
self.persist_dir = persist_dir
print(f"[rag] Initializing with docs from: {docs_dir}")
# Load and chunk documents
documents = load_documents(docs_dir)
# Generate embeddings
print(f"[rag] Embedding {len(documents)} chunks...")
texts = [doc['content'] for doc in documents]
embeddings = embed_texts(texts, model=embed_model)
# Store in ChromaDB
self.collection = build_vector_store(
documents, embeddings,
collection_name=collection_name,
persist_dir=persist_dir
)
print("[rag] Ready.")
def query(self, question: str, n_context: int = 5, verbose: bool = False) -> str:
"""Answer a question using local retrieval + generation."""
# Retrieve relevant chunks
context = retrieve_context(
question, self.collection,
embed_model=self.embed_model,
n_results=n_context
)
if verbose:
print(f"\n[retrieve] Top {len(context)} chunks:")
for c in context:
print(f" [{c['relevance']:.2f}] {c['source']}: {c['content'][:80]}...")
# Generate answer
return generate_answer(question, context, model=self.gen_model)
# --- Usage ---
if __name__ == "__main__":
import sys
docs_dir = sys.argv[1] if len(sys.argv) > 1 else "./docs"
rag = LocalRAG(docs_dir=docs_dir)
print("\nLocal RAG ready. Type your questions (Ctrl+C to exit):\n")
while True:
try:
question = input("Q: ").strip()
if not question:
continue
answer = rag.query(question, verbose=True)
print(f"\nA: {answer}\n")
except KeyboardInterrupt:
print("\nDone.")
break
Running It
# Index your documents
python rag.py ./my_docs
# Output:
# [ingest] Loaded 342 chunks from ./my_docs
# [rag] Embedding 342 chunks...
# [embed] Processing chunk 0/342...
# [embed] Processing chunk 50/342...
# [store] Indexed 342 chunks into ChromaDB
# [rag] Ready.
#
# Local RAG ready. Type your questions:
#
# Q: What does the authentication module do?
# [retrieve] Top 5 chunks:
# [0.94] ./my_docs/auth.md: The authentication module handles...
# A: The authentication module handles JWT token validation and...
Performance on Local Hardware
Tested on an Intel tower, Ubuntu 24.04, 32 GB RAM, no GPU:
| Operation | Time | Notes |
|---|---|---|
| Embed 100 chunks | ~8s | nomic-embed-text, CPU |
| Embed 1000 chunks | ~75s | One-time indexing cost |
| Retrieval query | <100ms | ChromaDB is fast |
| Generation (14B) | 10-20s | Depends on answer length |
| Total Q&A latency | ~15-25s | Perfectly fine for async use |
For real-time applications, run the indexing once and keep the collection persistent. Retrieval is nearly instant — only generation adds latency.
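A minimal sketch of what that looks like, assuming the persist_dir from the first run is still on disk. It skips ingestion and embedding entirely and reattaches to the stored collection (the LocalRAGReader name is mine, not part of the code above; drop it into rag.py below the LocalRAG class):

class LocalRAGReader(LocalRAG):
    """Reuse an already-built index instead of re-indexing on startup."""
    def __init__(self, persist_dir: str = "./chroma_db",
                 embed_model: str = "nomic-embed-text",
                 gen_model: str = "qwen2.5:14b",
                 collection_name: str = "local_rag"):
        # Deliberately skip LocalRAG.__init__ -- no ingestion, no embedding
        self.embed_model = embed_model
        self.gen_model = gen_model
        client = chromadb.PersistentClient(path=persist_dir)
        self.collection = client.get_collection(collection_name)
        print(f"[rag] Reattached to '{collection_name}' ({self.collection.count()} chunks)")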
Drop-In OpenAI Replacement
If you have existing code using OpenAI's embedding API, swap it out:
# Before (OpenAI)
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(input=text, model="text-embedding-3-small")
embedding = response.data[0].embedding
# After (Local Ollama — same result, zero cost)
import ollama
response = ollama.embeddings(model="nomic-embed-text", prompt=text)
embedding = response['embedding']
One caveat: the two models don't produce compatible vectors, so re-index your corpus after switching. The calling code barely changes, and the API cost drops to zero.
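If you want both paths available while you migrate, a thin wrapper can switch backends with an environment variable. This is a hypothetical helper, not part of the pipeline above, and the OpenAI branch still needs an API key; since the two backends produce incompatible vectors, pick one per index:

import os

def embed_one(text: str) -> list[float]:
    """Route a single embedding call to Ollama (default) or OpenAI."""
    if os.getenv("EMBED_BACKEND", "ollama") == "openai":
        from openai import OpenAI
        client = OpenAI()
        resp = client.embeddings.create(input=text, model="text-embedding-3-small")
        return resp.data[0].embedding
    import ollama
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]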
What to Build With This
| Use case | Index target | Value |
|---|---|---|
| Codebase Q&A | Your repo | Dev productivity |
| Docs chatbot | Product docs | Customer support |
| Research assistant | PDF papers | Knowledge work |
| Log analysis | Server logs | Ops tooling |
| Personal knowledge base | Notes/Obsidian | Second brain |
All of these are client deliverables. All run on a $600 desktop. All cost $0/month in API fees.
Full Stack Summary
Documents → chunk_text() → embed_texts() → ChromaDB
↓
Query → embed query (nomic-embed-text) → ChromaDB.query() → top-k chunks
↓
generate_answer() → Ollama → Response
No cloud. No vendor lock-in. No surprise bills.
If you want to pair this with a persistent API server, check out my guide on running a local AI coding agent with Ollama — the setup is identical, just point the generation step at the same Ollama instance.
Drop a comment with what you're indexing — always curious what people are pointing RAG at.