Retrieval-Augmented Generation is one of those ideas that sounds complex until you actually build it. At its core: shove documents into a vector database, embed a user query the same way, find the closest matches, and feed them to an LLM as context. That's it.
The problem? Most tutorials wire this to OpenAI embeddings and Pinecone, meaning you pay per token and your data leaves your machine. Let's fix that.
This guide builds a fully local RAG pipeline:
- Ollama for the LLM and embeddings
- PostgreSQL + pgvector as the vector store
- Python to glue it together
100% offline. No API keys. No cloud.
## What You Need
- Docker (for Postgres + pgvector)
- Ollama installed locally
- Python 3.11+
- ~4 GB free RAM
Pull the models first:

```bash
ollama pull nomic-embed-text   # 274 MB embedding model
ollama pull llama3.2           # ~2 GB, fast on CPU
```
## Step 1: Spin Up pgvector
pgvector adds a vector column type and similarity search operators to Postgres:
```bash
docker run -d \
  --name pgvector \
  -e POSTGRES_PASSWORD=localrag \
  -e POSTGRES_DB=ragdb \
  -p 5432:5432 \
  pgvector/pgvector:pg16
```
Create the schema (connect via `docker exec -it pgvector psql -U postgres -d ragdb`):
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    source TEXT,
    chunk_index INT,
    content TEXT,
    embedding vector(768)
);

CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
The `ivfflat` index keeps similarity search fast even with thousands of chunks.
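One knob worth knowing about: `ivfflat` only scans a subset of the lists at query time, controlled by pgvector's `ivfflat.probes` setting (default 1). If retrieval misses obviously relevant chunks, raising it per session trades speed for recall:

```sql
-- Scan more lists per query: better recall, slower search.
-- Values around sqrt(lists) are a common starting point.
SET ivfflat.probes = 10;
```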
## Step 2: Ingest Documents
```bash
pip install psycopg2-binary requests
```
```python
# ingest.py
import os
import sys

import psycopg2
import requests

OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"
CHUNK_SIZE = 400
CHUNK_OVERLAP = 50

DB = psycopg2.connect(
    host="localhost", port=5432,
    dbname="ragdb", user="postgres", password="localrag"
)

def embed(text):
    r = requests.post(f"{OLLAMA_URL}/api/embeddings",
                      json={"model": EMBED_MODEL, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def chunk(text):
    # Sliding window: each chunk starts CHUNK_SIZE - CHUNK_OVERLAP
    # characters after the previous one
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + CHUNK_SIZE])
        start += CHUNK_SIZE - CHUNK_OVERLAP
    return chunks

def ingest_file(path):
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    chunks = chunk(text)
    cur = DB.cursor()
    for i, c in enumerate(chunks):
        cur.execute(
            "INSERT INTO documents (source, chunk_index, content, embedding) "
            "VALUES (%s, %s, %s, %s::vector)",
            # pgvector accepts the '[0.1, 0.2, ...]' string form, so
            # serialize the Python list before inserting
            (path, i, c, str(embed(c)))
        )
    DB.commit()
    cur.close()
    print(f" {path} -> {len(chunks)} chunks")

if __name__ == "__main__":
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    for root, _, files in os.walk(folder):
        for fname in files:
            if fname.endswith((".txt", ".md")):
                ingest_file(os.path.join(root, fname))
```
```bash
python ingest.py ./my-docs/
```
A 50-page doc takes about 30 seconds on a Mac mini M4.
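Before kicking off a long ingest run, it helps to sanity-check the sliding-window math. A standalone copy of the `chunk()` logic from `ingest.py`, exercised on a dummy document:

```python
CHUNK_SIZE = 400
CHUNK_OVERLAP = 50

def chunk(text):
    # Same sliding window as ingest.py: each chunk starts
    # CHUNK_SIZE - CHUNK_OVERLAP = 350 characters after the last
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + CHUNK_SIZE])
        start += CHUNK_SIZE - CHUNK_OVERLAP
    return chunks

text = "x" * 1000
parts = chunk(text)
print(len(parts))                        # 3
print([len(p) for p in parts])           # [400, 400, 300]
print(parts[0][-50:] == parts[1][:50])   # True: 50-character overlap
```

Roughly one chunk per 350 characters, so the `len(parts)` count also doubles as an estimate of how many embedding calls an ingest will make.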
## Step 3: Query It
```python
# query.py
import sys

import psycopg2
import requests

OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3.2"
TOP_K = 4

DB = psycopg2.connect(
    host="localhost", port=5432,
    dbname="ragdb", user="postgres", password="localrag"
)

def embed(text):
    r = requests.post(f"{OLLAMA_URL}/api/embeddings",
                      json={"model": EMBED_MODEL, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def retrieve(question):
    vec = embed(question)
    cur = DB.cursor()
    cur.execute(
        "SELECT content FROM documents "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        # pass the vector in pgvector's '[...]' string form
        (str(vec), TOP_K)
    )
    return [row[0] for row in cur.fetchall()]

def ask(question):
    context = "\n\n---\n\n".join(retrieve(question))
    prompt = (
        f"Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    r = requests.post(f"{OLLAMA_URL}/api/generate",
                      json={"model": LLM_MODEL, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

if __name__ == "__main__":
    print(ask(" ".join(sys.argv[1:])))
```
```bash
python query.py "What does the deployment guide say about rollbacks?"
```
The `<=>` operator is pgvector's cosine distance. `ORDER BY ... LIMIT 4` pulls the 4 most relevant chunks and feeds them as context to the LLM.
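For intuition about what `<=>` ranks by: cosine distance is 1 minus the cosine of the angle between two vectors, so it ranges from 0 (same direction) to 2 (opposite direction), ignoring magnitude. The same math in plain Python:

```python
import math

def cosine_distance(a, b):
    # What pgvector's <=> operator computes: 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1, 0], [1, 0]))   # 0.0 (same direction)
print(cosine_distance([1, 0], [0, 1]))   # 1.0 (orthogonal)
print(cosine_distance([1, 0], [-1, 0]))  # 2.0 (opposite)
```

Lower is better, which is why the query orders ascending.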
## Step 4: Expose It as an API
```python
# server.py
from fastapi import FastAPI
from pydantic import BaseModel

from query import ask

app = FastAPI()

class Q(BaseModel):
    question: str

@app.post("/ask")
def ask_endpoint(q: Q):
    return {"answer": ask(q.question)}
```

Install and run:

```bash
pip install fastapi uvicorn
uvicorn server:app --port 8080
```

Then test it:

```bash
curl -s -X POST http://localhost:8080/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the backup procedures?"}' | jq .answer
```
## Performance on Real Hardware
Mac mini M4, 16GB RAM:
| Operation | Time |
|---|---|
| Embed 1000 chunks | ~2 min |
| Query (embed + retrieve + generate) | 3-5 sec |
| pgvector search on 10k rows | <50ms |
For 100k+ chunks, switch to HNSW indexing:
```sql
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
HNSW has faster queries but slower builds and higher RAM usage; `ivfflat` is fine for most homelab workloads.
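A back-of-envelope way to gauge the scale: each 768-dimension vector is stored as float4 components, so the raw vectors alone for 100k chunks come to roughly 300 MB before any index overhead (HNSW adds its graph links on top of this):

```python
DIMS = 768            # nomic-embed-text output size
BYTES_PER_FLOAT = 4   # pgvector stores float4 components
rows = 100_000

vector_bytes = rows * DIMS * BYTES_PER_FLOAT
print(f"{vector_bytes / 1024**2:.0f} MB")  # ~293 MB of raw vector data
```

If that number fits comfortably in RAM, HNSW is worth the build time; if not, stick with `ivfflat`.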
## What to Do Next
- **Chunk smarter** — use sentence boundaries instead of fixed character counts (`nltk.sent_tokenize()`)
- **Hybrid search** — combine pgvector with `tsvector` full-text search for better recall on exact terms
- **Re-ranking** — after top-K retrieval, use a cross-encoder to re-rank before sending to the LLM
- **Source citations** — the `source` column is already there; return it so you know which file the answer came from
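The first suggestion can be prototyped without pulling in nltk. A naive sketch: split on sentence-ending punctuation with a regex, then pack whole sentences into chunks up to the size limit (`nltk.sent_tokenize()` handles abbreviations and other edge cases this regex will miss):

```python
import re

CHUNK_SIZE = 400

def sentence_chunks(text, max_chars=CHUNK_SIZE):
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        # Start a new chunk when adding this sentence would overflow
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

doc = "Rollbacks are automatic. Backups run nightly. Restores need approval."
print(sentence_chunks(doc, max_chars=40))  # one sentence per chunk here
```

Unlike the fixed-window version, no chunk ever cuts a sentence in half, which tends to give the embedding model cleaner input.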
The whole stack runs on hardware you probably already own. No API bills, no data leaving your network, no vendor lock-in.
pgvector + nomic-embed-text + llama3.2 = private RAG that actually works.