Retrieval-Augmented Generation is one of those ideas that sounds complex until you actually build it. At its core: shove documents into a vector database, embed a user query the same way, find the closest matches, and feed them to an LLM as context. That's it.
The problem? Most tutorials wire this to OpenAI embeddings and Pinecone, meaning you pay per token and your data leaves your machine. Let's fix that.
This guide builds a fully local RAG pipeline:
- Ollama for the LLM and embeddings
- PostgreSQL + pgvector as the vector store
- Python to glue it together
100% offline. No API keys. No cloud.
## What You Need
- Docker (for Postgres + pgvector)
- Ollama installed locally
- Python 3.11+
- ~4 GB free RAM
Pull the models first:

```bash
ollama pull nomic-embed-text   # 274 MB embedding model
ollama pull llama3.2           # ~2 GB, fast on CPU
```
## Step 1: Spin Up pgvector
pgvector adds a vector column type and similarity search operators to Postgres:
```bash
docker run -d \
  --name pgvector \
  -e POSTGRES_PASSWORD=localrag \
  -e POSTGRES_DB=ragdb \
  -p 5432:5432 \
  pgvector/pgvector:pg16
```
Create the schema (connect via `docker exec -it pgvector psql -U postgres -d ragdb`):
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    source TEXT,
    chunk_index INT,
    content TEXT,
    embedding vector(768)
);

CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
The `ivfflat` index keeps similarity search fast even with thousands of chunks.
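One knob worth knowing about: `ivfflat` only scans a subset of the lists at query time, controlled by pgvector's `ivfflat.probes` setting (default 1). If retrieval misses obviously relevant chunks, raising it per session trades speed for recall:

```sql
-- Scan more lists per query: better recall, slower search.
-- Values around sqrt(lists) are a common starting point.
SET ivfflat.probes = 10;
```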
## Step 2: Ingest Documents
```bash
pip install psycopg2-binary requests
```
```python
# ingest.py
import os
import sys

import psycopg2
import requests

OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"
CHUNK_SIZE = 400
CHUNK_OVERLAP = 50

DB = psycopg2.connect(
    host="localhost", port=5432,
    dbname="ragdb", user="postgres", password="localrag"
)

def embed(text):
    r = requests.post(f"{OLLAMA_URL}/api/embeddings",
                      json={"model": EMBED_MODEL, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def chunk(text):
    # Sliding window: each chunk starts CHUNK_SIZE - CHUNK_OVERLAP
    # characters after the previous one
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + CHUNK_SIZE])
        start += CHUNK_SIZE - CHUNK_OVERLAP
    return chunks

def ingest_file(path):
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    chunks = chunk(text)
    cur = DB.cursor()
    for i, c in enumerate(chunks):
        cur.execute(
            "INSERT INTO documents (source, chunk_index, content, embedding) "
            "VALUES (%s, %s, %s, %s::vector)",
            # pgvector accepts the '[0.1, 0.2, ...]' string form, so
            # serialize the Python list before inserting
            (path, i, c, str(embed(c)))
        )
    DB.commit()
    cur.close()
    print(f" {path} -> {len(chunks)} chunks")

if __name__ == "__main__":
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    for root, _, files in os.walk(folder):
        for fname in files:
            if fname.endswith((".txt", ".md")):
                ingest_file(os.path.join(root, fname))
```
```bash
python ingest.py ./my-docs/
```
A 50-page doc takes about 30 seconds on a Mac mini M4.
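Before kicking off a long ingest run, it helps to sanity-check the sliding-window math. A standalone copy of the `chunk()` logic from `ingest.py`, exercised on a dummy document:

```python
CHUNK_SIZE = 400
CHUNK_OVERLAP = 50

def chunk(text):
    # Same sliding window as ingest.py: each chunk starts
    # CHUNK_SIZE - CHUNK_OVERLAP = 350 characters after the last
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + CHUNK_SIZE])
        start += CHUNK_SIZE - CHUNK_OVERLAP
    return chunks

text = "x" * 1000
parts = chunk(text)
print(len(parts))                        # 3
print([len(p) for p in parts])           # [400, 400, 300]
print(parts[0][-50:] == parts[1][:50])   # True: 50-character overlap
```

Roughly one chunk per 350 characters, so the `len(parts)` count also doubles as an estimate of how many embedding calls an ingest will make.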
## Step 3: Query It
```python
# query.py
import sys

import psycopg2
import requests

OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"
LLM_MODEL = "llama3.2"
TOP_K = 4

DB = psycopg2.connect(
    host="localhost", port=5432,
    dbname="ragdb", user="postgres", password="localrag"
)

def embed(text):
    r = requests.post(f"{OLLAMA_URL}/api/embeddings",
                      json={"model": EMBED_MODEL, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def retrieve(question):
    vec = embed(question)
    cur = DB.cursor()
    cur.execute(
        "SELECT content FROM documents "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        # pass the vector in pgvector's '[...]' string form
        (str(vec), TOP_K)
    )
    return [row[0] for row in cur.fetchall()]

def ask(question):
    context = "\n\n---\n\n".join(retrieve(question))
    prompt = (
        f"Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    r = requests.post(f"{OLLAMA_URL}/api/generate",
                      json={"model": LLM_MODEL, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

if __name__ == "__main__":
    print(ask(" ".join(sys.argv[1:])))
```
```bash
python query.py "What does the deployment guide say about rollbacks?"
```
The `<=>` operator is pgvector's cosine distance. `ORDER BY ... LIMIT 4` pulls the 4 most relevant chunks and feeds them as context to the LLM.
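For intuition about what `<=>` ranks by: cosine distance is 1 minus the cosine of the angle between two vectors, so it ranges from 0 (same direction) to 2 (opposite direction), ignoring magnitude. The same math in plain Python:

```python
import math

def cosine_distance(a, b):
    # What pgvector's <=> operator computes: 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1, 0], [1, 0]))   # 0.0 (same direction)
print(cosine_distance([1, 0], [0, 1]))   # 1.0 (orthogonal)
print(cosine_distance([1, 0], [-1, 0]))  # 2.0 (opposite)
```

Lower is better, which is why the query orders ascending.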
## Step 4: Expose It as an API
```python
# server.py
from fastapi import FastAPI
from pydantic import BaseModel

from query import ask

app = FastAPI()

class Q(BaseModel):
    question: str

@app.post("/ask")
def ask_endpoint(q: Q):
    return {"answer": ask(q.question)}
```

Install and run:

```bash
pip install fastapi uvicorn
uvicorn server:app --port 8080
```

Then test it:

```bash
curl -s -X POST http://localhost:8080/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the backup procedures?"}' | jq .answer
```
## Performance on Real Hardware
Mac mini M4, 16GB RAM:
| Operation | Time |
|---|---|
| Embed 1000 chunks | ~2 min |
| Query (embed + retrieve + generate) | 3-5 sec |
| pgvector search on 10k rows | <50ms |
For 100k+ chunks, switch to HNSW indexing:
```sql
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
HNSW has faster queries but slower builds and higher RAM usage; `ivfflat` is fine for most homelab workloads.
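A back-of-envelope way to gauge the scale: each 768-dimension vector is stored as float4 components, so the raw vectors alone for 100k chunks come to roughly 300 MB before any index overhead (HNSW adds its graph links on top of this):

```python
DIMS = 768            # nomic-embed-text output size
BYTES_PER_FLOAT = 4   # pgvector stores float4 components
rows = 100_000

vector_bytes = rows * DIMS * BYTES_PER_FLOAT
print(f"{vector_bytes / 1024**2:.0f} MB")  # ~293 MB of raw vector data
```

If that number fits comfortably in RAM, HNSW is worth the build time; if not, stick with `ivfflat`.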
## What to Do Next
- **Chunk smarter** — use sentence boundaries instead of fixed character counts (`nltk.sent_tokenize()`)
- **Hybrid search** — combine pgvector with `tsvector` full-text search for better recall on exact terms
- **Re-ranking** — after top-K retrieval, use a cross-encoder to re-rank before sending to the LLM
- **Source citations** — the `source` column is already there; return it so you know which file the answer came from
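The first suggestion can be prototyped without pulling in nltk. A naive sketch: split on sentence-ending punctuation with a regex, then pack whole sentences into chunks up to the size limit (`nltk.sent_tokenize()` handles abbreviations and other edge cases this regex will miss):

```python
import re

CHUNK_SIZE = 400

def sentence_chunks(text, max_chars=CHUNK_SIZE):
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        # Start a new chunk when adding this sentence would overflow
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

doc = "Rollbacks are automatic. Backups run nightly. Restores need approval."
print(sentence_chunks(doc, max_chars=40))  # one sentence per chunk here
```

Unlike the fixed-window version, no chunk ever cuts a sentence in half, which tends to give the embedding model cleaner input.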
The whole stack runs on hardware you probably already own. No API bills, no data leaving your network, no vendor lock-in.
pgvector + nomic-embed-text + llama3.2 = private RAG that actually works.