SIGNAL

Build a Local RAG Pipeline With Ollama + pgvector — No API Keys, No Cloud

Retrieval-Augmented Generation is one of those ideas that sounds complex until you actually build it. At its core: embed your documents and shove them into a vector database, embed the user's query the same way, find the closest matches, and feed them to the LLM as context. That's it.

The problem? Most tutorials wire this to OpenAI embeddings and Pinecone, meaning you pay per token and your data leaves your machine. Let's fix that.

This guide builds a fully local RAG pipeline:

  • Ollama for the LLM and embeddings
  • PostgreSQL + pgvector as the vector store
  • Python to glue it together

100% offline. No API keys. No cloud.


What You Need

  • Docker (for Postgres + pgvector)
  • Ollama installed locally
  • Python 3.11+
  • ~4 GB free RAM

Pull the models first:

ollama pull nomic-embed-text   # 274MB embedding model
ollama pull llama3.2           # ~2GB, fast on CPU
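A quick sanity check before moving on: Ollama's /api/tags endpoint lists the models you have pulled locally, so you can confirm both downloads landed.

curl -s http://localhost:11434/api/tags | jq '.models[].name'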

Step 1: Spin Up pgvector

pgvector adds a vector column type and similarity search operators to Postgres:

docker run -d \
  --name pgvector \
  -e POSTGRES_PASSWORD=localrag \
  -e POSTGRES_DB=ragdb \
  -p 5432:5432 \
  pgvector/pgvector:pg16

Create the schema (connect via docker exec -it pgvector psql -U postgres -d ragdb):

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  source TEXT,
  chunk_index INT,
  content TEXT,
  embedding vector(768)
);

CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);
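With the extension loaded you can already try the new operators directly. A toy query like this shows what <=> returns (cosine distance, where 0 means identical direction):

SELECT '[1,0,0]'::vector <=> '[0,1,0]'::vector AS distance;  -- 1.0 for orthogonal vectors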

The ivfflat index keeps similarity search fast even with thousands of chunks. One caveat: ivfflat derives its lists from the rows already in the table, so you get better recall if you create (or rebuild) the index after the data is loaded rather than on an empty table.
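ivfflat search is approximate, and by default only one list is probed per query. If retrieval seems to miss obviously relevant chunks, raising ivfflat.probes trades a little speed for recall:

-- scan more lists per query; higher = better recall, slower search
SET ivfflat.probes = 10;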


Step 2: Ingest Documents

pip install psycopg2-binary requests

# ingest.py
import os, sys, requests, psycopg2

OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"
CHUNK_SIZE = 400
CHUNK_OVERLAP = 50

DB = psycopg2.connect(
    host="localhost", port=5432,
    dbname="ragdb", user="postgres", password="localrag"
)

def embed(text):
    r = requests.post(f"{OLLAMA_URL}/api/embeddings",
        json={"model": EMBED_MODEL, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def chunk(text):
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + CHUNK_SIZE])
        start += CHUNK_SIZE - CHUNK_OVERLAP
    return chunks

def ingest_file(path):
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    chunks = chunk(text)
    cur = DB.cursor()
    for i, c in enumerate(chunks):
        cur.execute(
            "INSERT INTO documents (source, chunk_index, content, embedding) "
            "VALUES (%s, %s, %s, %s)",
            (path, i, c, embed(c))
        )
    DB.commit()
    print(f"  {path} -> {len(chunks)} chunks")

if __name__ == "__main__":
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    for root, _, files in os.walk(folder):
        for fname in files:
            if fname.endswith((".txt", ".md")):
                ingest_file(os.path.join(root, fname))

python ingest.py ./my-docs/

A 50-page doc takes about 30 seconds on a Mac mini M4.
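To see what actually landed in the table, a quick count per source file from psql works well:

SELECT source, count(*) AS chunks FROM documents GROUP BY source ORDER BY source;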


Step 3: Query It

# query.py
import sys, requests, psycopg2

OLLAMA_URL = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"
LLM_MODEL   = "llama3.2"
TOP_K = 4

DB = psycopg2.connect(
    host="localhost", port=5432,
    dbname="ragdb", user="postgres", password="localrag"
)

def embed(text):
    r = requests.post(f"{OLLAMA_URL}/api/embeddings",
        json={"model": EMBED_MODEL, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def retrieve(question):
    vec = embed(question)
    cur = DB.cursor()
    cur.execute(
        "SELECT content FROM documents "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (vec, TOP_K)
    )
    return [row[0] for row in cur.fetchall()]

def ask(question):
    context = "\n\n---\n\n".join(retrieve(question))
    prompt = (
        f"Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    r = requests.post(f"{OLLAMA_URL}/api/generate",
        json={"model": LLM_MODEL, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

if __name__ == "__main__":
    print(ask(" ".join(sys.argv[1:])))

python query.py "What does the deployment guide say about rollbacks?"

The <=> operator is pgvector's cosine distance, where smaller means more similar, so ORDER BY ... LIMIT 4 returns the 4 closest chunks; ask() then joins them into the prompt as context for the LLM.
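When you're tuning TOP_K or the chunk size, it helps to see how close the matches actually are. Here's a small debugging sketch (not part of the pipeline; the file name and helper function are made up for illustration) that reuses the connection and embed() from query.py and prints each match with its distance:

# inspect_matches.py -- debugging helper, reuses DB and embed() from query.py
from query import DB, embed

def retrieve_with_scores(question, k=4):
    cur = DB.cursor()
    cur.execute(
        "SELECT source, content, embedding <=> %s::vector AS distance "
        "FROM documents ORDER BY distance LIMIT %s",
        (embed(question), k)
    )
    return cur.fetchall()

if __name__ == "__main__":
    for source, content, distance in retrieve_with_scores("rollback procedure"):
        print(f"{distance:.3f}  {source}  {content[:60]!r}")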


Step 4: Expose It as an API

# server.py
from fastapi import FastAPI
from pydantic import BaseModel
from query import ask

app = FastAPI()

class Q(BaseModel):
    question: str

@app.post("/ask")
def ask_endpoint(q: Q):
    return {"answer": ask(q.question)}
Enter fullscreen mode Exit fullscreen mode
pip install fastapi uvicorn
uvicorn server:app --port 8080

curl -s -X POST http://localhost:8080/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the backup procedures?"}' | jq .answer

Performance on Real Hardware

Mac mini M4, 16GB RAM:

Operation                               Time
Embed 1000 chunks                       ~2 min
Query (embed + retrieve + generate)     3-5 sec
pgvector search on 10k rows             <50 ms

For 100k+ chunks, switch to HNSW indexing:

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

HNSW has faster queries but slower builds and higher RAM usage. ivfflat is fine for most homelab workloads.
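HNSW has a query-time knob too: hnsw.ef_search (default 40) controls how many candidates are examined per query, so you can raise it when recall matters more than latency:

-- examine more candidates per query; higher = better recall, slower search
SET hnsw.ef_search = 100;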


What to Do Next

  • Chunk smarter — split on sentence boundaries instead of fixed character counts (e.g. nltk.sent_tokenize()); see the sketch after this list
  • Hybrid search — combine pgvector with tsvector full-text search for better recall on exact terms
  • Re-ranking — after top-K retrieval, use a cross-encoder to re-rank before sending to the LLM
  • Source citations — the source column is already there; return it so you know which file the answer came from
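For the first item, here's a rough sketch of what a sentence-aware chunk() could look like as a drop-in replacement for the character-based one in ingest.py. It assumes nltk is installed with its punkt tokenizer data downloaded, and it drops the overlap for brevity:

# sentence_chunk.py -- sketch of a sentence-boundary chunker
# assumes: pip install nltk, then nltk.download('punkt') once ('punkt_tab' on newer nltk)
from nltk.tokenize import sent_tokenize

def chunk(text, max_chars=400):
    chunks, current = [], ""
    for sentence in sent_tokenize(text):
        # start a new chunk when adding this sentence would overflow max_chars
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

Swap it in for chunk() in ingest.py and re-ingest, and chunks end on sentence boundaries instead of mid-word.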

The whole stack runs on hardware you probably already own. No API bills, no data leaving your network, no vendor lock-in.

pgvector + nomic-embed-text + llama3.2 = private RAG that actually works.
