Companion code: MukundaKatta/ragvitals-gemma-demo. The pgai + Ollama path is the demo/pgai_ollama_run.py entry point.
If you have ever shipped RAG to production, you know the pattern: the day you swap a model, change an embedder, or re-index the corpus, something you did not expect will move. Faithfulness drops. Retrieval recall sags. Query intent shifts under your nose. Most of the time you find out from a support ticket, not a dashboard.
The point of this post is to show that you can build the whole pipeline on open source and still have the observability part be a first-class citizen, not an afterthought you bolt on later. The stack:
- pgvector for vector storage
- pgai for embedding and generation, run inline as SQL functions
- Ollama for serving Gemma 2 9B (generator), Llama 3.1 8B (judge), and Nomic Embed (embedder), all locally
- ragvitals for a 5-dimensional drift report over every call
No managed embedding endpoint. No API key. The model weights live on your machine, the vectors live in your Postgres, and the alarm logic lives in a tiny Python library you can read end to end.
Why Postgres for the vector layer
The temptation when you start a RAG project is to reach for a dedicated vector database. Pinecone, Weaviate, Qdrant, Milvus. They are good products. They are also another piece of infrastructure to operate, secure, and budget.
pgvector is good enough for most teams. pgvectorscale's StreamingDiskANN index scales to billions of vectors without leaving Postgres. pgai puts the embedding and generation calls inside the database itself, so your retrieval-and-generation step is one round trip, not three. Backups, migrations, observability, access control, and replication all use the same Postgres tooling you already know.
For a team that already runs Postgres, this is the lowest-friction RAG stack on the planet.
The setup
docker run --rm -d -p 5432:5432 -e POSTGRES_PASSWORD=postgres \
timescale/timescaledb-ha:pg17-all
# Inside the container:
psql -U postgres -c "CREATE EXTENSION vector;"
psql -U postgres -c "CREATE EXTENSION ai CASCADE;"
# Locally:
ollama pull gemma2:9b
ollama pull llama3.1:8b
ollama pull nomic-embed-text
That gets you a working Postgres with both extensions and three Ollama-served models ready on http://localhost:11434.
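Before building anything on top, it is worth one smoke test to confirm Postgres can actually reach Ollama through pgai. A minimal sketch, assuming psycopg 3 and the defaults above; if Postgres runs in Docker and Ollama on the host, you may need to point pgai at the host (for example via host.docker.internal), depending on your pgai version and Docker networking:
import psycopg

# One embedding call routed through pgai inside Postgres.
with psycopg.connect("postgresql://postgres:postgres@localhost:5432/postgres") as conn:
    dims = conn.execute(
        "SELECT vector_dims(ai.ollama_embed('nomic-embed-text', 'hello world')::vector)"
    ).fetchone()[0]
    print(dims)  # expect 768, matching the vector(768) column below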
The schema
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS ai CASCADE;
CREATE TABLE rag_docs (
id text PRIMARY KEY,
content text NOT NULL,
embedding vector(768)
);
That's it. One table, one vector column. No separate embedding service, no Redis cache, no message queue.
Embed on insert with pgai
ai.ollama_embed is a SQL function (pgai installs everything under the ai schema) that calls your local Ollama server and returns the embedding as a vector. Insert rows with the embedding derived directly from the content:
INSERT INTO rag_docs (id, content, embedding)
VALUES (
'doc-prompt-cache-1',
'Anthropic prompt caching: ephemeral cache_control on content blocks reduces input cost by 90% on cache reads.',
ai.ollama_embed('nomic-embed-text',
'Anthropic prompt caching: ephemeral cache_control on content blocks reduces input cost by 90% on cache reads.')::vector
);
The vectorizer pattern (ai.create_vectorizer) wraps this in a background worker that watches a base table and keeps an embedding table in sync. For the demo I do it inline at insert time because the corpus is 12 docs and the workflow is "ingest once, retrieve forever." For a real production corpus the vectorizer is the right answer.
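On the Python side, the inline path collapses to one parameterized statement per document. A sketch, assuming psycopg 3; the helper name is illustrative, not the repo's exact code:
def ingest(conn, doc_id: str, content: str) -> None:
    # pgai embeds the content inside Postgres, so Python never handles the vector itself.
    conn.execute(
        """
        INSERT INTO rag_docs (id, content, embedding)
        VALUES (%s, %s, ai.ollama_embed('nomic-embed-text', %s)::vector)
        ON CONFLICT (id) DO NOTHING
        """,
        (doc_id, content, content),
    )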
Retrieve with one SQL statement
SELECT id, content,
1 - (embedding <=> ai.ollama_embed('nomic-embed-text', $1)::vector) AS score
FROM rag_docs
ORDER BY embedding <=> ai.ollama_embed('nomic-embed-text', $1)::vector
LIMIT 3;
The <=> operator is cosine distance. 1 - cosine_distance gives you a cosine similarity score you can rank by. The retrieval and the embedding both happen inside Postgres in a single round trip.
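Wrapped in Python, this becomes the retrieve() helper used in the monitoring loop below, plus a pgai_embed() helper that fetches the live query's own vector for the Trace. A sketch, assuming psycopg 3; the names mirror the calls later in the post:
def retrieve(conn, query: str, k: int = 3):
    # Rows of (id, content, score); score = 1 - cosine distance, embedding done by pgai in-database.
    return conn.execute(
        """
        SELECT id, content,
               1 - (embedding <=> ai.ollama_embed('nomic-embed-text', %(q)s)::vector) AS score
        FROM rag_docs
        ORDER BY embedding <=> ai.ollama_embed('nomic-embed-text', %(q)s)::vector
        LIMIT %(k)s
        """,
        {"q": query, "k": k},
    ).fetchall()

def pgai_embed(conn, text: str) -> list[float]:
    # Embed the live query with the same embedder the corpus used, then parse the vector text.
    row = conn.execute(
        "SELECT ai.ollama_embed('nomic-embed-text', %s)::vector::text", (text,)
    ).fetchone()
    return [float(x) for x in row[0].strip("[]").split(",")]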
Generate with ai.ollama_generate
SELECT ai.ollama_generate('gemma2:9b',
    'Answer using only the context:' || E'\n' || $1 || E'\n\nQ: ' || $2 || E'\nA:'
)->>'response';
ai.ollama_generate returns a JSON object; ->>'response' extracts the generated text. Here $1 carries the retrieved context and $2 the user query. Wrap that with the standard "answer using only the context" RAG prompt and you have generation inside the same database round trip as retrieval.
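The same statement as a Python helper, matching the generate() call used below; a sketch, with the prompt wording and names illustrative rather than the repo's exact code:
def generate(conn, query: str, contexts: list[str], model: str = "gemma2:9b") -> str:
    # Build the grounded prompt in Python, run generation inside Postgres via pgai,
    # and pull the text out of the JSON that ai.ollama_generate returns.
    prompt = (
        "Answer using only the context:\n"
        + "\n\n".join(contexts)
        + f"\n\nQ: {query}\nA:"
    )
    return conn.execute(
        "SELECT ai.ollama_generate(%s, %s)->>'response'", (model, prompt)
    ).fetchone()[0]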
Wire it into ragvitals
The whole point of building this on open source is to trust but verify. Every retrieved row, every generation, every judge score becomes a Trace fed into a Detector:
from ragvitals import (
    Detector, EmbeddingDrift, InMemorySink, JudgeDrift,
    QueryDistribution, ResponseQuality, RetrievalRelevance, Trace,
)

# Reference windows: the ingested corpus embeddings and a baseline judge score.
q = QueryDistribution(); q.set_reference(corpus_embeddings)
e = EmbeddingDrift(); e.set_reference(corpus_embeddings)
j = JudgeDrift(score_key="faithfulness")
j.set_reference({f"ref-{i}": 0.85 for i in range(10)})

det = Detector(dimensions=[
    q,
    RetrievalRelevance(metric="hit_rate", k=3),
    e,
    ResponseQuality(score_keys=["faithfulness", "relevance"]),
    j,
], sinks=[InMemorySink()])

# retrieve(), generate(), judge(), pgai_embed(), and now() are the demo's thin
# psycopg helpers around the SQL shown above.
for query in production_queries:
    retrieved = retrieve(conn, query, k=3)  # pgai inline retrieval
    answer = generate(conn, query, [r[1] for r in retrieved], "gemma2:9b")
    scores = judge(conn, query, answer, "llama3.1:8b")
    det.ingest(Trace(
        timestamp=now(),
        query=query,
        query_embedding=pgai_embed(conn, query),
        retrieval_scores=[r[2] for r in retrieved],
        relevance_labels=[1] + [0] * (len(retrieved) - 1),  # demo heuristic: top hit counts as the relevant one
        response=answer,
        judge_scores=scores,
        metadata={"model": "gemma2:9b"},
    ))

print(det.report())
Run this once a day, commit the window, and you have a rolling baseline of what "normal" looks like. The next time you swap to llama3.1:8b as the generator, you will see ResponseQuality.faithfulness move and the other four dimensions stay flat. If you swap to a different embedder, QueryDistribution and EmbeddingDrift move and the others stay flat. The dimension that fires identifies the cause.
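The one helper not shown above is judge(). Here is one way to produce those scores with the same pgai path, sketched under the assumption that Llama 3.1 returns parseable JSON when asked; the prompt and parsing are illustrative, not the repo's exact code:
import json

def judge(conn, query: str, answer: str, model: str = "llama3.1:8b") -> dict:
    # Ask the judge model for scores in a fixed JSON shape, then parse defensively.
    prompt = (
        "Score the answer to the question on faithfulness and relevance, each 0.0 to 1.0. "
        'Reply with JSON only, e.g. {"faithfulness": 0.9, "relevance": 0.8}.\n\n'
        f"Q: {query}\nA: {answer}"
    )
    raw = conn.execute(
        "SELECT ai.ollama_generate(%s, %s)->>'response'", (model, prompt)
    ).fetchone()[0]
    start, end = raw.find("{"), raw.rfind("}") + 1  # tolerate chatter around the JSON
    return json.loads(raw[start:end])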
What this stack gets right
- Single round trip. The retrieval, the embedding of the live query, and the generation all happen inside Postgres via pgai. No fan-out, no marshaling between three services.
- Open weights end to end. Gemma 2 and Llama 3.1 ship open weights; Nomic Embed is open source. No API key, no rate limit, no per-token billing.
- Observability is a library, not a vendor. ragvitals writes to whatever sink you configure. The default is an in-memory ring; the demo also ships a JSONL sink and a CloudWatch sink behind an optional dependency.
- Everything is a SQL function. If your ops team can debug Postgres, they can debug this stack.
What this stack does NOT solve
- GPU economics. Running Gemma 2 9B locally takes a real GPU or at least an Apple-Silicon Mac with 32GB of RAM. For a production team this means a dedicated inference box, not the same machine running Postgres.
- Multi-tenant isolation. The demo uses a single rag_docs table. For multi-tenant SaaS you want row-level security or schema-per-tenant; pgvector indexes scale fine, but the policy layer is on you.
- Long-context retrieval. Top-3 cosine is the right default for short docs. If your corpus is structured (code, contracts, papers), a hybrid BM25 + vector approach beats pure cosine. Look at TigerData's pg_search or a separate Elasticsearch shard.
Reproduce
git clone https://github.com/MukundaKatta/ragvitals-gemma-demo
cd ragvitals-gemma-demo
pip install -e ".[pgai]"
# Start Postgres + pgai
docker run --rm -d -p 5432:5432 --name pgai \
-e POSTGRES_PASSWORD=postgres timescale/timescaledb-ha:pg17-all
# Start Ollama (separate terminal) + pull models
ollama pull gemma2:9b
ollama pull llama3.1:8b
ollama pull nomic-embed-text
# One-time ingest
python demo/pgai_ollama_run.py --setup
# Run 20 queries and print the drift report
python demo/pgai_ollama_run.py --run --n 20
Output looks like this:
=== ragvitals report ===
QueryDistribution ok value= 0.0312 n=20
RetrievalRelevance ok value= 0.8000 n=20
EmbeddingDrift ok value= 0.0312 n=20
ResponseQuality.faithfulness ok value= 0.7950 n=20
JudgeDrift ok value= n/a n=0
degraded: []
warned: []
Now swap the generator to llama3.1:8b, re-run, and compare. If your RetrievalRelevance shifts, your queries shifted. If only ResponseQuality shifts, you changed the generator and the rest of the pipeline is invariant, which is exactly what you wanted to confirm.
The libraries used
- ragvitals: the 5-dimensional drift detector.
- bedrockcache and bedrockstack: siblings of ragvitals for teams running Anthropic on AWS Bedrock instead of Ollama locally.
- pgai and pgvector: the database side, built and maintained by Tiger Data (formerly Timescale) and Andrew Kane respectively.
- Ollama: the runner.
Closing
The open-source AI stack is good enough for production RAG today. The piece most teams skip is monitoring, because it has historically meant adopting a platform. It does not have to. Five dimensions, a sink protocol, and a library you can read end to end is enough to catch the silent regressions before they become support tickets.
If you build something with this stack, comment with what your monitoring caught the first time you swapped a component. That is the highest-signal feedback there is.