Ahmet Zeybek

Posted on Mar 7 • Edited on Jun 19 • Originally published at book.zeybek.dev

I Deleted Pinecone, Redis, and 400 Lines of Python. My RAG Pipeline Still Works.

#tutorial #postgres #ai #machinelearning

Last November I had a Pinecone bill for $70. For a side project. That nobody was using yet.

I sat there looking at my architecture and counted: Pinecone for vectors, Redis for caching, a FastAPI service hitting OpenAI's embedding endpoint, LangChain stitching it all together with 400-something lines of Python, and PostgreSQL — which had my actual data the whole time, just sitting there doing nothing interesting.

Five moving parts. I couldn't even run the thing locally without spinning up four Docker containers and praying they'd all connect. When Pinecone had that 20-minute outage in October, my "AI-powered" app just... died. The documents were in Postgres. The user was talking to Postgres. But the retrieval step went through San Francisco and back, so tough luck.

I spent a weekend ripping it all out. Here's what replaced it.

Before and after

Before:

User query → FastAPI → OpenAI (embed) → Pinecone (search) → Redis (cache check)
          → PostgreSQL (get docs) → OpenAI (generate) → Response

After:

User query → PostgreSQL → Response

Same results. Actually better results, because I could finally do hybrid search (more on that later). And the Pinecone bill went to $0.

The setup

Two extensions. That's the infrastructure change.

CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS ai;

pgvector handles vector storage and similarity search. pgai talks to Ollama (or OpenAI, if you prefer) for embeddings and LLM calls. Both install in seconds.

I keep seeing people treat this like some exotic hack. It's not. pgvector has been around since 2021 and Supabase, Neon, and every major Postgres cloud provider ships it by default now.

The table

Nothing clever here:

CREATE TABLE documents (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source      TEXT NOT NULL,
    title       TEXT NOT NULL,
    content     TEXT NOT NULL,
    chunk       TEXT NOT NULL,
    chunk_index INTEGER NOT NULL DEFAULT 0,
    embedding   vector(768),
    metadata    JSONB DEFAULT '{}',
    created_at  TIMESTAMPTZ DEFAULT now(),
    updated_at  TIMESTAMPTZ DEFAULT now(),

    UNIQUE(source, chunk_index)
);

The embedding column is a native pgvector type. 768 dimensions because I'm using nomic-embed-text, but you'd use 1536 for OpenAI's ada-002, 384 for all-MiniLM, whatever. The point is it lives right next to the text. No syncing. No "oh the embedding store has a different version of this document than the main database" nonsense.

Generating embeddings without leaving SQL

This was the part where I went "wait, really?"

UPDATE documents
SET embedding = ai.ollama_embed(
    'nomic-embed-text',
    chunk,
    host => 'http://ollama:11434'
)
WHERE embedding IS NULL;

One statement. Every row that's missing an embedding gets one. I don't need a Python script, I don't need a queue, I don't need to worry about batching or rate limits. Ollama runs locally, so it's as fast as my GPU allows.

For new inserts, a trigger handles it:

CREATE OR REPLACE FUNCTION generate_embedding()
RETURNS TRIGGER AS $$
BEGIN
    NEW.embedding := ai.ollama_embed(
        'nomic-embed-text',
        NEW.chunk,
        host => 'http://ollama:11434'
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_embed_on_insert
    BEFORE INSERT ON documents
    FOR EACH ROW
    WHEN (NEW.embedding IS NULL)
    EXECUTE FUNCTION generate_embedding();

INSERT a row, get an embedding. No application code involved.

Indexing (because seq scans don't scale)

pgvector without an index does a sequential scan. Fine for a few thousand rows. Painful at 100K+. Unusable at 1M.

CREATE INDEX idx_documents_hnsw
ON documents USING hnsw (
    embedding vector_cosine_ops
)
WITH (m = 16, ef_construction = 128);

m = 16 means each graph node connects to 16 neighbors. ef_construction = 128 is how hard the algorithm works during index creation. These aren't magic numbers — I tested a bunch of combinations and these gave the best recall/speed balance for my dataset. Your mileage will vary.

At query time:

SET hnsw.ef_search = 100;  -- trade speed for accuracy

Numbers from my M2 MacBook, no caching, cold queries:

Rows	Index build	Query (top-10)	Recall@10
100K	4.2s	1.8ms	0.98
500K	23s	2.4ms	0.97
1M	51s	3.1ms	0.96

Under 4ms for a million rows on a laptop. I'll take it.

Similarity search

SELECT
    title,
    chunk,
    1 - (embedding <=> ai.ollama_embed(
        'nomic-embed-text',
        'How do I implement vector search in PostgreSQL?',
        host => 'http://ollama:11434'
    )) AS similarity
FROM documents
ORDER BY embedding <=> ai.ollama_embed(
    'nomic-embed-text',
    'How do I implement vector search in PostgreSQL?',
    host => 'http://ollama:11434'
)
LIMIT 5;

<=> is cosine distance. Lower = more similar. That's your retrieval step.

But here's a thing that bothered me for weeks: pure vector search misses exact keyword matches. If someone types "pgvector HNSW ef_construction parameter", vector search might return results about "index tuning strategies" — semantically close, but missing the exact terms the user typed. Frustrating.

So I combined it with something Pinecone literally can't do.

Hybrid search with Reciprocal Rank Fusion

Postgres has had full-text search since... 2005? 2006? A long time. pgvector added vector search. Why not use both?

WITH semantic AS (
    SELECT id, chunk, title,
           ROW_NUMBER() OVER (
               ORDER BY embedding <=> ai.ollama_embed(
                   'nomic-embed-text', $1,
                   host => 'http://ollama:11434')
           ) AS rank_semantic
    FROM documents
    ORDER BY embedding <=> ai.ollama_embed(
        'nomic-embed-text', $1,
        host => 'http://ollama:11434')
    LIMIT 20
),
fulltext AS (
    SELECT id, chunk, title,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank(
                   to_tsvector('english', chunk),
                   plainto_tsquery('english', $1)
               ) DESC
           ) AS rank_fulltext
    FROM documents
    WHERE to_tsvector('english', chunk) @@
          plainto_tsquery('english', $1)
    LIMIT 20
)
SELECT
    COALESCE(s.id, f.id) AS id,
    COALESCE(s.title, f.title) AS title,
    COALESCE(s.chunk, f.chunk) AS chunk,
    COALESCE(1.0 / (60 + s.rank_semantic), 0) +
    COALESCE(1.0 / (60 + f.rank_fulltext), 0) AS rrf_score
FROM semantic s
FULL OUTER JOIN fulltext f ON s.id = f.id
ORDER BY rrf_score DESC
LIMIT 5;

RRF takes two ranked lists and merges them. The 60 constant is a smoothing factor from the original Microsoft paper — it stops either ranking from completely dominating. You can tune it, but 60 works well enough that I never bothered changing it.

The result: semantic "vibes" plus keyword precision. My retrieval quality went up noticeably after switching to this. Stuff that vector-only search missed started showing up.

The full RAG query

One SQL statement. Embeds the question, retrieves context, generates an answer.

SELECT ai.ollama_chat_complete(
    'llama3',
    jsonb_build_array(
        jsonb_build_object(
            'role', 'system',
            'content', 'Answer using only the context below. '
                || 'If the context does not contain the answer, say so. '
                || 'Cite the source document for each fact.'
                || E'\n\nContext:\n'
                || (
                    SELECT string_agg(
                        '--- [' || title || '] ---' || E'\n' || chunk,
                        E'\n\n'
                    )
                    FROM documents
                    ORDER BY embedding <=> ai.ollama_embed(
                        'nomic-embed-text',
                        'How do I choose between HNSW and IVFFlat?',
                        host => 'http://ollama:11434'
                    )
                    LIMIT 5
                )
        ),
        jsonb_build_object(
            'role', 'user',
            'content', 'How do I choose between HNSW and IVFFlat?'
        )
    ),
    host => 'http://ollama:11434'
);

I still find it slightly absurd that this works. A SQL query that does retrieval-augmented generation. No Python in sight. No orchestration framework. Just Postgres being Postgres.

Semantic caching (goodbye Redis)

This one replaced my Redis instance. Instead of caching by exact query match, cache by meaning:

CREATE TABLE semantic_cache (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    query      TEXT NOT NULL,
    embedding  vector(768) NOT NULL,
    response   TEXT NOT NULL,
    model      TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now(),
    hit_count  INTEGER DEFAULT 0
);

CREATE INDEX idx_cache_hnsw
ON semantic_cache USING hnsw (embedding vector_cosine_ops);

Cache lookup:

SELECT response, 1 - (embedding <=> $query_embedding) AS similarity
FROM semantic_cache
WHERE 1 - (embedding <=> $query_embedding) > 0.95
ORDER BY embedding <=> $query_embedding
LIMIT 1;

"How do I create an HNSW index?" and "What's the syntax for building HNSW indexes?" hit the same cache entry. With Redis you'd need to normalize queries, handle synonyms, build some fuzzy matching layer. Here it's a distance calculation. Done.

Performance

Honest numbers. No cherry-picking.

Embedding generation with Ollama (nomic-embed-text): about 15ms per text, ~800ms for a batch of 100. Comparable to OpenAI's API, minus the network round-trip and the rate limit dance.

Vector search with HNSW at 1M rows: 3ms for top-5. With hybrid search (RRF added), closer to 8ms. Pinecone's documented p99 is ~20ms, and that's after your request traveled to their servers and back.

LLM generation with Ollama (llama3 8B): first token around 200ms, full response in 1-4 seconds. About what you'd expect from local inference.

End-to-end for the full pipeline: 2-5 seconds. Not instant, but for a RAG app where the user is waiting for a generated answer anyway? Fine.

What actually matters: less stuff to break

The performance numbers are nice but they're not the real reason I did this. The real reason is the table below, which I think about every time I see a "modern AI stack" diagram with nine boxes connected by arrows.

	Multi-service	Postgres only
Things to monitor	4-5	1
Data sync issues	constant	none
Backup	4 systems	pg_dump
Access control	per-service IAM	GRANT / RLS
Monthly cost (side project)	$70-200	$0
Monthly cost (production)	$500-2000	$50-200
2 AM pages	probable	rare

When the vector store, cache, document store and application DB are the same process, a whole class of bugs just doesn't exist. Stale embeddings because someone updated a document but the sync job hadn't run yet? Can't happen — same table. Cache serving answers from a version of the docs that no longer exists? Can't happen — same database. "Embedding service is down but everything else is fine"? Can't happen — there's no embedding service.

When this is a bad idea

I'm not going to pretend this works for everyone.

If you need billions of vectors and your query budget is 5ms at p99 globally — yeah, you probably want a dedicated vector database with geo-distributed replicas. If you're already deep in the Kubernetes ecosystem and adding a service is a Helm chart away, the operational cost argument weakens. If you're doing something exotic like cross-modal search across images and text with custom distance functions, specialized tools might make more sense.

But most of the RAG projects I've seen (and the ones people describe on r/LocalLLaMA and r/PostgreSQL) are not that. They're internal search tools, customer support bots, documentation assistants, content recommendation systems. Stuff that serves hundreds or thousands of users, not millions. For that, this approach is boring, reliable, and cheap. I like all three of those words.

Running it yourself

Two containers. That's the whole deployment.

services:
  postgres:
    image: timescale/timescaledb-ha:pg17
    environment:
      POSTGRES_DB: ai_db
      POSTGRES_PASSWORD: postgres
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/home/postgres/pgdata/data

  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama

volumes:
  pgdata:
  ollama:

docker compose up -d
docker exec -it ollama ollama pull nomic-embed-text
docker exec -it ollama ollama pull llama3
psql -h localhost -U postgres -d ai_db -c "CREATE EXTENSION vector; CREATE EXTENSION ai;"

No cloud account. No API key. No credit card.

Where to go from here

This covers the basics but there's a lot I skipped: chunking strategies (fixed-size vs semantic vs recursive — they all have tradeoffs and the right one depends on your documents), embedding versioning (what happens when a better model comes out and you need to re-embed a million rows without downtime), evaluation (how do you even know if your RAG is returning good answers), multi-tenant setups with row-level security, real-time pipelines with LISTEN/NOTIFY and CDC.

I wrote a book about all of this. It's called PostgreSQL for AI and it's 13 chapters covering vector search, RAG, feature engineering, in-database ML, real-time pipelines, and production deployment. Every example runs on Docker Compose.

There's a free sample chapter if you want to see whether my writing style annoys you before spending money.

tl;dr — pgvector stores vectors, pgai generates embeddings and calls LLMs, hybrid search beats pure vector search, semantic caching replaces Redis, and one database instead of five services means fewer things break at 2 AM. That's the whole post.

DEV Community