Gabriel Anhaia

Posted on May 24

Vector DB Choice 2026: pgvector vs Qdrant vs Pinecone vs Weaviate: The Real Trade Matrix

#rag #ai #vectordatabase #architecture

Picking a vector DB in 2026 is harder than it was in 2024, not easier. The four serious contenders all matured. Their cost shapes diverged. Their operational stories diverged more.

You can't read a benchmark blog post and pick a winner anymore. The honest answer depends on your corpus size, your filter shape, who runs your infra, and what your CFO will sign for next year. Here's the matrix that actually survives that conversation.

Why the "best vector DB" question has the wrong frame

The benchmark posts keep asking the wrong thing. "Which DB has the lowest p95 latency at 10M vectors?" is a question with no consequence. At 10M vectors all four are fast enough. The questions that matter look more like:

What happens to your bill when traffic jumps 5× on a Tuesday?
Can you do WHERE tenant_id = ? AND created_at > ? without killing recall?
If your vendor doubles prices in 2027, how many engineer-weeks does migration cost?
Who pages at 3am when the index corrupts?

Those questions sort the four candidates into very different buckets. Let's go through them.

pgvector: your existing Postgres, the cheap default, the recall trap at 50M+ rows

If you already run Postgres and you're under ~10M chunks, stop reading and use pgvector. The math is brutal: zero new infra, your existing backups cover the vectors, your existing observability covers the queries, your existing access-control story covers the data. You don't add a service to the on-call rotation. That alone is worth giving up a few points of recall.

The trap is what happens when you grow past 10M and start tuning. pgvector ships two index types and they behave nothing like each other.

IVFFlat clusters vectors into lists and, at query time, searches only the probes clusters nearest the query. It builds fast, it queries fast, and its recall is mediocre unless you tune both knobs (lists and probes) against your data. Worse, you have to rebuild the index when your data distribution shifts: the centroids are computed once at build time, so new data drifts away from them and recall degrades.

HNSW builds a navigable graph. Recall is much higher out of the box, queries are faster at high recall, and you don't have to rebuild when data shifts. The price is build time (slow) and memory (the whole graph lives in RAM if you want it fast).

-- IVFFlat: cheap to build, mediocre recall, needs probes tuning
CREATE INDEX chunks_embedding_ivfflat
  ON chunks USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 1000);  -- rule of thumb: rows / 1000 for <1M, sqrt(rows) above

-- query side
SET ivfflat.probes = 10;  -- higher = better recall, slower

-- HNSW: slow to build, high recall, memory hungry
CREATE INDEX chunks_embedding_hnsw
  ON chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

SET hnsw.ef_search = 40;  -- higher = better recall, slower
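
If you do go with IVFFlat, the rule of thumb from the SQL comment above can be written down as a tiny helper. This is a sketch of starting points only: the lists rule is the one quoted above, and the probes = sqrt(lists) starting value is the commonly cited pgvector guidance. The function name is mine, and you still have to tune both numbers against your own recall measurements.

```python
import math


def ivfflat_starting_params(rows: int) -> tuple[int, int]:
    """Return (lists, probes) starting points for an IVFFlat index.

    lists: rows / 1000 up to 1M rows, sqrt(rows) above that.
    probes: sqrt(lists), a starting value to tune, not a guarantee.
    """
    if rows <= 1_000_000:
        lists = max(1, rows // 1000)
    else:
        lists = round(math.sqrt(rows))
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes


for rows in (200_000, 1_000_000, 50_000_000):
    lists, probes = ivfflat_starting_params(rows)
    print(f"{rows:>11,} rows -> lists={lists}, probes={probes}")
```

Treat the output as the first point on a recall/latency curve, then move probes up until recall stops improving for your queries.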

The 50M+ trap: people pick IVFFlat because the docs make it sound symmetric with HNSW, and at 50M vectors recall craters unless probes goes high enough that p95 latency triples. Then they blame "pgvector doesn't scale" when the real answer is "IVFFlat doesn't scale, switch to HNSW or shard the table."

Default for new pgvector deployments in 2026: HNSW with m=16, ef_construction=64, tune ef_search per query class. Use IVFFlat only if your RAM budget can't hold the HNSW graph.
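
To check whether your RAM budget can hold the HNSW graph, a napkin estimate is enough. The sketch below assumes float32 vectors, up to 2*m links per vector on the base layer, and 8 bytes per link. The link size and the decision to ignore upper layers and page overhead are my simplifications, so read the result as a rough lower bound, not a pgvector-reported number.

```python
def hnsw_memory_gib(
    rows: int,
    dims: int,
    m: int = 16,
    bytes_per_float: int = 4,   # float32; use 2 for half-precision storage
    bytes_per_link: int = 8,    # ASSUMPTION: rough size of one neighbor reference
) -> float:
    """Rough lower bound on HNSW index size in GiB.

    Counts raw vector bytes plus base-layer neighbor links (2 * m per vector).
    Upper layers and storage overhead are ignored on purpose.
    """
    vector_bytes = rows * dims * bytes_per_float
    link_bytes = rows * 2 * m * bytes_per_link
    return (vector_bytes + link_bytes) / 2**30


print(f"10M x 1536 dims: {hnsw_memory_gib(10_000_000, 1536):.0f} GiB")
print(f"50M x 1536 dims: {hnsw_memory_gib(50_000_000, 1536):.0f} GiB")
```

At 50M vectors of 1536 dimensions the estimate lands near 300 GiB, almost all of it raw vector data. That is why the 50M mark is where the RAM conversation starts.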

Qdrant: open source, self-host or cloud, best-in-class filtering

Qdrant earns its slot because of one thing: payload filtering that doesn't destroy recall. The naive options are pre-filtering and post-filtering, and both have failure modes. Pre-filter with a selective condition and the HNSW graph degenerates, because most of the edges it would walk lead to excluded points. Post-filter and you lose recall when the filter is selective, because most of the top-k gets thrown away after the search.

Qdrant indexes payload fields and integrates them into the HNSW traversal. You can filter on tenant, on date range, on tags, on numeric ranges, and the index does the right thing.
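
The post-filter failure mode is easy to reproduce on a toy corpus. The sketch below uses brute-force dot-product search over six made-up points, so it says nothing about Qdrant's internals. It only shows why "search first, filter after" comes back short when the filter is selective.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy corpus: the three best matches for the query belong to another tenant.
points = [
    {"id": 1, "tenant": "other", "vec": (1.0, 0.0)},
    {"id": 2, "tenant": "other", "vec": (0.9, 0.1)},
    {"id": 3, "tenant": "other", "vec": (0.8, 0.2)},
    {"id": 4, "tenant": "acme", "vec": (0.5, 0.5)},
    {"id": 5, "tenant": "acme", "vec": (0.4, 0.6)},
    {"id": 6, "tenant": "acme", "vec": (0.0, 1.0)},
]
query = (1.0, 0.0)


def top_k(candidates, q, k):
    return sorted(candidates, key=lambda p: dot(p["vec"], q), reverse=True)[:k]


def post_filter(points, q, tenant, k):
    """Search everything, then drop rows that fail the filter."""
    return [p["id"] for p in top_k(points, q, k) if p["tenant"] == tenant]


def filter_then_search(points, q, tenant, k):
    """Exact search restricted to rows that pass the filter."""
    allowed = [p for p in points if p["tenant"] == tenant]
    return [p["id"] for p in top_k(allowed, q, k)]


print("post-filter:       ", post_filter(points, query, "acme", k=3))
print("filter-then-search:", filter_then_search(points, query, "acme", k=3))
```

The post-filter call asks for three results and returns none, because the global top three all belong to the other tenant. Exact filter-then-search gets it right here only because it is brute force. Doing the same thing cheaply on a graph index is the hard part.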

from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(url="http://localhost:6333")

# create collection with payload index on tenant_id and created_at
client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
)
client.create_payload_index("docs", "tenant_id", models.PayloadSchemaType.KEYWORD)
client.create_payload_index("docs", "created_at", models.PayloadSchemaType.INTEGER)

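# `embedding` is the query vector (1536 floats) produced by your embedding model
# query: the filter is applied during the index walk, not as a post-filter
# (query_points replaces the older search() call, deprecated in recent clients)
hits = client.query_points(
    collection_name="docs",
    query=embedding,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="tenant_id",
                match=models.MatchValue(value="acme-corp"),
            ),
            models.FieldCondition(
                key="created_at",
                range=models.Range(gte=1735689600),  # created on or after 2025-01-01 UTC
            ),
        ]
    ),
    limit=10,
).points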

The cloud tier is reasonable. The self-host story is genuinely production-ready: a single binary, clear cluster docs, snapshot/restore that works. If you have a platform team that's comfortable running stateful services, self-host. If you don't, Qdrant Cloud's per-node pricing is predictable in a way Pinecone's serverless isn't.

The gotcha: Qdrant's quantization (scalar, binary, product) is excellent for memory but you must benchmark recall on your own data before flipping it on in prod. Binary quantization can cut RAM 32× but for some embedding models recall drops a few points. That's enough to matter for legal or medical retrieval, fine for product search.
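
Benchmarking recall before you flip quantization on takes less code than people expect. The sketch below is a toy: it quantizes random Gaussian vectors to one bit per dimension (the sign), ranks by Hamming distance, and compares against exact dot-product search. The corpus is synthetic and the helper names are mine. Swap in your real embeddings and real queries, because recall on random data tells you nothing about your model.

```python
import random


def binarize(vec):
    """Pack one bit per dimension into an int: 1 if the component is > 0."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k that the approximate top-k recovered."""
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)


random.seed(7)
dims, n, k = 64, 500, 10
corpus = [[random.gauss(0, 1) for _ in range(dims)] for _ in range(n)]
codes = [binarize(v) for v in corpus]
query = [random.gauss(0, 1) for _ in range(dims)]
query_code = binarize(query)

exact = sorted(range(n), key=lambda i: dot(corpus[i], query), reverse=True)[:k]
approx = sorted(range(n), key=lambda i: hamming(codes[i], query_code))[:k]

print(f"recall@{k} with 1-bit codes: {recall_at_k(exact, approx):.2f}")
print(f"bits per dimension: 32 -> 1, so {32 // 1}x less memory for the vectors")
```

Production setups usually fetch extra candidates with the quantized codes and rescore them against the original vectors, so a raw number from a toy like this is pessimistic. The point is the harness: exact top-k as ground truth, approximate top-k as the candidate, recall@k as the metric you watch.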

Pinecone: managed, the serverless tier changed the cost math in 2025

Pinecone's serverless launch in 2024 reshaped the conversation. Before, Pinecone was "managed but you pay for pod-hours whether you query or not." After, Pinecone is "you pay per million read units and per GB stored, and an idle index costs only its storage."

For variable load that's a different product. A B2B app with 200 customers and bursty usage no longer pays for 24/7 pods. A weekly batch retrieval pipeline pays for the four hours it runs.

The pricing model has three knobs:

Storage: ~$0.33/GB/month for serverless indexes
Read units: charged per query, scales with index size and top_k
Write units: charged per upsert/delete

The trap is that read units scale with index size. At 10M vectors you don't notice. At 500M vectors with high QPS, the read-unit bill can exceed what a self-hosted Qdrant cluster would cost on the same load. As a rough guide, the crossover sits around 100M vectors under sustained QPS: past that point the math flips and pod-based pricing (or self-host) wins.

Run the napkin math before you commit to serverless at scale:

monthly_cost ≈ storage_gb * 0.33
            + (queries_per_month * read_units_per_query * read_unit_price)
            + (writes_per_month * write_units_per_write * write_unit_price)

read_units_per_query is documented per index size tier. Pull your projected QPS, your projected index size in 12 months, and your projected growth. Then compare against three quotes: Qdrant Cloud, self-hosted Qdrant on managed Kubernetes, and Pinecone serverless. The cheapest answer depends on the shape of your traffic, not the brand.
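
Here is the napkin math as a function you can drop into a notebook. Only the $0.33/GB storage figure comes from the list above. The read and write unit prices in the sketch are placeholders I made up so the code runs, so replace them with the numbers on your own quote.

```python
from dataclasses import dataclass


@dataclass
class ServerlessPrices:
    storage_per_gb_month: float = 0.33          # storage figure quoted above
    read_unit_price: float = 10.0 / 1_000_000   # PLACEHOLDER, not a real price
    write_unit_price: float = 2.0 / 1_000_000   # PLACEHOLDER, not a real price


def monthly_cost(
    storage_gb: float,
    queries_per_month: float,
    read_units_per_query: float,
    writes_per_month: float,
    write_units_per_write: float,
    prices: ServerlessPrices = ServerlessPrices(),
) -> float:
    """Storage + read units + write units, in dollars per month."""
    storage = storage_gb * prices.storage_per_gb_month
    reads = queries_per_month * read_units_per_query * prices.read_unit_price
    writes = writes_per_month * write_units_per_write * prices.write_unit_price
    return storage + reads + writes


# Same traffic, index growing 10x: read units per query grow with it.
small = monthly_cost(100, 1_000_000, 5, 2_000_000, 1)
large = monthly_cost(1_000, 1_000_000, 50, 2_000_000, 1)
print(f"small index: ${small:,.2f}/month")
print(f"large index: ${large:,.2f}/month")
```

Run it for today's index and for the index you expect in 12 months. If the second number is the one that scares you, that's the number to compare against the Qdrant quotes.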

The other Pinecone reality: lock-in is real. The API surface is proprietary, the filtering syntax is proprietary, the metadata model is proprietary. Migrating off Pinecone is a 2-4 week project for any non-trivial app. Price that into your decision.

Weaviate: open source, hybrid retrieval built in, the GraphQL surface decision

Weaviate's pitch is hybrid retrieval as a first-class concept: vector + BM25 + filters fused at query time, with the alpha parameter controlling the mix. If your retrieval task benefits from keyword + semantic (most do, since exact product codes, error messages, and names of people all matter), Weaviate gives you the fused result without bolting two systems together.

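from weaviate.classes.query import HybridFusion, MetadataQuery

# `collection` is a handle from a connected v4 client,
# e.g. client.collections.get("Docs") ("Docs" is a placeholder name)
response = collection.query.hybrid(
    query="postgres connection pool exhausted",
    alpha=0.6,  # 0=pure BM25, 1=pure vector, 0.6 leans semantic
    fusion_type=HybridFusion.RELATIVE_SCORE,
    limit=10,
    return_metadata=MetadataQuery(score=True, explain_score=True),
)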

The hybrid score with explain_score is genuinely useful for debugging "why did this chunk rank where it did." You see the BM25 contribution and the vector contribution separately. That's hard to get from a pgvector + Elasticsearch sidecar setup without writing it yourself.
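
If you want intuition for what alpha does, the fusion step is small enough to sketch by hand. The version below min-max normalizes each score set to 0..1 and takes an alpha-weighted sum, which is the general idea behind relative-score fusion. It is my simplified reading, not Weaviate's implementation, and the scores are invented.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to 0..1 so BM25 and vector scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def relative_score_fusion(bm25, vector, alpha):
    """alpha=0 is pure BM25, alpha=1 is pure vector (same convention as above)."""
    norm_bm25, norm_vec = min_max(bm25), min_max(vector)
    doc_ids = set(norm_bm25) | set(norm_vec)
    fused = {
        doc_id: alpha * norm_vec.get(doc_id, 0.0)
        + (1 - alpha) * norm_bm25.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


bm25_scores = {"a": 10.0, "b": 0.0, "c": 5.0}       # keyword hits
vector_scores = {"a": 0.5, "b": 1.0, "d": 0.75}     # semantic hits

for doc_id, score in relative_score_fusion(bm25_scores, vector_scores, alpha=0.6):
    print(f"{doc_id}: {score:.2f}")
```

Document "a" wins on keywords and "b" wins on semantics, so sliding alpha from 0 to 1 flips which one ranks first. That flip is what you are tuning when you pick alpha per query class.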

The decision around Weaviate is its surface area. It exposes both GraphQL and REST. The GraphQL surface is expressive and verbose; the REST surface is newer and cleaner. Some teams find GraphQL overkill for a vector DB. Others love it for the type-safety it enables in clients. Try both surfaces before committing your codebase to one.

Self-host story is solid. Weaviate Cloud is reasonably priced. The community is active. The thing to watch is module compatibility: Weaviate has a lot of plug-in modules (text2vec-openai, generative-cohere, etc.) and version-skew between modules and the core has bit teams during upgrades. Pin module versions.

The 6-row trade matrix

| Dimension | pgvector | Qdrant | Pinecone | Weaviate |
| --- | --- | --- | --- | --- |
| Cost shape | Free (you already pay for Postgres) | Self-host: infra only · Cloud: per-node | Serverless: pay-per-query · Pod: pay-per-hour | Self-host: infra · Cloud: per-resource |
| Latency at 10M vectors | 20-80ms (HNSW) | 10-30ms | 30-80ms (serverless cold) | 15-40ms |
| Scale ceiling | ~50-100M with HNSW + tuning | 1B+ with sharding | 1B+ (cost scales linearly) | 500M+ with sharding |
| Filtering | SQL `WHERE`, full power, watch recall on pre/post filter | Best-in-class, integrated with HNSW walk | Good but proprietary metadata model | Good, integrated with hybrid |
| Hybrid (vector + BM25) | DIY (combine with `tsvector`) | Sparse vectors supported, manual fusion | Sparse vectors supported, manual fusion | First-class, single API call |
| Lock-in | Zero, it's just Postgres | Low: open source, standard distance metrics | High: proprietary API and metadata | Low-medium: open source, GraphQL surface |

The matrix doesn't pick a winner because there isn't one. It tells you which dimensions to weigh.

The decision tree by scale

Under 10M chunks: pgvector. The cost of adding a vector DB service is not worth the marginal latency improvement. Use HNSW. Move on.

10M to 100M chunks: Real fork. Three reasonable answers:

Stay on pgvector if your team's center of gravity is Postgres, your filtering is SQL-shaped, and you can throw RAM at the HNSW index. Plenty of teams run pgvector at 50-80M with care.
Move to Qdrant if filtering is your main concern, you have platform engineers, and you want a single binary you can reason about.
Move to Pinecone serverless if your load is bursty, you have no platform engineers, and your finance team prefers usage-based pricing to fixed infra.

100M+ chunks: The serverless pricing curve gets steep. Self-hosted Qdrant or Weaviate (or a pod-based Pinecone tier with reserved pricing) usually wins. At this scale you also need to think about sharding strategy, replication, snapshot/restore, and cross-region. That's real infra work regardless of vendor.
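
If it helps to argue about the tree in a design review, here it is as a function. This is a deliberately crude encoding of the three bands above. The flag names and the fallback message are mine, and the 100M+ branch collapses "it depends" into one split on whether you have platform engineers.

```python
def recommend(
    chunks: int,
    *,
    postgres_shop: bool,
    filter_heavy: bool,
    bursty_load: bool,
    has_platform_team: bool,
) -> list[str]:
    """Return candidate vector DBs for a corpus size and team shape."""
    if chunks < 10_000_000:
        return ["pgvector (HNSW)"]

    if chunks < 100_000_000:
        options = []
        if postgres_shop:
            options.append("pgvector (HNSW, budget RAM for the graph)")
        if filter_heavy and has_platform_team:
            options.append("Qdrant")
        if bursty_load and not has_platform_team:
            options.append("Pinecone serverless")
        # Fallback wording is mine: none of the three profiles matched.
        return options or ["no clear fit: benchmark the candidates on your data"]

    # 100M+: serverless pricing gets steep, so fixed-cost options lead.
    if has_platform_team:
        return ["self-hosted Qdrant", "self-hosted Weaviate"]
    return ["pod-based Pinecone with reserved pricing"]


print(recommend(40_000_000, postgres_shop=True, filter_heavy=True,
                bursty_load=False, has_platform_team=True))
```

More than one answer coming back is the expected case in the middle band. The function narrows the shortlist, and the cost math and a recall benchmark pick the winner.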

The gotcha: migration cost between vector DBs is real

The biggest mistake teams make in 2026 is not the choice itself. It's not planning for the migration.

You will migrate at least once in the lifetime of a serious RAG app. The vendor pricing will change. The embedding model will change. A new DB will appear that's measurably better for your shape. You will need to move.

The migration tax is dominated by two things:

Re-embedding if you switch embedding models mid-flight (you usually do, because vendor model choices drift).
Re-indexing the source corpus into the new DB's API surface, metadata model, and ID scheme.

You can't avoid the work, but you can avoid doing it twice. The pattern that pays for itself: store your canonical embeddings in object storage (S3, R2, GCS) in a portable format, separate from the vector DB. Treat the vector DB as a derived index, not the source of truth.

Canonical record on S3, survives any vendor change (the embedding is truncated here; real records hold 3072 floats):
{
  "chunk_id": "doc_4821_chunk_07",
  "tenant_id": "acme-corp",
  "source_doc_id": "doc_4821",
  "text": "The connection pool exhausted error usually...",
  "embedding_model": "text-embedding-3-large",
  "embedding_version": "2025-01",
  "embedding": [0.0123, -0.0456],
  "metadata": {
    "created_at": 1735689600,
    "url": "https://docs.example.com/troubleshooting",
    "tags": ["postgres", "production"]
  }
}

Write a small loader per vector DB. The Pinecone loader reads from S3 and upserts. The Qdrant loader reads from S3 and upserts. The pgvector loader reads from S3 and COPYs. Migration becomes "run the new loader against the corpus" instead of "export from old vendor in their proprietary format, transform, reimport."

The S3-portability pattern also fixes another problem: when you change embedding models, you re-embed from text and overwrite the embedding field on S3, then re-run the loader. The vector DB is downstream of the truth, not the truth itself.
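
The loader side of this pattern stays small if every vendor hides behind the same one-method interface. The sketch below assumes the canonical records are stored as JSON Lines (one record per line), which is my choice of portable format, and uses an in-memory sink as a stand-in for a real Pinecone, Qdrant, or pgvector client. It also skips records embedded with a different model, so a half-finished re-embedding run can't mix vector spaces in one index.

```python
import json
from typing import Iterable, Iterator, Protocol


class VectorSink(Protocol):
    """Anything that can take a batch of canonical records."""

    def upsert(self, records: list[dict]) -> None: ...


class InMemorySink:
    """Stand-in for a real vector DB client, keyed by chunk_id."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def upsert(self, records: list[dict]) -> None:
        for record in records:
            self.rows[record["chunk_id"]] = record


def read_canonical(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON Lines, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def load(lines: Iterable[str], sink: VectorSink,
         expected_model: str, batch_size: int = 100) -> int:
    """Upsert every record embedded with expected_model; return the count."""
    batch: list[dict] = []
    loaded = 0
    for record in read_canonical(lines):
        if record["embedding_model"] != expected_model:
            continue  # stale embedding: re-embed from text before loading
        batch.append(record)
        if len(batch) >= batch_size:
            sink.upsert(batch)
            loaded += len(batch)
            batch = []
    if batch:
        sink.upsert(batch)
        loaded += len(batch)
    return loaded
```

In a real pipeline the lines would come from an object-storage download and each vendor would get its own sink class with the same upsert method. Switching vendors then means writing one new class and re-running load.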

This is the single architecture decision that pays back the most over a 3-year horizon, regardless of which DB you pick today.

What to do this week

Don't pick a vector DB based on a feature matrix you read on Reddit. Pick one based on:

Where your corpus is in the decision tree above.
Whether your filtering needs are "tenant_id and date range" (most apps) or "complex multi-field with high cardinality" (Qdrant territory).
Whether your load is steady (pod-based wins) or bursty (serverless wins).
Whether you have platform engineers (self-host options open up) or not (managed only).

Then build the S3 portability layer on day one. Future-you will send a thank-you card when you switch vendors in 2028.

What's running your vector workload today, and what's the one thing about it you'd change if you started over? Drop it in the comments.

If this was useful

Vector DB choice is one chapter of a bigger story: how the whole RAG stack (chunking, embeddings, retrieval, reranking, the index layer) fits together under production load. The RAG Pocket Guide walks through that stack end to end, with the retrieval and reranking chapters going deep on the trade-offs above and the failure modes you only see at scale.