DEV Community

Jovan Chan

Posted on • Originally published at aifoss.dev

pgvector vs Chroma vs Qdrant for Local RAG 2026

This article was originally published on aifoss.dev

TL;DR: If you already run PostgreSQL, pgvector (v0.8.2) is the right call — it's one SQL command away and handles 10M vectors comfortably. Chroma (1.5.9) exists for rapid prototyping and nothing else; it falls apart under concurrent load. Qdrant (v1.17.1) is the choice when you need native payload filtering, scalar quantization, or sub-20ms p95 latency at scale.

| | pgvector | Chroma | Qdrant |
|---|---|---|---|
| Best for | Existing Postgres stacks, <10M vectors | Prototypes, scripts, notebooks | Production RAG with filtering, >1M vectors |
| Setup cost | One CREATE EXTENSION if Postgres exists | pip install chromadb, 3 lines | One Docker command + client lib |
| The catch | Performance degrades above 50M vectors | Memory blowup, no sharding, prod risk | New infra to manage; overkill for small datasets |
| Scalar quantization | No (requires pgvectorscale add-on) | No | Yes (4× memory savings) |
| Filtered search | WHERE clause (post-filter, slower) | Python-side filtering | Native payload index (pre-filter, fast) |

Honest take: Use pgvector if you have Postgres. Use Qdrant if you're building anything that will see real users. Use Chroma only if you're experimenting in a notebook and know you'll replace it.


The Setup Reality

Every vector database tutorial starts with a pip install or docker run and immediately goes into embedding code. What they skip is the maintenance burden you're signing up for.

pgvector is a PostgreSQL extension. If you already operate Postgres for your application, adding pgvector means one SQL command and no new infrastructure. Your existing backup strategy, connection pooling, monitoring, and access control all carry over. If you don't have Postgres, you're now standing up a relational database just to use it as a vector store — which rarely makes sense.

Chroma is a Python library that ships a lightweight embedded server. Zero configuration, zero ports to open, data persists to disk via SQLite + HNSW files. For a developer who wants to test embedding strategies in an afternoon, it genuinely is the fastest path from idea to working code.

Qdrant is a standalone vector database written in Rust. It runs as a separate process (Docker is the recommended path), exposes REST and gRPC APIs, and requires a client library. That's one more moving part than pgvector and one more process to manage. In exchange, you get a purpose-built engine with features the other two simply don't have.


pgvector v0.8.2: Zero New Infrastructure

pgvector v0.8.2 was released in February 2026, patching CVE-2026-3172 (a buffer overflow during parallel HNSW index builds that could leak data or crash Postgres). If you're running an older version, upgrade before building any HNSW index in parallel.

Installation on an existing Postgres instance:

# In a shell, on Ubuntu/Debian with Postgres 16:
sudo apt install postgresql-16-pgvector

-- Then in psql:
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with a 1536-dim embedding column (OpenAI ada-002 / text-embedding-3-small)
CREATE TABLE documents (
    id     bigserial PRIMARY KEY,
    content text,
    embedding vector(1536)
);

-- Create HNSW index (recommended over IVFFlat for most workloads)
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

What to expect from CREATE INDEX: a plain CREATE INDEX blocks writes to the table until the build finishes, while CREATE INDEX CONCURRENTLY lets writes continue at the cost of a slower build. For 1M rows at 1536 dimensions, expect the index build to take 15–45 minutes on a mid-range server with maintenance_work_mem = 4GB.

The HNSW index memory trap: a 1M-row, 1536-dim HNSW index requires roughly 8–12GB of RAM to query efficiently. If your server doesn't have that much free memory, Postgres will page the index from disk and your query latency jumps from ~30ms to 300ms+. When memory is tight, lower hnsw.ef_search below its default of 40 (for example, SET LOCAL hnsw.ef_search = 20; inside a transaction) to trade recall for speed.
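A minimal sketch of how a per-transaction ef_search setting might be issued from application code. It only builds the SQL statements and the vector literal; the helper names (`to_vector_literal`, `knn_statements`) are hypothetical, and actually executing the statements would need a Postgres driver of your choice, which is not shown. The `documents` table and `embedding` column come from the schema above.

```python
from typing import List, Sequence


def to_vector_literal(vec: Sequence[float]) -> str:
    """Format a Python sequence in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def knn_statements(ef_search: int, k: int) -> List[str]:
    """Return the statements for one nearest-neighbour lookup.

    SET LOCAL only lasts until the end of the current transaction, so the
    setting is wrapped in BEGIN/COMMIT and does not leak to other queries.
    """
    if ef_search < 1 or k < 1:
        raise ValueError("ef_search and k must be positive")
    return [
        "BEGIN",
        f"SET LOCAL hnsw.ef_search = {int(ef_search)}",
        # %s placeholders are left for the driver to bind (assumed psycopg-style).
        "SELECT id, content FROM documents "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        "COMMIT",
    ]


print(to_vector_literal([0.1, 0.2, 0.3]))
for stmt in knn_statements(ef_search=20, k=5):
    print(stmt)
```

Keeping the setting inside a transaction means a low-recall, low-latency query path can coexist with default-quality queries on the same connection pool.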

When pgvector wins: your team already knows SQL, your data lives in Postgres, and you're not crossing the 50M vector threshold. Under that limit, HNSW in pgvector is competitive with purpose-built vector databases. The Timescale team's benchmarks with pgvectorscale (an optional add-on) show 471 QPS at 99% recall on 50M vectors — though vanilla pgvector without pgvectorscale is slower.

pgvector's hard limits: above 50M vectors, expect index builds to take 2+ hours and p95 query latency to drift above 200ms. There's no native scalar (INT8) quantization: the vector type stores every value as float32, and the half-precision halfvec type is the main built-in way to shrink it. Filtered similarity search (WHERE clause + <=> operator) filters the candidates that the HNSW scan returns rather than filtering first, which degrades recall significantly when your filter is selective.
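To see why post-filtering hurts when the filter is selective, here is a toy sketch in plain Python. It uses exact brute-force search over 1-D points instead of HNSW, and the data and the `author` field are made up for illustration; only the ordering of "filter" and "top-k" mirrors the behavior described above.

```python
from typing import Dict, List

# Hypothetical corpus: 100 points on a line; only every 25th belongs to "alice".
points: List[Dict] = [
    {"id": i, "x": float(i), "author": "alice" if i % 25 == 0 else "bob"}
    for i in range(100)
]
query_x = 10.0
k = 3


def top_k(candidates: List[Dict], k: int) -> List[Dict]:
    """Exact nearest neighbours by absolute distance to the query."""
    return sorted(candidates, key=lambda p: abs(p["x"] - query_x))[:k]


# Post-filter: take the k nearest first, then apply the WHERE-style predicate.
post = [p for p in top_k(points, k) if p["author"] == "alice"]

# Pre-filter: apply the predicate first, then take the k nearest survivors.
pre = top_k([p for p in points if p["author"] == "alice"], k)

print("post-filter ids:", [p["id"] for p in post])
print("pre-filter ids: ", [p["id"] for p in pre])
```

With this data the post-filter path returns nothing, while the pre-filter path returns three matches. A real HNSW scan examines more than k candidates before the filter is applied, so the effect is softer in practice, but the mechanism is the same.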


Chroma 1.5.9: The Prototype Machine

Chroma 1.5.9 (May 2026) is the fastest way to get a RAG pipeline working. Three lines of Python:

import chromadb

client = chromadb.PersistentClient(path="./chroma_data")
collection = client.get_or_create_collection("docs")

# Add documents (Chroma can call your embedding model or accept pre-computed vectors)
collection.add(
    documents=["Self-hosted AI runs on your hardware", "No data leaves your machine"],
    ids=["doc1", "doc2"]
)

# Query
results = collection.query(query_texts=["local inference"], n_results=5)
print(results["documents"])
# e.g. [['Self-hosted AI runs on your hardware', 'No data leaves your machine']]
# (only two documents exist, and their order depends on the embedding model)

That works. It persists to disk. You can add metadata filters. For a weekend project or internal tool under 100k documents, Chroma is genuinely fine.

The problems start when you move beyond that:

Memory: Chroma stores vectors as float32 with no native quantization option. A collection of 10 million 1536-dimension vectors occupies roughly 61GB of RAM, or about 57GiB (10M × 1536 × 4 bytes). That number isn't a gotcha; it's straightforward float math. Qdrant's INT8 scalar quantization brings the same dataset to ~15GB.

Single-process architecture: as of 1.5.9, Chroma has no sharding and no multi-node support. Concurrent queries compete for the same Python process. Community reports from production deployments describe memory leaks and crashes under sustained load. Chroma's own documentation recommends using the embedded mode for "development and testing."
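The memory figures above can be checked with a few lines of arithmetic. This sketch counts raw vector bytes only; index structures, metadata, and process overhead are ignored, so real usage will be higher. The function name is illustrative.

```python
def raw_vector_bytes(n_vectors: int, dims: int, bytes_per_value: int) -> int:
    """Bytes needed to hold the vector values alone (no index, no metadata)."""
    return n_vectors * dims * bytes_per_value


n, dims = 10_000_000, 1536
float32_bytes = raw_vector_bytes(n, dims, 4)  # float32 = 4 bytes per value
int8_bytes = raw_vector_bytes(n, dims, 1)     # int8 = 1 byte per value

for label, size in (("float32", float32_bytes), ("int8", int8_bytes)):
    print(f"{label}: {size / 1e9:.1f} GB = {size / 2**30:.1f} GiB")
```

The float32 total comes to about 61.4 GB in decimal units, or 57.2 GiB in binary units; the int8 total is exactly a quarter of that.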

When Chroma is right: notebooks, scripts, local experiments, RAG demos you're showing a colleague. The moment your app goes to more than one concurrent user, or your document count crosses 500k, reconsider. For something bigger, see AnythingLLM's multi-user RAG setup or migrate to Qdrant.


Qdrant v1.17.1: Production-First Design

Qdrant v1.17.1 (March 2026) is the most mature local vector database for production RAG. The core setup:

# Pull and run — data persists to ./qdrant_storage
docker run -d \
  --name qdrant \
  --restart unless-stopped \
  -p 6333:6333 \
  -p 6334:6334 \
  -v "$(pwd)/qdrant_storage:/qdrant/storage" \
  qdrant/qdrant:v1.17.1

Then from Python using the official client:


from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    # INT8 scalar quantization: ~4x memory savings, <1% recall loss
    quantization_config={
        "scalar": {"type": "int8", "always_ram": True}
    }
)

client.upsert(
    collection_name="docs",
    points=[
        PointStruct(
            id=1,
            vector=[0.1] * 1536,  # your actual embedding here
            payload={"source": "manual.pdf", "page": 3, "author": "alice"}
        )
    ]
)

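# Filtered search: only docs from alice, on page > 2
results = client.search(
    collection_name="docs",
    query_vector=[0.1] * 1536,
    query_filter={
        "must": [
            {"key": "author", "match": {"value": "alice"}},
            {"key": "page", "range": {"gt": 2}},
        ]
    },
    limit=5,  # the source is cut off after "lim"; 5 is an assumed value
)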
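For intuition about what 8-bit scalar quantization does, here is a minimal pure-Python sketch. It assumes a simple min/max linear mapping to 256 levels stored as unsigned bytes; this is a common textbook scheme, not a description of Qdrant's internal implementation, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_uint8(vec: List[float]) -> Tuple[bytes, float, float]:
    """Map each float to one of 256 levels between the vector's min and max."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant vectors
    codes = bytes(round((x - lo) / scale) for x in vec)
    return codes, lo, scale


def dequantize(codes: bytes, lo: float, scale: float) -> List[float]:
    """Recover approximate floats from the 1-byte codes."""
    return [lo + c * scale for c in codes]


vec = [-0.5, -0.1, 0.0, 0.25, 0.5]
codes, lo, scale = quantize_uint8(vec)
approx = dequantize(codes, lo, scale)

# Each value now takes 1 byte instead of 4 (float32), hence the ~4x saving.
print(len(codes), "bytes vs", 4 * len(vec), "bytes as float32")
# Rounding error per value is bounded by half a quantization step.
print(max(abs(a - b) for a, b in zip(vec, approx)) <= scale / 2 + 1e-12)
```

The trade is a small, bounded rounding error per dimension in exchange for a quarter of the storage, which is why the recall loss from this kind of quantization tends to be modest.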
