Introduction
When we first wired up pgvector for semantic search, I assumed <=> (cosine distance) was just "the normal one." Then a colleague asked why our similarity scores were sometimes above 1.0, and I realized I'd been cargo-culting the operator without understanding what it actually returns.
The pgvector cosine distance operator <=> is probably the most used distance function in similarity search — and the most misunderstood. Here's what's actually happening under the hood, when to use it, and when to switch.
What <=> Returns
<=> returns cosine distance, not cosine similarity. The relationship is:
cosine_distance = 1 - cosine_similarity
So two identical vectors return 0 (not 1), and two orthogonal vectors return 1. Two vectors pointing in opposite directions return 2.
```sql
-- Identical vectors: distance = 0
SELECT '[1,0,0]'::vector <=> '[1,0,0]'::vector;
-- Returns: 0

-- Opposite vectors: distance = 2
SELECT '[1,0,0]'::vector <=> '[-1,0,0]'::vector;
-- Returns: 2
```
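You can sanity-check those values outside the database. This is a minimal numpy sketch of the same math pgvector performs, not pgvector itself:

```python
import numpy as np

def cosine_distance(a, b):
    # cosine_distance = 1 - cosine_similarity
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - sim

print(cosine_distance([1, 0, 0], [1, 0, 0]))   # identical -> 0.0
print(cosine_distance([1, 0, 0], [0, 1, 0]))   # orthogonal -> 1.0
print(cosine_distance([1, 0, 0], [-1, 0, 0]))  # opposite -> 2.0
```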
```sql
-- Find the 5 nearest neighbors by cosine distance
SELECT id, content, embedding <=> '[0.1, 0.3, ...]'::vector AS distance
FROM documents
ORDER BY embedding <=> '[0.1, 0.3, ...]'::vector
LIMIT 5;
```
This trips people up constantly. If you're used to thinking "higher = more similar," you need to flip your mental model. With <=>, lower scores are better.
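If your application layer expects "higher = more similar" scores, the flip is a one-liner. A hypothetical helper (the row data below is illustrative, not from the article's schema):

```python
def to_similarity(distance: float) -> float:
    # <=> returns cosine distance in [0, 2]; 1 - distance recovers
    # cosine similarity in [-1, 1], where higher means more similar.
    return 1.0 - distance

# Illustrative (id, distance) rows as a query might return them
rows = [("doc-a", 0.82), ("doc-b", 0.07)]
ranked = sorted(((doc, to_similarity(d)) for doc, d in rows),
                key=lambda pair: pair[1], reverse=True)
# ranked[0] is the closest document: "doc-b"
```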
Three Operators, Three Use Cases
pgvector exposes three distance operators:
| Operator | Metric | Best for |
|---|---|---|
| `<=>` | Cosine distance | NLP embeddings, semantic search |
| `<->` | L2 (Euclidean) distance | Image embeddings, dense numeric vectors |
| `<#>` | Negative inner product | Vectors that are already unit-normalized |
Use <=> when: your embeddings come from an NLP model (OpenAI, Cohere, Mistral, etc.) and you care about directional similarity rather than magnitude. Most language model embeddings encode meaning in the direction of the vector, not its length.
Use <-> when: magnitude matters. Image feature vectors or tabular embeddings where distance in absolute space is meaningful.
Use <#> when: your vectors are already L2-normalized (unit length). It returns the negative dot product, which is equivalent to cosine similarity for unit vectors — and it's faster because it skips the normalization step.
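The equivalence for unit vectors is easy to verify: after L2 normalization, cosine distance and negative inner product differ by a constant, so they rank candidates identically. A small numpy sketch with random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=8)
docs = rng.normal(size=(5, 8))

# L2-normalize everything to unit length
query /= np.linalg.norm(query)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

cos_dist = 1.0 - docs @ query  # what <=> computes (for unit vectors)
neg_ip = -(docs @ query)       # what <#> computes

# The two scores differ by exactly 1.0, so the rankings agree
assert np.allclose(cos_dist - neg_ip, 1.0)
assert np.array_equal(np.argsort(cos_dist), np.argsort(neg_ip))
```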
The Normalization Shortcut
If you pre-normalize your embeddings before inserting them, <#> is strictly faster than <=> and produces equivalent ranking:
```python
# Normalize on insert
import numpy as np

def normalize(vec):
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

embedding = normalize(model.encode(text))
```

```sql
-- Then query with inner product
SELECT id, content, (embedding <#> query_vec) * -1 AS similarity
FROM documents
ORDER BY embedding <#> query_vec
LIMIT 10;
```
We benchmarked this on a 1M-row table with 1536-dim vectors. Pre-normalizing and using <#> cut query time by ~18% vs <=> on an IVFFlat index. Not huge, but free.
Index Compatibility
One thing that bites teams: your index operator class must match your query operator.
```sql
-- For cosine distance queries (<=>), use vector_cosine_ops
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- For L2 queries (<->), use vector_l2_ops
CREATE INDEX ON documents USING ivfflat (embedding vector_l2_ops)
WITH (lists = 100);
```
If you create an ivfflat index with vector_l2_ops and then query with <=>, Postgres will do a full sequential scan. No error, no warning — just slow queries that silently skip the index. Run EXPLAIN and look for Index Scan vs Seq Scan to confirm your index is being used.
Picking lists and probes
lists is the number of clusters IVFFlat partitions your data into. A good starting point is sqrt(row_count). At query time, ivfflat.probes controls how many clusters to search:
```sql
-- Set probes at session level (or per query)
SET ivfflat.probes = 10;

SELECT id, embedding <=> '[...]'::vector AS dist
FROM documents
ORDER BY dist
LIMIT 5;
```
Higher probes = better recall, slower queries. For most semantic search workloads, probes = 10 gives 95%+ recall at reasonable speed. Test with your actual data.
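The sqrt(row_count) starting point above is easy to encode as a tiny helper. Note the floor of 10 is my own safety margin for small tables, not a pgvector rule:

```python
import math

def suggest_lists(row_count: int) -> int:
    # Heuristic from the text: lists ≈ sqrt(row_count).
    # The minimum of 10 is an arbitrary guard, not from pgvector docs.
    return max(10, round(math.sqrt(row_count)))

suggest_lists(1_000_000)  # -> 1000
```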
Wrapping Up
The pgvector cosine distance operator <=> is the right default for language model embeddings — but know what it returns (distance, not similarity), verify your index operator class matches your query operator, and consider pre-normalizing if you're chasing extra performance.
If you want pgvector running on managed Postgres without tuning kernel params or fighting storage I/O, Rivestack handles that — provisioned in 30 seconds with the pgvector extension pre-enabled.