Introduction
When we first wired up pgvector for semantic search, I assumed <=> (cosine distance) was just "the normal one." Then a colleague asked why our similarity scores were sometimes above 1.0, and I realized I'd been cargo-culting the operator without understanding what it actually returns.
The pgvector cosine distance operator <=> is probably the most used distance function in similarity search — and the most misunderstood. Here's what's actually happening under the hood, when to use it, and when to switch.
What <=> Returns
<=> returns cosine distance, not cosine similarity. The relationship is:
cosine_distance = 1 - cosine_similarity
So two identical vectors return 0 (not 1), and two orthogonal vectors return 1. Two vectors pointing in opposite directions return 2.
```sql
-- Identical vectors: distance = 0
SELECT '[1,0,0]'::vector <=> '[1,0,0]'::vector;
-- Returns: 0

-- Opposite vectors: distance = 2
SELECT '[1,0,0]'::vector <=> '[-1,0,0]'::vector;
-- Returns: 2
```
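You can sanity-check those values outside the database. This is a minimal numpy sketch of the same math pgvector performs, not pgvector itself:

```python
import numpy as np

def cosine_distance(a, b):
    # cosine_distance = 1 - cosine_similarity
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - sim

print(cosine_distance([1, 0, 0], [1, 0, 0]))   # identical -> 0.0
print(cosine_distance([1, 0, 0], [0, 1, 0]))   # orthogonal -> 1.0
print(cosine_distance([1, 0, 0], [-1, 0, 0]))  # opposite -> 2.0
```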
```sql
-- Find the 5 nearest neighbors by cosine distance
SELECT id, content, embedding <=> '[0.1, 0.3, ...]'::vector AS distance
FROM documents
ORDER BY embedding <=> '[0.1, 0.3, ...]'::vector
LIMIT 5;
```
This trips people up constantly. If you're used to thinking "higher = more similar," you need to flip your mental model. With <=>, lower scores are better.
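If your application layer expects "higher = more similar" scores, the flip is a one-liner. A hypothetical helper (the row data below is illustrative, not from the article's schema):

```python
def to_similarity(distance: float) -> float:
    # <=> returns cosine distance in [0, 2]; 1 - distance recovers
    # cosine similarity in [-1, 1], where higher means more similar.
    return 1.0 - distance

# Illustrative (id, distance) rows as a query might return them
rows = [("doc-a", 0.82), ("doc-b", 0.07)]
ranked = sorted(((doc, to_similarity(d)) for doc, d in rows),
                key=lambda pair: pair[1], reverse=True)
# ranked[0] is the closest document: "doc-b"
```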
Three Operators, Three Use Cases
pgvector exposes three distance operators:
| Operator | Metric | Best for |
|---|---|---|
| `<=>` | Cosine distance | NLP embeddings, semantic search |
| `<->` | L2 (Euclidean) distance | Image embeddings, dense numeric vectors |
| `<#>` | Negative inner product | Vectors that are already unit-normalized |
Use <=> when: your embeddings come from an NLP model (OpenAI, Cohere, Mistral, etc.) and you care about directional similarity rather than magnitude. Most language model embeddings encode meaning in the direction of the vector, not its length.
Use <-> when: magnitude matters. Image feature vectors or tabular embeddings where distance in absolute space is meaningful.
Use <#> when: your vectors are already L2-normalized (unit length). It returns the negative dot product, which is equivalent to cosine similarity for unit vectors — and it's faster because it skips the normalization step.
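The equivalence for unit vectors is easy to verify: after L2 normalization, cosine distance and negative inner product differ by a constant, so they rank candidates identically. A small numpy sketch with random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.normal(size=8)
docs = rng.normal(size=(5, 8))

# L2-normalize everything to unit length
query /= np.linalg.norm(query)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

cos_dist = 1.0 - docs @ query  # what <=> computes (for unit vectors)
neg_ip = -(docs @ query)       # what <#> computes

# The two scores differ by exactly 1.0, so the rankings agree
assert np.allclose(cos_dist - neg_ip, 1.0)
assert np.array_equal(np.argsort(cos_dist), np.argsort(neg_ip))
```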
The Normalization Shortcut
If you pre-normalize your embeddings before inserting them, <#> is strictly faster than <=> and produces equivalent ranking:
```python
# Normalize on insert
import numpy as np

def normalize(vec):
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

embedding = normalize(model.encode(text))
```

```sql
-- Then query with inner product
SELECT id, content, (embedding <#> query_vec) * -1 AS similarity
FROM documents
ORDER BY embedding <#> query_vec
LIMIT 10;
```
We benchmarked this on a 1M-row table with 1536-dim vectors. Pre-normalizing and using <#> cut query time by ~18% vs <=> on an IVFFlat index. Not huge, but free.
Index Compatibility
One thing that bites teams: your index operator class must match your query operator.
```sql
-- For cosine distance queries (<=>), use vector_cosine_ops
CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- For L2 queries (<->), use vector_l2_ops
CREATE INDEX ON documents USING ivfflat (embedding vector_l2_ops)
WITH (lists = 100);
```
If you create an ivfflat index with vector_l2_ops and then query with <=>, Postgres will do a full sequential scan. No error, no warning — just slow queries that silently skip the index. Run EXPLAIN and look for Index Scan vs Seq Scan to confirm your index is being used.
Picking lists and probes
lists is the number of clusters IVFFlat partitions your data into. A good starting point is sqrt(row_count). At query time, ivfflat.probes controls how many clusters to search:
```sql
-- Set probes at session level (or per query)
SET ivfflat.probes = 10;

SELECT id, embedding <=> '[...]'::vector AS dist
FROM documents
ORDER BY dist
LIMIT 5;
```
Higher probes = better recall, slower queries. For most semantic search workloads, probes = 10 gives 95%+ recall at reasonable speed. Test with your actual data.
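The sqrt(row_count) starting point above is easy to encode as a tiny helper. Note the floor of 10 is my own safety margin for small tables, not a pgvector rule:

```python
import math

def suggest_lists(row_count: int) -> int:
    # Heuristic from the text: lists ≈ sqrt(row_count).
    # The minimum of 10 is an arbitrary guard, not from pgvector docs.
    return max(10, round(math.sqrt(row_count)))

suggest_lists(1_000_000)  # -> 1000
```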
Wrapping Up
The pgvector cosine distance operator <=> is the right default for language model embeddings — but know what it returns (distance, not similarity), verify your index operator class matches your query operator, and consider pre-normalizing if you're chasing extra performance.
If you want pgvector running on managed Postgres without tuning kernel params or fighting storage I/O, Rivestack handles that — provisioned in 30 seconds with the pgvector extension pre-enabled.