pgvector Distance Functions: Cosine vs L2 vs Inner Product
If you're using pgvector for similarity search, you've probably seen the three distance operators -- <=>, <->, and <#> -- and picked one based on whatever tutorial you followed first. That works until it doesn't. Using the wrong distance function produces subtly wrong results: your queries return rows, ordered by some number, but the ranking doesn't match what your application actually needs. And the most common mistake isn't even picking the wrong function -- it's building an index with one operator class while your queries use a different operator.
The Three Distance Functions
pgvector provides three ways to measure how similar two vectors are:
- <=> (cosine distance) measures the angle between vectors. It ignores magnitude entirely -- two vectors pointing in the same direction have cosine distance 0 regardless of length. Range: 0 to 2. This is the standard choice for normalized embeddings from modern LLM providers (OpenAI, Cohere, Voyage).
- <-> (L2, or Euclidean, distance) measures straight-line distance in vector space. It is sensitive to magnitude. Range: 0 to infinity. Use this when your data represents physical positions, sensor readings, or anything where "how far apart" matters.
- <#> (negative inner product) returns the negated dot product, so ORDER BY returns the highest similarity first. Sensitive to both angle and magnitude. Used primarily for maximum inner product search in recommendation systems.
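To make the three definitions concrete, here is a minimal sketch of what each operator computes, in plain Python with hypothetical two-dimensional vectors (pgvector does this in C over full embedding dimensions, but the math is the same):

```python
import math

def cosine_distance(a, b):            # <=> : 1 - cosine similarity, range 0..2
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def l2_distance(a, b):                # <-> : Euclidean distance, range 0..infinity
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neg_inner_product(a, b):          # <#> : negated dot product (lower = more similar)
    return -sum(x * y for x, y in zip(a, b))

a = [1.0, 0.0]
b = [2.0, 0.0]   # same direction as a, twice the magnitude

print(cosine_distance(a, b))    # 0.0  -- angle is identical, magnitude ignored
print(l2_distance(a, b))        # 1.0  -- magnitude difference shows up
print(neg_inner_product(a, b))  # -2.0 -- more negative = more similar
```

Note how the same pair of vectors is "identical" under cosine distance but not under L2 or inner product -- that difference is exactly why the choice of operator changes your ranking.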
The Operator/Index Mismatch Problem
This is the mistake that costs the most performance. pgvector indexes are built with a specific operator class:
- vector_cosine_ops accelerates <=>
- vector_l2_ops accelerates <->
- vector_ip_ops accelerates <#>
Each operator class only accelerates its corresponding operator. If your index uses vector_cosine_ops but your query uses <-> (L2), PostgreSQL cannot use the index. It falls back to a sequential scan, computing distances row by row. On a million-row table, that's 5ms vs 30 seconds.
No error is raised. The query works. It's just catastrophically slow.
Detecting the Mismatch
Check which operators your queries use and compare with your index definitions:
```sql
-- Find which distance operators your queries use
SELECT
  query,
  calls,
  mean_exec_time
FROM pg_stat_statements
WHERE query LIKE '%<->%'   -- L2 distance
   OR query LIKE '%<=>%'   -- cosine distance
   OR query LIKE '%<#>%'   -- inner product (negative)
ORDER BY calls DESC
LIMIT 20;
```
```sql
-- Check if index operator class matches query operator
SELECT indexname, indexdef
FROM pg_indexes
WHERE indexdef LIKE '%vector%'
ORDER BY indexname;
```
If your index shows vector_cosine_ops but your queries use <->, you have a mismatch.
Confirm with EXPLAIN:
```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM docs
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;
```
If the plan shows Seq Scan on a table that has a vector index, the operator doesn't match the index.
Fixing It
Ensure every distance operator has a matching index with the correct operator class:
```sql
-- Cosine distance (<=>): best for normalized embeddings (most LLMs)
CREATE INDEX idx_cosine ON docs USING hnsw (embedding vector_cosine_ops);
SELECT * FROM docs ORDER BY embedding <=> query_vec LIMIT 10;

-- L2 distance (<->): best for spatial/positional data
CREATE INDEX idx_l2 ON docs USING hnsw (embedding vector_l2_ops);
SELECT * FROM docs ORDER BY embedding <-> query_vec LIMIT 10;

-- Inner product (<#>): best for MaxIP retrieval (note: returns negative)
CREATE INDEX idx_ip ON docs USING hnsw (embedding vector_ip_ops);
SELECT * FROM docs ORDER BY embedding <#> query_vec LIMIT 10;
```
If you discover a mismatch, either change your queries to match the index or rebuild the index:
```sql
DROP INDEX idx_wrong_ops;
CREATE INDEX idx_correct_ops ON docs USING hnsw (embedding vector_cosine_ops);
```
When to Use Each
| Distance Function | Operator | Best For | Index Operator Class |
|---|---|---|---|
| Cosine | <=> | Normalized embeddings from LLMs | vector_cosine_ops |
| L2 (Euclidean) | <-> | Spatial data, sensor readings | vector_l2_ops |
| Inner Product | <#> | MaxIP search, recommendations | vector_ip_ops |
The simple rule: if you're using embeddings from a modern LLM and aren't sure, start with cosine distance (<=>). It works correctly whether or not vectors are normalized, and it's the ecosystem standard.
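The claim that cosine distance behaves the same for normalized and unnormalized vectors follows from its scale invariance, which a few lines of plain Python can illustrate (hypothetical vectors; the helper functions just restate the standard formulas):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [0.6, 0.8]
b = [0.8, 0.6]
scaled_b = [x * 10 for x in b]  # same direction, 10x the magnitude

# Scaling b does not change its cosine distance to a...
print(round(cosine_distance(a, b), 6) == round(cosine_distance(a, scaled_b), 6))  # True
# ...but it changes the L2 distance dramatically.
print(l2_distance(a, b) == l2_distance(a, scaled_b))                              # False
```

So if your embedding pipeline sometimes skips normalization, cosine distance still ranks consistently; L2 would silently reward or punish vectors by their length.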
Preventing Future Mismatches
Pick one distance function per column and standardize. Mixing operators on the same column means only one can be index-accelerated (unless you create multiple indexes, doubling storage).
Document your convention. The mismatch usually happens when a developer copies a query example from a blog post that uses a different operator.
Add an EXPLAIN check to your test suite for critical similarity queries. If the plan doesn't show an index scan, fail the test.
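The EXPLAIN check can be as simple as fetching the plan text through your database driver and failing the test when the planner fell back to a sequential scan. A minimal sketch of the assertion logic (the plan strings below are hypothetical examples of PostgreSQL EXPLAIN output, and the helper name is my own):

```python
def plan_uses_vector_index(plan_text: str) -> bool:
    """True if the plan scans an index rather than the whole table."""
    return "Index Scan" in plan_text and "Seq Scan" not in plan_text

# Hypothetical plan texts, as a driver would return them from EXPLAIN:
good_plan = "Limit -> Index Scan using idx_cosine on docs"
bad_plan = "Limit -> Sort -> Seq Scan on docs"

assert plan_uses_vector_index(good_plan)
assert not plan_uses_vector_index(bad_plan)
```

In a real test you would run `EXPLAIN SELECT ...` for each critical similarity query and feed the joined output rows to a check like this, so an operator/index mismatch fails CI instead of shipping.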
Verify operator/index alignment after every migration that touches vector columns or indexes.
```sql
-- Quick check: what operator should each index's queries use?
SELECT
  i.indexname,
  i.indexdef,
  CASE
    WHEN i.indexdef LIKE '%cosine_ops%' THEN '<=>'
    WHEN i.indexdef LIKE '%l2_ops%' THEN '<->'
    WHEN i.indexdef LIKE '%ip_ops%' THEN '<#>'
  END AS expected_operator
FROM pg_indexes i
WHERE i.indexdef LIKE '%vector%';
```
Run this after every deployment that touches vector infrastructure. If the expected operator doesn't match what your application queries use, fix it before it hits production.
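That comparison can itself be automated. A hedged sketch of a deploy-time check: given the expected operator per index (as returned by the pg_indexes query) and the SQL your application actually ships, flag any index whose operator never appears in the application source. The function name and inputs here are hypothetical:

```python
import re

def find_mismatches(expected_ops: dict, app_sql: str) -> list:
    """expected_ops maps index name -> operator; app_sql is your query source text."""
    used = set(re.findall(r"<=>|<->|<#>", app_sql))
    return [name for name, op in expected_ops.items() if op not in used]

expected = {"idx_cosine": "<=>"}
queries = "SELECT id FROM docs ORDER BY embedding <-> $1 LIMIT 10;"

print(find_mismatches(expected, queries))  # ['idx_cosine'] -- index is never used by app queries
```

A non-empty result means an index was built for an operator your queries never issue -- exactly the silent mismatch this post is about.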
Originally published at mydba.dev/blog/pgvector-distance-functions
Top comments (1)
The operator/index mismatch section is gold — this is the kind of silent performance bug that can lurk in production for months. "No error is raised. The query works. It's just catastrophically slow." Perfect summary of the problem.
We ran into similar distance function decisions building a persistent memory system for AI agents at Elyan Labs. We use SQLite with sqlite-vec (634+ memories, semantic retrieval) backing Claude Code sessions, and cosine similarity was the right default for our normalized embeddings. But we discovered that for "conversational memory" queries — where you want memories about a topic rather than memories matching exact phrasing — the distance metric choice matters more than expected. Cosine works great for finding near-duplicates but can miss semantically related content that sits at a wider angle in embedding space.
One thing worth mentioning for anyone reading: if you're using pgvector with HNSW indexes, the ef_search parameter also dramatically affects recall quality at a given distance function. We found that bumping ef_search from the default 40 to 200 improved our retrieval relevance more than switching distance functions did -- at the cost of ~3x query time (still sub-10ms for our corpus).

The EXPLAIN-in-test-suite suggestion is particularly smart. We should add that to our CI.