Franck Pachot

Posted on Jun 22 • Edited on Jun 26

Vector Search with Filters: pgvector vs DiskANN on HorizonDB

#postgres #sql #vector #database

In a previous post, I explained that using filtering with pgvector can decrease recall in approximate nearest neighbor (ANN) searches.

This post repeats the same experiment and dataset but compares two methods:

pgvector with HNSW, a popular PostgreSQL extension
HorizonDB with DiskANN, Microsoft’s vector index

The goal is to understand what happens when similarity search is combined with filtering.

Both setups were tested on a HorizonDB instance with 2 vCores and 16 GiB RAM in Azure public preview, where I activated the extensions by adding them to azure.extensions.

In real applications, SQL statements without a WHERE clause are rare, and typical queries combine a filter alongside a similarity search:

SELECT *
 FROM embeddings
 WHERE tenant_id = 42
 ORDER BY embedding <=> :vector_query
 LIMIT 10;

You don’t search the entire dataset. Instead, you focus on a specific subset, such as one tenant, recent data, or a category.

Experiment Setup

As in the previous post, I inserted 2 million rows of 512‑dimensional vectors with a simple metadata column (color, evenly distributed):

CREATE TABLE embeddings_table (
  id bigserial PRIMARY KEY,
  color text,
  embedding vector(512)
);

I generated random vectors to insert the embeddings, and three values for color:


create function random_embedding(dimensions int) returns vector as
$$
select
    array(
        select random()::real
        from generate_series(1, dimensions)
    )::vector
$$ language sql;

insert into embeddings_table (embedding,color)
 select random_embedding(512)
  ,(array['red', 'green', 'blue'])[generate_series(1, 2000000)%3+1]
;

This uniform distribution is intentional: it isolates the effect of filtering.

pgvector (HNSW)

Here's the HNSW index that I created in my previous post:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE INDEX ON embeddings_table 
 USING hnsw (embedding vector_cosine_ops)
;

It can take some time. Here is the query I use to monitor progress:

SELECT
  command,
  phase,
  CASE WHEN blocks_total > 0 THEN round(100.0 * blocks_done / blocks_total, 1) ELSE 0 END AS blocks_pct,
  CASE WHEN tuples_total > 0 THEN round(100.0 * tuples_done / tuples_total, 1) ELSE 0 END AS tuples_pct,
  relid::regclass AS table_name,
  index_relid::regclass AS index_name
FROM pg_stat_progress_create_index
;

I might have raised the maintenance_work_mem as indicated by the NOTICE after index creation — it took more than one hour.


NOTICE: hnsw graph no longer fits into maintenance_work_mem after 195123 tuples

CREATE INDEX

Total execution time: 01:20:03.861

I generate another embedding for my query, which I'll store as a psql variable with \gset:

select random_embedding(512) as query
\gset

Without index - Seq Scan and Exact Nearest Neighbor Search

I checked the behavior before the index is created, without using a vector index:

--EXPLAIN (ANALYZE, BUFFERS, VERBOSE off, COSTS off)
SELECT id , color, embedding <=> :'query' nn_cosine
 FROM embeddings_table 
 WHERE color = 'red'
 ORDER BY embedding <=> :'query'
 LIMIT 15
;

Without a vector index, pgvector performs an exact nearest neighbors (ENN) search:

postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
 FROM embeddings_table
 WHERE color = 'red'
 ORDER BY embedding <=> :'query'
 LIMIT 15
;

   id    | color |      nn_cosine
---------+-------+---------------------
 1370302 | red   |  0.2639440793770883
 1048027 | red   | 0.23483715071295053
  804529 | red   | 0.24173628441472828
 1013860 | red   | 0.25945607630683887
 1011571 | red   | 0.23552944123421315
 1514713 | red   |  0.2324248794732996
 1411876 | red   | 0.21627272961247324
 1743091 | red   | 0.25835400937319386
  494461 | red   | 0.23541082932461943
 2128600 | red   |  0.2531862056756401
 2118649 | red   | 0.27055299026445434
 2151097 | red   | 0.25835806753276225
 1988278 | red   |  0.2341769026974998
 1259950 | red   | 0.21806901991480443
 1553203 | red   | 0.24358174125683563
(15 rows)

The execution plan indicates that the color = 'red' filter was applied to the scan result, and the distance was computed from the filtered data to sort and return the top-15.

 Limit (actual time=4088.254..4164.226 rows=15 loops=1)
   Buffers: shared hit=5348519
   ->  Gather Merge (actual time=4088.252..4164.222 rows=15 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=5348519
         ->  Sort (actual time=4058.895..4058.897 rows=11 loops=3)
               Sort Key: ((embedding <=> '[0.51098114,0.8720485,0.8568168,...,0.110820755]'::vector))
               Sort Method: top-N heapsort  Memory: 26kB
               Buffers: shared hit=5348519
               Worker 0:  Sort Method: top-N heapsort  Memory: 26kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
               ->  Parallel Seq Scan on embeddings_table (actual time=0.127..3949.246 rows=222222 loops=3)
                     Filter: (color = 'red'::text)
                     Rows Removed by Filter: 444445
                     Buffers: shared hit=5348445
 Planning Time: 0.078 ms
 Execution Time: 4164.252 ms
(18 rows)

This approach works when the filter is sufficiently selective to enable distance calculations and sorting on a small set of rows. However, with larger or expanding data sets, a quick neighbor search requires an index that can efficiently find neighbors.

With HNSW index - Approximate Nearest Neighbor Index Scan

Once the index is created, the query runs faster, but the result, compared to the previous execution, is incomplete:

postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
 FROM embeddings_table
 WHERE color = 'red'
 ORDER BY embedding <=> :'query'
 LIMIT 15
;
   id    | color |      nn_cosine
---------+-------+---------------------
 1259950 | red   | 0.21806901991480443
  744973 | red   | 0.21565428393281894
  550210 | red   |  0.2085433809181183
 1860280 | red   | 0.24081963623331726
 1794061 | red   | 0.22253304456085543
 1056052 | red   |   0.232418043420476
 1247836 | red   | 0.22905966001837208
 1128376 | red   | 0.24222950580938118
  847906 | red   | 0.24276414827559645
 1782067 | red   |   0.233356563767256
 2023984 | red   | 0.24263831933162294
 1438981 | red   | 0.21360709574533177
(12 rows)

This is what happened:

The HNSW index retrieves a fixed number of candidates
PostgreSQL applies the WHERE filter afterward
Only matching rows are returned, which can result in fewer rows than requested

For example, if the index examines about 40 candidates (ef_search = 40, the default) and the filter matches one third of the rows, we would expect about 13 matching rows on average. This is approximate because the index examines a local region of the vector space, where the metadata distribution may differ from the global distribution.

In this execution plan, 40 candidates were visited (12 returned, 28 filtered out), but only 12 matched the predicate, falling short of the query's LIMIT 15. These candidates are not necessarily the closest neighbors that satisfy the filter. The true nearest neighbors with color = 'red' might lie outside the explored region and were therefore not considered, because the region included other colors.

In this controlled case, where vectors are uniformly distributed, the ANN result is not only smaller but also deviates significantly from the ENN baseline, with very little overlap between the returned rows and the true nearest neighbors in the filtered subset.

In the execution plan, this is evident as the condition on color appears as a Filter applied after the index scan:

 Limit (actual time=2.352..2.496 rows=12 loops=1)
   Buffers: shared hit=1775
   ->  Index Scan using i_hnsw on embeddings_table (actual time=2.351..2.493 rows=12 loops=1)
         Order By: (embedding <=> '[0.51098114,0.8720485,0.8568168,...,0.110820755]'::vector)
         Filter: (color = 'red'::text)
         Rows Removed by Filter: 28
         Buffers: shared hit=1775
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.054 ms
 Execution

The execution plan details show what happened: 12 returned + 28 filtered = 40 candidates, which is ef_search.

HNSW applies filtering after traversal, meaning the graph is searched globally before rows are discarded. This breaks the assumption that the closest neighbors are among the candidates returned by the index, especially when using selective filters, as they often are not.

The mitigations include:

Increasing ef_search. For instance, based on our data distribution, setting hnsw.ef_search = 200 could provide enough candidates so that at least 15 remain after filtering.
Using partial indexes when filtering on a small set of discrete, known values, so each is indexing a specific subset.
Enabling iterative scans to expand the search scope when filtering removes too many candidates. This improves recall but requires multiple scans.

These measures improve recall, though they come with higher costs and require manual calibration because filtering is handled separately from the ANN search.

DiskANN on HorizonDB

Now I create a DiskANN index, designed for filtering during the search:

postgres=> CREATE EXTENSION IF NOT EXISTS pg_diskann;

CREATE EXTENSION

postgres=> CREATE INDEX i_diskann ON embeddings_table
           USING diskann (embedding vector_cosine_ops)
;

CREATE INDEX

Total execution time: 00:42:40.800

In HorizonDB documentation, it is explained that DiskANN enables advanced filtering for combined vector and metadata queries. Depending on selectivity and planner estimates, it can apply filters during traversal or fall back to post-filtering

With DiskANN index - Approximate Nearest Neighbor Index Scan

After creating the index, I run the same query:

postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
 FROM embeddings_table
 WHERE color = 'red'
 ORDER BY embedding <=> :'query'
 LIMIT 15
;

NOTICE:  pg_diskann: Filter selectivity too high (0.3325), skipping filtered vector scan

   id    | color |      nn_cosine
---------+-------+---------------------
 1968337 | red   | 0.24740531343440575
  545257 | red   | 0.22958854707242105
  828484 | red   | 0.24054936837179086
  744973 | red   | 0.21565428393281894
  307819 | red   | 0.23955190885002464
  557251 | red   |  0.2158213914722017
 1860280 | red   | 0.24081963623331726
 1794061 | red   | 0.22253304456085543
 1048324 | red   | 0.24138402291179228
  494461 | red   | 0.23541082932461943
  851149 | red   | 0.26081572380948004
  832813 | red   |  0.2240000332741019
  745516 | red   | 0.24130828891293765
  233791 | red   |  0.2324244758968873
 1247836 | red   | 0.22905966001837208
(15 rows)

The note is interesting: pre-filtering was not used. Still, the result is complete because DiskANN dynamically adapts the number of candidates explored.

This means the filter was not applied during traversal, so the query falls back to post-filtering, as with pgvector. However, unlike HNSW, the number of candidates is not fixed: the engine explores more vectors until enough rows are returned.

In the execution plan, we see that 42 rows were filtered out and 15 were returned, meaning about 57 candidates were explored:

 Limit (actual time=6.607..10.821 rows=15 loops=1)
   Buffers: shared hit=5772
   ->  Index Scan using i_diskann on embeddings_table (actual time=6.606..10.817 rows=15 loops=1)
         Order By: (embedding <=> '[0.51098114,0.8720485,0.8568168,0.3119626,...,0.110820755]'::vector)
         Filter: (color = 'red'::text)
         Rows Removed by Filter: 42
         Buffers: shared hit=5772
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.104 ms
 Execution Time: 10.893 ms
(11 rows)

By varying the LIMIT, we can see that even without pre-filtering, the results are complete and remain fast for common top-n queries:

LIMIT	Rows Removed by Filter	rows=	Buffers	Execution Time
1	1	1	3779	4.215 ms
2	1	2	3790	6.431 ms
5	22	5	4613	8.153 ms
10	29	10	5310	9.841 ms
15	42	15	5722	10.893 ms
100	222	100	15364	39.989 ms
1000	1879	1000	91585	826.303 ms
10000	20202	10000	602407	27827.608 ms

This distinction is crucial: pgvector uses a fixed search budget (ef_search), whereas DiskANN adjusts the number of explored candidates. Even with post-filtering, it can provide enough rows without manual tuning and without incremental scans.

Highly selective query - Exact Nearest Neighbor

If the filter is highly selective, there is no need to use an approximate search:

postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
 FROM embeddings_table
 WHERE color = 'red' and id = 494461
 ORDER BY embedding <=> :'query'
 LIMIT 15
;

NOTICE:  pg_diskann: Filter selectivity too low (0.0000), using filter-only scan

   id   | color |      nn_cosine
--------+-------+---------------------
 494461 | red   | 0.23541082932461943
(1 row)

The execution plan shows that it didn't use the vector index but instead used the B-tree index on id with an additional Sort for exact nearest-neighbor ordering:

 Limit (actual time=0.038..0.039 rows=1 loops=1)
   Buffers: shared hit=12
   ->  Sort (actual time=0.037..0.038 rows=1 loops=1)
         Sort Key: ((embedding <=> '[0.51098114,0.8720485,0.8568168,0.311962,...,0.110820755]'::vector))
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=12
         ->  Index Scan using embeddings_table_pkey on embeddings_table (actual time=0.033..0.034 rows=1 loops=1)
               Index Cond: (id = 494461)
               Filter: (color = 'red'::text)
               Buffers: shared hit=12
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.140 ms
 Execution Time: 0.053 ms
(14 rows)

DiskANN doesn’t even participate when the filtering is highly selective. The planner uses a B-tree index and bypasses ANN.

With DiskANN index - Pre-filtered Approximate Nearest Neighbor

When filters are selective enough to be applied during the scan but not enough to avoid ANN search, there's no NOTICE:

postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
 FROM embeddings_table
 WHERE color = 'red' and id <= 494461
 ORDER BY embedding <=> :'query'
 LIMIT 15
;

   id   | color |      nn_cosine
--------+-------+---------------------
 418054 | red   | 0.21522295998337881
 233791 | red   |  0.2324244758968873
 377470 | red   | 0.23508242287803505
 196009 | red   |  0.2356772244494152
 225253 | red   | 0.24119501226879858
 233026 | red   | 0.23055577367239388
 308998 | red   | 0.21652541614338927
 291283 | red   |  0.2319492505072145
 459442 | red   |  0.2448930593940325
 296956 | red   | 0.21969539585363562
 223876 | red   | 0.24664282569002072
 385669 | red   | 0.21339531139470636
 390397 | red   |  0.2484855378229266
 296467 | red   | 0.22426404565684388
 220285 | red   |  0.2179154018451408
(15 rows)

This is where the execution plan displays a Custom Scan (DiskANNFilteredScan), enabled by PostgreSQL's extensibility. It performs pre-filtering (Filter(IndexScan)) and Approximate Nearest Neighbor search (Vector) to yield the precise number of candidates in the filtered set (Rows Retrieved: 15 count):

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Limit (actual time=25.842..28.876 rows=15 loops=1)
   Buffers: shared hit=11905
   ->  Custom Scan (DiskANNFilteredScan) (actual time=25.841..28.872 rows=15 loops=1)
         Strategy: Filter(IndexScan) -> Vector
         Rows Retrieved: 15 count
         TIDs Collected: 303062 count
         Buffers: shared hit=11905
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.091 ms
 Execution Time: 28.927 ms
(11 rows)

This plan is an ANN with pre-filtering, effective when both the column filter and the embedding filter are selective and need to be combined for efficient execution. The optimizer makes this decision dynamically.

TIDs Collected indicates the number of rows matching the indexable portion of the filter prior to the vector search. According to the cardinalities, the id condition was used for prefiltering:

postgres=> select count(*) FROM embeddings_table 
           WHERE id <= 494461;
 count
--------
 303063

This represents a selectivity of about 15% for the pre-filtered set. The remaining condition (color = 'red') is applied afterward, reducing the result further to about 5% of the table.

Why this matters

These experiments reveal a different execution model, not just performance differences. With pgvector and HNSW, search is driven by a fixed budget: exploring a set number of candidates (ef_search) with the filter applied afterward. If the filter removes many candidates, fewer rows are returned, lowering recall. Achieving correct results requires tuning (ef_search), enabling iterative scans, or creating dedicated partial indexes.

With DiskANN, there is no fixed number of candidates set. Instead, it adapts based on the query. When the filter isn't sufficiently selective, it even bypasses the filtered ANN, reverting to vector search and post-filtering. Unlike HNSW, the number of candidates explored isn't fixed, allowing the engine to return the desired number of rows.

When the filter is very selective, such as id = constant, the optimizer skips the ANN and instead uses a standard index scan, calculating distances on the smaller result set.

When the filter is moderately selective, the execution model changes. DiskANN incorporates the predicate into traversal with a filtered ANN scan (DiskANNFilteredScan). Search is limited to the filtered subset, retrieving exactly the requested neighbors without exploring irrelevant regions.

DiskANN chooses execution modes based on filter selectivity, adjusting the effort required to return accurate results. Here are the execution strategies observed in this example:

Predicate	Selectivity (approx.)	Strategy	What happens
`color = 'red'`	~33% (>12%)	ANN + post-filter	Filtered ANN is skipped. The engine explores more candidates (~57 to get 15) instead of a fixed budget and returns the expected rows.
`id = constant`	~0% (<1%)	Exact index scan	ANN is bypassed. The primary key index is used, and distances are computed on the small result set.
`color='red' AND id <= ...`	~5% (<12%)	Filtered ANN	The indexable predicate is used to collect a filtered candidate set (`DiskANNFilteredScan`), and ANN runs within that reduced set.

It is also interesting to note that, on the same dataset, the DiskANN index is not significantly larger than the HNSW index:

postgres=> SELECT indexrelid::regclass AS index_name, pg_size_pretty(pg_relation_size(indexrelid)) AS size, am.amname AS access_method
FROM pg_index i JOIN pg_class c ON c.oid = i.indexrelid JOIN pg_am am ON c.relam = am.oid WHERE indrelid = 'embeddings_table'::regclass;
 index_name |  size   | access_method
------------+---------+---------------
 i_hnsw     | 5208 MB | hnsw
 i_diskann  | 5545 MB | diskann

The behavior observed in the execution plans is actually governed by configuration parameters:

postgres=> \dconfig diskann.*

           List of configuration parameters

              Parameter               |     Value
--------------------------------------+---------------
 diskann.cost_reduced_beta            | 0.8
 diskann.enable_filter_hook           | on
 diskann.enable_legacy_vector_compat  | on
 diskann.enable_neighbor_prefetch     | on
 diskann.filtering_beta               | 0.8
 diskann.iterative_search             | Relaxed_Order
 diskann.l_value_is                   | 100
 diskann.selectivity_min              | 0.01
 diskann.selectivity_threshold        | 0.12
 diskann.skip_prefetch_buffer_divisor | 2
 diskann.skip_prefetch_if_cached      | on
 diskann.verbose_explain              | off
(12 rows)

For example, diskann.selectivity_threshold = 0.12 explains why a filter that matches about one-third of the rows skips the filtered ANN, and the combined predicate (with an estimated selectivity around 5%) is below the threshold. selectivity_min = 0.01 explains why the condition on one row over two million bypasses the ANN entirely. Based on the estimated selectivity, values between those thresholds are eligible for a filtered scan (DiskANNFilteredScan).

Those parameter can be tuned, here is the documentation:

Scalable Vector Indexing with DiskANN - Azure HorizonDB | Microsoft Learn

Use the pg_diskann extension to enable scalable, high-performance vector indexing in Azure HorizonDB for efficient semantic similarity search in large datasets, with advanced filtering for combined vector and metadata queries.

learn.microsoft.com

Conclusion

HNSW requires tuning a parameter (ef_search) per query, and always uses post-filtering, which can lead to incomplete results. DiskANN hides much of this complexity behind the optimizer. It adapts the number of candidates explored, applies filtering during the search when beneficial, and avoids ANN altogether when unnecessary.

This difference matters because real-world applications rarely perform vector searches without filters. Multi-tenancy, time windows, geography, access control, status, language, or category filters all mean that nearest neighbors must be identified within a subset rather than across the entire dataset. Nobody wants to scan index entries belonging to others in multi-tenant environments.

DiskANN uses pgvector semantics—the same data types, operators (<=> for cosine distance, <-> for L2, <#> for inner product), and ORDER BY ... LIMIT syntax—so queries remain the same. Only the index changes to improve performance and recall in Azure database services.

DEV Community