In a previous post, I explained that using filtering with pgvector can decrease recall in approximate nearest neighbor (ANN) searches.
This post repeats the same experiment and dataset but compares two methods:
- pgvector with HNSW, a popular PostgreSQL extension
- HorizonDB with DiskANN, Microsoft’s vector index
The goal is to understand what happens when similarity search is combined with filtering.
Both setups were tested on a HorizonDB instance with 2 vCores and 16 GiB RAM in Azure public preview, where I activated the extensions by adding them to azure.extensions.
In real applications, SQL statements without a WHERE clause are rare, and typical queries combine a filter alongside a similarity search:
SELECT *
FROM embeddings
WHERE tenant_id = 42
ORDER BY embedding <=> :vector_query
LIMIT 10;
You don’t search the entire dataset. Instead, you focus on a specific subset, such as one tenant, recent data, or a category.
Experiment Setup
As in the previous post, I inserted 2 million rows of 512‑dimensional vectors with a simple metadata column (color, evenly distributed):
CREATE TABLE embeddings_table (
id bigserial PRIMARY KEY,
color text,
embedding vector(512)
);
I generated random vectors to insert the embeddings, and three values for color:
create function random_embedding(dimensions int) returns vector as
$$
select
array(
select random()::real
from generate_series(1, dimensions)
)::vector
$$ language sql;
insert into embeddings_table (embedding,color)
select random_embedding(512)
,(array['red', 'green', 'blue'])[generate_series(1, 2000000)%3+1]
;
This uniform distribution is intentional: it isolates the effect of filtering.
pgvector (HNSW)
Here's the HNSW index that I created in my previous post:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE INDEX ON embeddings_table
USING hnsw (embedding vector_cosine_ops)
;
It can take some time. Here is the query I use to monitor progress:
SELECT
command,
phase,
CASE WHEN blocks_total > 0 THEN round(100.0 * blocks_done / blocks_total, 1) ELSE 0 END AS blocks_pct,
CASE WHEN tuples_total > 0 THEN round(100.0 * tuples_done / tuples_total, 1) ELSE 0 END AS tuples_pct,
relid::regclass AS table_name,
index_relid::regclass AS index_name
FROM pg_stat_progress_create_index
;
I might have raised the maintenance_work_mem as indicated by the NOTICE after index creation — it took more than one hour.
NOTICE: hnsw graph no longer fits into maintenance_work_mem after 195123 tuples
CREATE INDEX
Total execution time: 01:20:03.861
I generate another embedding for my query, which I'll store as a psql variable with \gset:
select random_embedding(512) as query
\gset
Without index - Seq Scan and Exact Nearest Neighbor Search
I checked the behavior before the index is created, without using a vector index:
--EXPLAIN (ANALYZE, BUFFERS, VERBOSE off, COSTS off)
SELECT id , color, embedding <=> :'query' nn_cosine
FROM embeddings_table
WHERE color = 'red'
ORDER BY embedding <=> :'query'
LIMIT 15
;
Without a vector index, pgvector performs an exact nearest neighbors (ENN) search:
postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
FROM embeddings_table
WHERE color = 'red'
ORDER BY embedding <=> :'query'
LIMIT 15
;
id | color | nn_cosine
---------+-------+---------------------
1370302 | red | 0.2639440793770883
1048027 | red | 0.23483715071295053
804529 | red | 0.24173628441472828
1013860 | red | 0.25945607630683887
1011571 | red | 0.23552944123421315
1514713 | red | 0.2324248794732996
1411876 | red | 0.21627272961247324
1743091 | red | 0.25835400937319386
494461 | red | 0.23541082932461943
2128600 | red | 0.2531862056756401
2118649 | red | 0.27055299026445434
2151097 | red | 0.25835806753276225
1988278 | red | 0.2341769026974998
1259950 | red | 0.21806901991480443
1553203 | red | 0.24358174125683563
(15 rows)
The execution plan indicates that the color = 'red' filter was applied to the scan result, and the distance was computed from the filtered data to sort and return the top-15.
Limit (actual time=4088.254..4164.226 rows=15 loops=1)
Buffers: shared hit=5348519
-> Gather Merge (actual time=4088.252..4164.222 rows=15 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=5348519
-> Sort (actual time=4058.895..4058.897 rows=11 loops=3)
Sort Key: ((embedding <=> '[0.51098114,0.8720485,0.8568168,...,0.110820755]'::vector))
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=5348519
Worker 0: Sort Method: top-N heapsort Memory: 26kB
Worker 1: Sort Method: top-N heapsort Memory: 26kB
-> Parallel Seq Scan on embeddings_table (actual time=0.127..3949.246 rows=222222 loops=3)
Filter: (color = 'red'::text)
Rows Removed by Filter: 444445
Buffers: shared hit=5348445
Planning Time: 0.078 ms
Execution Time: 4164.252 ms
(18 rows)
This approach works when the filter is sufficiently selective to enable distance calculations and sorting on a small set of rows. However, with larger or expanding data sets, a quick neighbor search requires an index that can efficiently find neighbors.
With HNSW index - Approximate Nearest Neighbor Index Scan
Once the index is created, the query runs faster, but the result, compared to the previous execution, is incomplete:
postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
FROM embeddings_table
WHERE color = 'red'
ORDER BY embedding <=> :'query'
LIMIT 15
;
id | color | nn_cosine
---------+-------+---------------------
1259950 | red | 0.21806901991480443
744973 | red | 0.21565428393281894
550210 | red | 0.2085433809181183
1860280 | red | 0.24081963623331726
1794061 | red | 0.22253304456085543
1056052 | red | 0.232418043420476
1247836 | red | 0.22905966001837208
1128376 | red | 0.24222950580938118
847906 | red | 0.24276414827559645
1782067 | red | 0.233356563767256
2023984 | red | 0.24263831933162294
1438981 | red | 0.21360709574533177
(12 rows)
This is what happened:
- The HNSW index retrieves a fixed number of candidates
- PostgreSQL applies the WHERE filter afterward
- Only matching rows are returned, which can result in fewer rows than requested
For example, if the index examines about 40 candidates (ef_search = 40, the default) and the filter matches one third of the rows, we would expect about 13 matching rows on average. This is approximate because the index examines a local region of the vector space, where the metadata distribution may differ from the global distribution.
In this execution plan, 40 candidates were visited (12 returned, 28 filtered out), but only 12 matched the predicate, falling short of the query's LIMIT 15. These candidates are not necessarily the closest neighbors that satisfy the filter. The true nearest neighbors with color = 'red' might lie outside the explored region and were therefore not considered, because the region included other colors.
In this controlled case, where vectors are uniformly distributed, the ANN result is not only smaller but also deviates significantly from the ENN baseline, with very little overlap between the returned rows and the true nearest neighbors in the filtered subset.
In the execution plan, this is evident as the condition on color appears as a Filter applied after the index scan:
Limit (actual time=2.352..2.496 rows=12 loops=1)
Buffers: shared hit=1775
-> Index Scan using i_hnsw on embeddings_table (actual time=2.351..2.493 rows=12 loops=1)
Order By: (embedding <=> '[0.51098114,0.8720485,0.8568168,...,0.110820755]'::vector)
Filter: (color = 'red'::text)
Rows Removed by Filter: 28
Buffers: shared hit=1775
Planning:
Buffers: shared hit=1
Planning Time: 0.054 ms
Execution
The execution plan details show what happened: 12 returned + 28 filtered = 40 candidates, which is ef_search.
HNSW applies filtering after traversal, meaning the graph is searched globally before rows are discarded. This breaks the assumption that the closest neighbors are among the candidates returned by the index, especially when using selective filters, as they often are not.
The mitigations include:
- Increasing
ef_search. For instance, based on our data distribution, settinghnsw.ef_search = 200could provide enough candidates so that at least 15 remain after filtering. - Using partial indexes when filtering on a small set of discrete, known values, so each is indexing a specific subset.
- Enabling iterative scans to expand the search scope when filtering removes too many candidates. This improves recall but requires multiple scans.
These measures improve recall, though they come with higher costs and require manual calibration because filtering is handled separately from the ANN search.
DiskANN on HorizonDB
Now I create a DiskANN index, designed for filtering during the search:
postgres=> CREATE EXTENSION IF NOT EXISTS pg_diskann;
CREATE EXTENSION
postgres=> CREATE INDEX i_diskann ON embeddings_table
USING diskann (embedding vector_cosine_ops)
;
CREATE INDEX
Total execution time: 00:42:40.800
In HorizonDB documentation, it is explained that DiskANN enables advanced filtering for combined vector and metadata queries. Depending on selectivity and planner estimates, it can apply filters during traversal or fall back to post-filtering
With DiskANN index - Approximate Nearest Neighbor Index Scan
After creating the index, I run the same query:
postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
FROM embeddings_table
WHERE color = 'red'
ORDER BY embedding <=> :'query'
LIMIT 15
;
NOTICE: pg_diskann: Filter selectivity too high (0.3325), skipping filtered vector scan
id | color | nn_cosine
---------+-------+---------------------
1968337 | red | 0.24740531343440575
545257 | red | 0.22958854707242105
828484 | red | 0.24054936837179086
744973 | red | 0.21565428393281894
307819 | red | 0.23955190885002464
557251 | red | 0.2158213914722017
1860280 | red | 0.24081963623331726
1794061 | red | 0.22253304456085543
1048324 | red | 0.24138402291179228
494461 | red | 0.23541082932461943
851149 | red | 0.26081572380948004
832813 | red | 0.2240000332741019
745516 | red | 0.24130828891293765
233791 | red | 0.2324244758968873
1247836 | red | 0.22905966001837208
(15 rows)
The note is interesting: pre-filtering was not used. Still, the result is complete because DiskANN dynamically adapts the number of candidates explored.
This means the filter was not applied during traversal, so the query falls back to post-filtering, as with pgvector. However, unlike HNSW, the number of candidates is not fixed: the engine explores more vectors until enough rows are returned.
In the execution plan, we see that 42 rows were filtered out and 15 were returned, meaning about 57 candidates were explored:
Limit (actual time=6.607..10.821 rows=15 loops=1)
Buffers: shared hit=5772
-> Index Scan using i_diskann on embeddings_table (actual time=6.606..10.817 rows=15 loops=1)
Order By: (embedding <=> '[0.51098114,0.8720485,0.8568168,0.3119626,...,0.110820755]'::vector)
Filter: (color = 'red'::text)
Rows Removed by Filter: 42
Buffers: shared hit=5772
Planning:
Buffers: shared hit=2
Planning Time: 0.104 ms
Execution Time: 10.893 ms
(11 rows)
By varying the LIMIT, we can see that even without pre-filtering, the results are complete and remain fast for common top-n queries:
| LIMIT | Rows Removed by Filter | rows= | Buffers | Execution Time |
|---|---|---|---|---|
| 1 | 1 | 1 | 3779 | 4.215 ms |
| 2 | 1 | 2 | 3790 | 6.431 ms |
| 5 | 22 | 5 | 4613 | 8.153 ms |
| 10 | 29 | 10 | 5310 | 9.841 ms |
| 15 | 42 | 15 | 5722 | 10.893 ms |
| 100 | 222 | 100 | 15364 | 39.989 ms |
| 1000 | 1879 | 1000 | 91585 | 826.303 ms |
| 10000 | 20202 | 10000 | 602407 | 27827.608 ms |
This distinction is crucial: pgvector uses a fixed search budget (ef_search), whereas DiskANN adjusts the number of explored candidates. Even with post-filtering, it can provide enough rows without manual tuning and without incremental scans.
Highly selective query - Exact Nearest Neighbor
If the filter is highly selective, there is no need to use an approximate search:
postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
FROM embeddings_table
WHERE color = 'red' and id = 494461
ORDER BY embedding <=> :'query'
LIMIT 15
;
NOTICE: pg_diskann: Filter selectivity too low (0.0000), using filter-only scan
id | color | nn_cosine
--------+-------+---------------------
494461 | red | 0.23541082932461943
(1 row)
The execution plan shows that it didn't use the vector index but instead used the B-tree index on id with an additional Sort for exact nearest-neighbor ordering:
Limit (actual time=0.038..0.039 rows=1 loops=1)
Buffers: shared hit=12
-> Sort (actual time=0.037..0.038 rows=1 loops=1)
Sort Key: ((embedding <=> '[0.51098114,0.8720485,0.8568168,0.311962,...,0.110820755]'::vector))
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=12
-> Index Scan using embeddings_table_pkey on embeddings_table (actual time=0.033..0.034 rows=1 loops=1)
Index Cond: (id = 494461)
Filter: (color = 'red'::text)
Buffers: shared hit=12
Planning:
Buffers: shared hit=2
Planning Time: 0.140 ms
Execution Time: 0.053 ms
(14 rows)
DiskANN doesn’t even participate when the filtering is highly selective. The planner uses a B-tree index and bypasses ANN.
With DiskANN index - Pre-filtered Approximate Nearest Neighbor
When filters are selective enough to be applied during the scan but not enough to avoid ANN search, there's no NOTICE:
postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
FROM embeddings_table
WHERE color = 'red' and id <= 494461
ORDER BY embedding <=> :'query'
LIMIT 15
;
id | color | nn_cosine
--------+-------+---------------------
418054 | red | 0.21522295998337881
233791 | red | 0.2324244758968873
377470 | red | 0.23508242287803505
196009 | red | 0.2356772244494152
225253 | red | 0.24119501226879858
233026 | red | 0.23055577367239388
308998 | red | 0.21652541614338927
291283 | red | 0.2319492505072145
459442 | red | 0.2448930593940325
296956 | red | 0.21969539585363562
223876 | red | 0.24664282569002072
385669 | red | 0.21339531139470636
390397 | red | 0.2484855378229266
296467 | red | 0.22426404565684388
220285 | red | 0.2179154018451408
(15 rows)
This is where the execution plan displays a Custom Scan (DiskANNFilteredScan), enabled by PostgreSQL's extensibility. It performs pre-filtering (Filter(IndexScan)) and Approximate Nearest Neighbor search (Vector) to yield the precise number of candidates in the filtered set (Rows Retrieved: 15 count):
QUERY PLAN
--------------------------------------------------------------------------------------
Limit (actual time=25.842..28.876 rows=15 loops=1)
Buffers: shared hit=11905
-> Custom Scan (DiskANNFilteredScan) (actual time=25.841..28.872 rows=15 loops=1)
Strategy: Filter(IndexScan) -> Vector
Rows Retrieved: 15 count
TIDs Collected: 303062 count
Buffers: shared hit=11905
Planning:
Buffers: shared hit=2
Planning Time: 0.091 ms
Execution Time: 28.927 ms
(11 rows)
This plan is an ANN with pre-filtering, effective when both the column filter and the embedding filter are selective and need to be combined for efficient execution. The optimizer makes this decision dynamically.
TIDs Collected indicates the number of rows matching the indexable portion of the filter prior to the vector search. According to the cardinalities, the id condition was used for prefiltering:
postgres=> select count(*) FROM embeddings_table
WHERE id <= 494461;
count
--------
303063
This represents a selectivity of about 15% for the pre-filtered set. The remaining condition (color = 'red') is applied afterward, reducing the result further to about 5% of the table.
Why this matters
These experiments reveal a different execution model, not just performance differences. With pgvector and HNSW, search is driven by a fixed budget: exploring a set number of candidates (ef_search) with the filter applied afterward. If the filter removes many candidates, fewer rows are returned, lowering recall. Achieving correct results requires tuning (ef_search), enabling iterative scans, or creating dedicated partial indexes.
With DiskANN, there is no fixed number of candidates set. Instead, it adapts based on the query. When the filter isn't sufficiently selective, it even bypasses the filtered ANN, reverting to vector search and post-filtering. Unlike HNSW, the number of candidates explored isn't fixed, allowing the engine to return the desired number of rows.
When the filter is very selective, such as id = constant, the optimizer skips the ANN and instead uses a standard index scan, calculating distances on the smaller result set.
When the filter is moderately selective, the execution model changes. DiskANN incorporates the predicate into traversal with a filtered ANN scan (DiskANNFilteredScan). Search is limited to the filtered subset, retrieving exactly the requested neighbors without exploring irrelevant regions.
DiskANN chooses execution modes based on filter selectivity, adjusting the effort required to return accurate results. Here are the execution strategies observed in this example:
| Predicate | Selectivity (approx.) | Strategy | What happens |
|---|---|---|---|
color = 'red' |
~33% (>12%) | ANN + post-filter | Filtered ANN is skipped. The engine explores more candidates (~57 to get 15) instead of a fixed budget and returns the expected rows. |
id = constant |
~0% (<1%) | Exact index scan | ANN is bypassed. The primary key index is used, and distances are computed on the small result set. |
color='red' AND id <= ... |
~5% (<12%) | Filtered ANN | The indexable predicate is used to collect a filtered candidate set (DiskANNFilteredScan), and ANN runs within that reduced set. |
It is also interesting to note that, on the same dataset, the DiskANN index is not significantly larger than the HNSW index:
postgres=> SELECT indexrelid::regclass AS index_name, pg_size_pretty(pg_relation_size(indexrelid)) AS size, am.amname AS access_method
FROM pg_index i JOIN pg_class c ON c.oid = i.indexrelid JOIN pg_am am ON c.relam = am.oid WHERE indrelid = 'embeddings_table'::regclass;
index_name | size | access_method
------------+---------+---------------
i_hnsw | 5208 MB | hnsw
i_diskann | 5545 MB | diskann
The behavior observed in the execution plans is actually governed by configuration parameters:
postgres=> \dconfig diskann.*
List of configuration parameters
Parameter | Value
--------------------------------------+---------------
diskann.cost_reduced_beta | 0.8
diskann.enable_filter_hook | on
diskann.enable_legacy_vector_compat | on
diskann.enable_neighbor_prefetch | on
diskann.filtering_beta | 0.8
diskann.iterative_search | Relaxed_Order
diskann.l_value_is | 100
diskann.selectivity_min | 0.01
diskann.selectivity_threshold | 0.12
diskann.skip_prefetch_buffer_divisor | 2
diskann.skip_prefetch_if_cached | on
diskann.verbose_explain | off
(12 rows)
For example, diskann.selectivity_threshold = 0.12 explains why a filter that matches about one-third of the rows skips the filtered ANN, and the combined predicate (with an estimated selectivity around 5%) is below the threshold. selectivity_min = 0.01 explains why the condition on one row over two million bypasses the ANN entirely. Based on the estimated selectivity, values between those thresholds are eligible for a filtered scan (DiskANNFilteredScan).
Conclusion
HNSW requires tuning a parameter (ef_search) per query, and always uses post-filtering, which can lead to incomplete results. DiskANN hides much of this complexity behind the optimizer. It adapts the number of candidates explored, applies filtering during the search when beneficial, and avoids ANN altogether when unnecessary.
This difference matters because real-world applications rarely perform vector searches without filters. Multi-tenancy, time windows, geography, access control, status, language, or category filters all mean that nearest neighbors must be identified within a subset rather than across the entire dataset. Nobody wants to scan index entries belonging to others in multi-tenant environments.
DiskANN uses pgvector semantics—the same data types, operators (<=> for cosine distance, <-> for L2, <#> for inner product), and ORDER BY ... LIMIT syntax—so queries remain the same. Only the index changes to improve performance and recall in Azure database services.





Top comments (0)