DEV Community

Arya Koste

Filtered Vector Search: Where Every Benchmark Quietly Lies to You

Let me show you a magic trick every vector database benchmark performs.

Step one: run a query with no filter. Just "find me the nearest neighbors to this vector." Step two: report the gorgeous QPS numbers and the 95% recall. Step three: take a bow while everyone claps.

Here is the query you actually run in production:

nearest neighbors to <embedding>
WHERE tenant_id = 'acme'
  AND status = 'active'
  AND created_at > '2026-01-01'

And here is where the applause stops. Latency triples. Recall quietly falls off a cliff. You ask for ten results and get three, and nobody notices until a user files a ticket that just says "search feels broken." No stack trace. No error. Just vibes, and the vibes are bad.

This is filtered vector search, and it is the single most under-discussed part of the whole retrieval stack. People will argue for three hours about pgvector versus Pinecone and then hand-wave the one thing that actually decides whether their search works: what happens when you bolt a WHERE clause onto an approximate nearest neighbor search.

So let's talk about it. Not which database to buy. The actual mechanics of why filtering a vector search is weirdly, unintuitively hard, the three ways to do it, why two of them fail in exactly opposite situations, and the newer trick that makes a filter actually speed your search up. Stick around for that last part, it's genuinely delightful.

Why your SQL instincts are lying to you

Quick gut check. In a normal database, does adding WHERE status = 'active' make a query faster or slower?

Faster, obviously. You've got an index on status, you look up the matching rows, you scan less stuff. Filtering is a discount. This instinct has served you well for your entire career.

Throw it in the bin. It does not apply here, and understanding why is the whole game.

Vector search over millions of embeddings is only fast because of a structure like HNSW (Hierarchical Navigable Small World). Picture a graph where every vector is a dot, and each dot is wired to its nearest neighbors. A search drops in at an entry point and walks the graph greedily, hopping dot to dot toward your query vector, touching a tiny sliver of the total data. That walk is the magic. It's what turns "compare against all 10 million vectors" into "visit maybe 1,200 of them."

Now add a filter. Suddenly a bunch of dots are off-limits because they don't match tenant_id = 'acme'. But here's the thing: the graph's wiring was built assuming every dot is fair game. Rip out the forbidden dots and you punch holes in the connectivity the walk depends on. The route the search needed might run straight through nodes that no longer exist.

HNSW graph before and after a filter: on the left a clean path runs from entry to target; on the right, filtered-out nodes sever that path

Look at the right side. The path the search wanted to take got cut because the filter deleted the stepping stones. The search is now standing on one side of a river that used to have a bridge. That's the core tension of this entire topic: the index that makes vector search fast assumes nobody's filtering, and every filter is a small betrayal of that assumption.
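
To make the severed-bridge picture concrete, here is a minimal sketch on a toy four-node graph. It is not HNSW (no layers, no distances); the node names and the `reachable` helper are made up for illustration. It only shows that removing one filtered-out node can cut the entry point off from the target.

```python
from collections import deque

def reachable(graph, start, allowed):
    """Return every allowed node reachable from start while stepping only on allowed nodes."""
    if start not in allowed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt in allowed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Toy chain: entry <-> a <-> b <-> target (hypothetical node names).
graph = {
    "entry": ["a"],
    "a": ["entry", "b"],
    "b": ["a", "target"],
    "target": ["b"],
}
everything = set(graph)
without_b = everything - {"b"}  # pretend the filter excludes node "b"

print("target" in reachable(graph, "entry", everything))  # True
print("target" in reachable(graph, "entry", without_b))   # False
```

The target still matches the filter in the second case. The walk just has no legal way to get there.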

There are three ways to cope. They are not equally good, and the two obvious ones fail in beautifully opposite ways.

The three strategies, at a glance

Three strategies compared: post-filtering searches then discards and breaks on restrictive filters; pre-filtering filters then searches and breaks on large subsets; filter-aware filters during traversal and handles the messy middle

Keep this picture in your head as we go. Post-filtering breaks at the restrictive end of the selectivity range, where few points match. Pre-filtering breaks at the other end, where the matching subset is large. Only the third one covers the middle where your real queries live. Now the details.

Strategy 1: Post-filtering (the "search now, apologize later" approach)

The obvious move. Run the normal vector search, get your candidates, then chuck the ones that don't match.

# Post-filtering
candidates = ann_search(query_vector, k=100)   # filter? what filter?
results = [c for c in candidates if c.status == "active"][:10]

Simple. No special index magic required. And it's totally fine as long as your filter is loose, meaning most of the dataset passes it. If 80% of your rows are active, your top 100 candidates will contain plenty of active ones. Everybody's happy.

Now watch it detonate. Suppose only 1% of your points match. You pull 100 candidates from the unfiltered search, and on average... one of them is active. You asked for ten results. You got one. The actual ten closest active vectors? Never in the running. They were sitting further out in the graph, politely waiting behind a wall of inactive points that hogged your top 100.

"Just fetch more," you say, sweat forming. "Grab the top 1,000." Sure, sometimes that rescues you. You're also now doing 10x the search work, your latency is climbing, and against a really nasty filter you're still just gambling with extra steps. Over-fetching isn't a fix, it's a bribe you pay the problem to leave you alone, and the bribe scales with exactly how selective your filter is.

Post-filtering dies when filters are restrictive. The more you filter, the more it flails.
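
The arithmetic behind that is short enough to write down. This is a back-of-the-envelope sketch that assumes matching points are spread evenly through the candidate list, which is optimistic; the function names are made up for illustration.

```python
import math

def expected_matches(fetch_k, matching, total):
    """Average number of candidates that survive the filter, if matches are evenly spread."""
    return fetch_k * matching / total

def fetch_size_for(want, matching, total):
    """Candidates to fetch so that, on average, `want` of them pass the filter."""
    return math.ceil(want * total / matching)

total = 1_000_000
print(expected_matches(100, 800_000, total))  # loose filter, 80% match: 80.0
print(expected_matches(100, 10_000, total))   # tight filter, 1% match: 1.0
print(fetch_size_for(10, 10_000, total))      # 1000 candidates for ten results, on average
```

Note the "on average". Half the time you land under the target, and if the matches happen to sit far from the query, the even-spread assumption breaks and the real number is worse.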

Strategy 2: Pre-filtering (the "opposite mistake")

Fine, do the reverse. Find everything matching the filter first, then search only inside that subset.

# Pre-filtering
allowed_ids = metadata_index.lookup(status="active")   # inverted index, etc.
results = ann_search(query_vector, k=10, restrict_to=allowed_ids)

For a restrictive filter this sounds perfect, and for small subsets it genuinely is chef's kiss. If the filter narrows you down to a few thousand vectors, just brute-force the search over those and get exact, high-recall results, fast. This is exactly why good engines have a "if the subset is tiny, skip the fancy graph and just scan it" fallback. Correct call.
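
A minimal sketch of that small-subset fallback, in plain Python with Euclidean distance. The function name, the dict-of-tuples layout, and the toy data are assumptions for illustration; a real engine would do this over packed arrays, not a Python dict.

```python
import heapq
import math

def brute_force_filtered(query, vectors, allowed_ids, k):
    """Exact top-k over the allowed subset only: score every allowed vector, keep the k closest."""
    return heapq.nsmallest(k, allowed_ids, key=lambda i: math.dist(query, vectors[i]))

# Toy data: id -> 2-d vector (hypothetical).
vectors = {
    0: (0.0, 0.0),
    1: (1.0, 0.0),
    2: (0.1, 0.0),
    3: (5.0, 5.0),
}
allowed_ids = [1, 2, 3]  # id 0 is the closest point overall, but it fails the filter

print(brute_force_filtered((0.0, 0.0), vectors, allowed_ids, k=2))  # [2, 1]
```

No graph, no connectivity to break. The cost is one distance computation per allowed point, which is why this only makes sense while the subset stays small.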

But scale that allowed set up and the whole thing flips on you. Remember the graph runs on connectivity. When you restrict the walk to a subset, you're yanking all the non-matching dots out mid-search. On a big dataset where the filter still matches, say, a few hundred thousand scattered points, you fragment the graph exactly like our first diagram: the greedy walk keeps trying to route through excluded nodes, hits dead ends, and either returns garbage or collapses into something slow. The logarithmic magic is gone. You're back to expensive.

So pre-filtering is the exact mirror image of post-filtering. Post handles loose filters and dies on tight ones. Pre handles tiny result sets and dies on large ones. Which leaves a lovely canyon in the middle, a moderately selective filter over a large collection, where both approaches whimper. Guess where basically every real production query lives? The canyon. Everyone lives in the canyon.

Pre-filtering dies when the filtered subset is large and scattered.

Strategy 3: Filter-aware search (the one that actually works)

The grown-up answer is to stop treating the filter as a thing that happens before or after the search, and fold it into the walk itself. The graph traversal still happens, but now the filter rides along and steers: when the search picks which neighbor to hop to next, it's already accounting for the filter instead of ignoring it (post) or being blindly walled off by it (pre).

Two flavors worth knowing, because they map onto real toggles you'll flip in production.

Filter-aware edges, built ahead of time. One approach adds extra edges to the graph based on the fields you'll filter on. For a field like category, the index quietly builds connected subgraphs per value, so when you later filter category = 'laptop', there's already a fully-connected laptop-only graph to walk. Search quality holds up even under tight filters, with basically no runtime tax, because the work happened at build time. The catch is combinations: the index can pre-wire category and brand separately, but it can't pre-wire every category AND brand AND price combo, the math explodes. So one strict filter is smooth, but stack two or three high-selectivity filters and you can still fragment the graph, just fashionably later than you would have.

Adaptive traversal, decided at query time. A newer family (the one you'll hear named in 2026 is ACORN) skips the "know your filters in advance" requirement. The trick: during the walk, if a node's direct neighbors are all filtered out, don't quit, peek at the neighbors of those neighbors. A two-hop jump. This lets the search vault over filtered-out regions and stay connected even for filter combos the index was never pre-wired for. The honest cost is that exploring more nodes takes more time per query, so you switch this on for the genuinely gnarly cases (complex, unpredictable, highly selective filters), not for every lookup. In practice it shows up as a per-query flag, precisely because it's a scalpel, not a bulldozer.
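
Here is the two-hop idea as a tiny sketch of just the neighbor-expansion step. It follows the description above literally and is not ACORN's actual algorithm or any engine's implementation; the `candidate_neighbors` name and the toy graph are made up for illustration.

```python
def candidate_neighbors(graph, node, passes_filter):
    """Neighbors the walk may hop to next: direct matches if any, otherwise two-hop matches."""
    direct = [n for n in graph[node] if passes_filter(n)]
    if direct:
        return direct
    # Every direct neighbor is filtered out, so look one hop further instead of giving up.
    seen = {node}
    two_hop = []
    for mid in graph[node]:
        for far in graph[mid]:
            if far not in seen and passes_filter(far):
                seen.add(far)
                two_hop.append(far)
    return two_hop

# Toy graph: "q" is only wired to "x" and "y", and both fail the filter.
graph = {
    "q": ["x", "y"],
    "x": ["q", "m1"],
    "y": ["q", "m2", "m1"],
    "m1": ["x", "y"],
    "m2": ["y"],
}
allowed = {"q", "m1", "m2"}

print(candidate_neighbors(graph, "q", lambda n: n in allowed))  # ['m1', 'm2']
```

The nested loop is the cost mentioned above in miniature: the fallback touches neighbors-of-neighbors, which is many more nodes than a plain one-hop expansion.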

This is the only strategy that covers the canyon. And it comes with a payoff that feels like cheating.

The good part: a filter that makes search faster

With the naive strategies, filtering is pure tax. More work, more latency, more sighing. With real filter-aware search, the opposite can happen: when the filter genuinely shrinks the pool the search has to consider, and the engine prunes that work during the walk instead of after, adding a filter can lower your latency below the unfiltered query.

Latency versus filter selectivity: as filters tighten, post-filtering latency climbs, pre-filtering stays flat then fine on tiny sets, and filter-aware latency drops because pruning means less work

Look at that green line going down. Reported behavior on integrated engines follows this shape: an unfiltered query sits at some baseline, a filter that matches 50% of the points pulls latency down, and a filter that matches 1% pulls it down further, because the search is simply doing less. That downward slope is the tell of a real filter-aware implementation. It's also a fantastic diagnostic when you're kicking the tires on an engine:

Run your workload filtered, then unfiltered. Watch which way latency moves. If filtering your data makes it faster, the engine is filtering inside the algorithm, that's the good stuff. If filtering reliably makes it slower, it's quietly doing pre- or post-filtering behind your back, and you should plan accordingly.
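
A rough harness for that check might look like the sketch below. `run_query` stands in for whatever call your client makes (a hypothetical callable, not a real library API), and the 10% tolerance is an arbitrary noise margin you should tune for your setup.

```python
import statistics
import time

def median_latency_ms(run_query, queries, repeats=5):
    """Median wall-clock latency in milliseconds across all queries and repeats."""
    samples = []
    for q in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(q)
            samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

def filter_effect(unfiltered_ms, filtered_ms, tolerance=0.10):
    """Classify which way latency moved once the filter was added."""
    ratio = filtered_ms / unfiltered_ms
    if ratio < 1 - tolerance:
        return "faster"
    if ratio > 1 + tolerance:
        return "slower"
    return "about the same"

print(filter_effect(unfiltered_ms=10.0, filtered_ms=6.0))   # faster
print(filter_effect(unfiltered_ms=10.0, filtered_ms=30.0))  # slower
```

Call `median_latency_ms` once with a function that sends your queries unfiltered and once with one that sends them filtered, then hand both numbers to `filter_effect`. Use the same query vectors for both runs, and warm the cache first so you aren't measuring cold starts.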

The gotchas nobody puts in the quickstart

Understanding the strategies is half the battle. The other half is the boring operational stuff that silently wrecks filtered search and never appears in a "hello world" tutorial. Here's where the bodies are buried.

Index your filter fields, and do it BEFORE building the vector index. This one bites everyone once. Filter-aware edges can only be built if the engine knows which fields you'll filter on at build time. Add a metadata index after the vector graph is already built and the filter-aware edges don't magically appear, you have to rebuild the vector index to get them. Rebuilding a graph on a big collection is not a coffee break, it's a "come back after lunch" operation. Order of operations: define your filterable fields, index them, then ingest and build. Future you will send a thank-you card.

Cardinality changes everything, and the engine may swap strategies mid-flight. Good engines estimate how many points a filter will match and pick a strategy per query off that estimate: brute-force the tiny subsets, use the filterable graph for the big ones. Great when it works, but it means your latency profile can shift depending on the specific values a user sends. A filter matching 50 rows and one matching 500,000 rows may run completely different code paths under the hood. So benchmark across the whole range of selectivities you expect, not just one tidy representative query that makes your graphs look nice.
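
In spirit, that per-query decision is a branch on an estimate. The sketch below is a deliberately crude stand-in: the threshold value and the strategy names are invented for illustration, and real planners weigh more than one number.

```python
def choose_strategy(estimated_matches, brute_force_threshold=10_000):
    """Pick a code path from the planner's estimate of how many points pass the filter."""
    if estimated_matches <= brute_force_threshold:
        return "brute_force_scan"  # tiny subset: an exact scan is cheap
    return "filterable_graph"      # big subset: walk the filter-aware graph

# Same query shape, different filter values, different code paths.
for matches in (50, 5_000, 50_000, 500_000):
    print(matches, "->", choose_strategy(matches))
```

The point for benchmarking: somewhere in your selectivity range there is a cliff edge like that threshold, and your test set should include queries on both sides of it.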

Watch for correlated filters, the silent killer. The genuinely brutal case is when your filter is negatively correlated with vector distance, when the things that match your filter happen to sit far from the query in embedding space. Filter-aware traversal quietly assumes matching points are scattered reasonably near the query. When they're systematically not, even the clever two-hop tricks degrade. If your filtered searches are mysteriously bad and it's not a cardinality thing, ask whether your filter and your embeddings are pulling in opposite directions.

Over-fetching hides bugs. If you "solved" filtered search by cranking your candidate count to the moon, congratulations, you've buried the problem, not fixed it, and you're paying the latency tax on every single query forever. Every so often, check whether you still need that over-fetch or whether proper filter-aware indexing would let you quietly delete it.

A cheat sheet for which strategy you actually want

You usually won't hand-code these, the engine picks, but knowing which one you want tells you what to configure and what to test.

  • Loose filter (matches most of the data): post-filtering is fine. Go outside, touch grass, don't overthink it.
  • Tiny subset (a few thousand points): you want the brute-force fallback. Make sure your engine's threshold for triggering it is tuned so it actually fires.
  • Moderately selective filter over a large collection (the canyon, aka real life): filter-aware indexing, with your filter fields indexed up front. This is the setup most production RAG needs and the one people most reliably botch.
  • Multiple strict filters, or filters you can't predict: this is where query-time adaptive traversal (the two-hop family) earns its keep. Switch it on for these queries specifically. Don't pay its cost on everything.

The one-line takeaway

The database war everyone loves to fight matters less than the layer underneath it. Two engines with near-identical unfiltered benchmarks can behave like completely different products the second you add a WHERE clause, and your production queries basically always have a WHERE clause. So when you're evaluating anything for real retrieval work, filter it the way your app actually will, and measure both latency and recall against the true filtered nearest neighbors.
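
Measuring recall "against the true filtered nearest neighbors" means computing ground truth by brute force with the filter applied, then comparing the engine's answer to it. A minimal sketch, assuming Euclidean distance and small in-memory data; the function names and data layout are made up for illustration.

```python
import heapq
import math

def true_filtered_neighbors(query, vectors, metadata, predicate, k):
    """Exact ground truth: the k closest vectors among those whose metadata passes the filter."""
    matching = [i for i in vectors if predicate(metadata[i])]
    return heapq.nsmallest(k, matching, key=lambda i: math.dist(query, vectors[i]))

def recall_at_k(returned_ids, truth_ids):
    """Fraction of the true filtered neighbors that the engine actually returned."""
    if not truth_ids:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(returned_ids) & set(truth_ids)) / len(truth_ids)

# Toy data: id -> vector, id -> metadata (hypothetical).
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0)}
metadata = {0: "inactive", 1: "active", 2: "active", 3: "active"}

truth = true_filtered_neighbors((0.0, 0.0), vectors, metadata, lambda s: s == "active", k=2)
print(truth)                       # [1, 2]
print(recall_at_k([1, 3], truth))  # 0.5
```

Brute-force ground truth is slow on a big collection, so run it offline on a sample of real queries with their real filters, then keep those answers around as a regression set.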

Because here's the thing to tattoo somewhere visible:

Unfiltered benchmarks measure a query you will never actually run.


Fine print on the numbers and named techniques: filtered ANN is an active research area as of 2026, and the specific behaviors here (two-hop traversal, per-value subgraphs, cardinality-based planning) are implemented differently across engines and still evolving. Treat this as the conceptual map, then verify the exact behavior against your engine's current docs and, more importantly, against your own data. The selectivity numbers are illustrative of the pattern, not universal constants.
