Elise Tanaka

Filtered Vector Search: Five Techniques for Balancing Recall and Latency

When I first implemented vector search for an e-commerce platform, I assumed combining metadata filters with ANN queries would be straightforward. My naïveté vanished when users searching for "red shoes under $100" hit empty results or 10-second latencies. After repeated experiments and benchmarks on datasets of 10M+ vectors, I settled on five techniques for balancing recall against latency.

1. Graph Index Repair for Broken Connectivity

Standard graph indexes (HNSW, DiskANN) fail catastrophically under heavy filtering. Filtering out 90% of nodes can leave isolated data islands. Consider a product graph: eliminating a hub node destroys the paths between the items it connected. I measured recall dropping below 40% in such cases.

Solutions I tested:

  • Alpha Strategy: Probabilistically visiting filtered-out nodes (e.g., with 20% probability when 80% of nodes are filtered) preserved 85% recall at 30ms latency on 10M Cohere embeddings (768D). A sketch of this traversal follows the list.
  • Connection Reinforcement: Skipping edge pruning during indexing retained critical pathways. This added 15% memory overhead but maintained >90% recall at 50% filtering.
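
Here is a minimal sketch of the alpha strategy, assuming an in-memory adjacency dict (neighbors), a dict of vectors, and a passes_filter predicate. All of these names are illustrative placeholders, not a specific library's API; the point is that filtered-out nodes can still be traversed with probability alpha, while only matching nodes enter the result set.

import heapq
import random
import numpy as np

def alpha_greedy_search(neighbors, vectors, passes_filter, entry, query, alpha=0.3, k=10):
    # Greedy best-first traversal; filtered-out nodes are kept as "bridges"
    # with probability alpha so heavy filters don't disconnect the graph.
    dist = lambda node: float(np.linalg.norm(vectors[node] - query))
    visited = {entry}
    frontier = [(dist(entry), entry)]        # min-heap keyed by distance to the query
    results = []
    while frontier and len(results) < k:
        d, node = heapq.heappop(frontier)
        if passes_filter(node):
            results.append((d, node))        # only nodes matching the filter are returned
        for nbr in neighbors[node]:
            if nbr in visited:
                continue
            # Filtered-out neighbors are still expanded with probability alpha
            if passes_filter(nbr) or random.random() < alpha:
                visited.add(nbr)
                heapq.heappush(frontier, (dist(nbr), nbr))
    return sorted(results)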

When to avoid: when the filter leaves under ~1% of vectors, brute-force scanning the survivors outperforms graph traversal. Benchmark with an adaptive rule such as:

# Pseudocode: pick a strategy from the observed filtering ratio
# (filtering_ratio = fraction of vectors excluded by the filter)
def choose_search_strategy(filtering_ratio: float):
    if filtering_ratio > 0.95:
        # Almost everything is filtered out: scan the few survivors directly
        use_brute_force()
    elif filtering_ratio > 0.5:
        # Heavy filtering: probabilistically keep filtered-out nodes as bridges
        use_alpha_strategy(alpha=0.3)
    else:
        # Light filtering: the graph stays well connected
        use_standard_traversal()

2. Metadata-Aware Subgraphs

Conventional single-index architectures force irrelevant comparisons. Shoes priced at $50 have no semantic relationship to $50 belts. My solution: build column-specific subgraphs.

Implementation:

Base Graph (All Products)
│
├── Color Subgraphs
│   ├── Red
│   ├── Blue
│   └── Green
│
└── Price Subgraphs
    ├── $0-$50
    ├── $50-$100
    └── $100+

Searches for color=red traversed only the red subgraph, cutting traversal time by 63% versus filtering on the base graph. Memory overhead grows linearly with the number of unique metadata values, which is acceptable for low-cardinality fields (<1,000 distinct values).
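
A minimal sketch of the routing layer, assuming hypothetical index objects that each expose a search method (SubgraphRouter, sub_indexes, and the index interfaces are illustrative, not a specific vector DB's API):

class SubgraphRouter:
    def __init__(self, base_index, sub_indexes):
        self.base_index = base_index    # graph over all products
        self.sub_indexes = sub_indexes  # e.g. {("color", "red"): red_index, ("price", "0-50"): cheap_index}

    def search(self, query_vector, k, filters):
        # Route to a value-specific subgraph when one covers the predicate...
        for field, value in filters.items():
            sub_index = self.sub_indexes.get((field, value))
            if sub_index is not None:
                return sub_index.search(query_vector, k)
        # ...otherwise fall back to filtered search on the base graph.
        return self.base_index.search(query_vector, k, filters=filters)

With this layout, a color=red query never touches blue or green products, which is where the 63% traversal-time reduction came from.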

3. Iterative Batch Filtering

Complex metadata filters (e.g., over JSON arrays) create evaluation bottlenecks: evaluating the filter across 10M vectors required 8GB of RAM and 900ms of latency. Iterative filtering solved this:

  1. Retrieve top 200 vector candidates
  2. Apply metadata filters
  3. If results < required, fetch next 200
  4. Repeat until sufficient matches accumulate (a minimal sketch of this loop follows the benchmark results)

Benchmark results (1M vectors):

Method        Avg. Latency   Filter Eval Count
Filter-First  1200ms         1,000,000
Vector-First  350ms*         10,000*
Iterative     85ms           2,000

* Resulted in 40% recall due to over-filtering
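
A minimal sketch of the iterative loop, assuming a client that exposes a batch iterator with a next() method (an assumption modeled on the search iterator in section 4, not a guaranteed product API):

def iterative_filtered_search(index, query_vector, passes_filter, k, batch_size=200, max_batches=10):
    # Pull ANN candidates in small batches and filter them until k matches are found,
    # so the metadata filter is evaluated on thousands of rows instead of millions.
    matches = []
    iterator = index.search_iterator(query_vector, batch_size=batch_size)
    for _ in range(max_batches):
        batch = iterator.next()
        if not batch:                       # iterator exhausted before k matches
            break
        matches.extend(hit for hit in batch if passes_filter(hit))
        if len(matches) >= k:               # enough filtered results collected
            break
    return matches[:k]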

4. External Filtering Hybrids

When vectors and metadata live in separate systems (e.g., PostgreSQL plus a vector DB), shipping filtered ID lists between them becomes prohibitive. For datasets of 50M+ vectors, transferring the filtered IDs added 700ms of network overhead. My client-side solution:

def external_filter(hits: list[Hit]) -> list[Hit]:
    # Cached result of the SQL predicate; a set gives O(1) membership checks
    valid_ids = set(query_postgres("SELECT id FROM products WHERE price < 100"))
    return [hit for hit in hits if hit.id in valid_ids]

# Pull ANN results in batches and filter them client-side
search_iter = vector_db.search_iterator(
    data=query_vector,
    batch_size=500,
    filter_func=external_filter,
)

This reduced network payloads by 92% and enabled sub-100ms hybrid queries across distributed systems.

5. Auto-Tuning Index Selection

Balancing search_radius, filter_strategy, and batch_size by hand quickly becomes untenable. I developed rules for dynamic configuration:

Filter Ratio   Index Type    Search Radius
<10%           HNSW          Low (n=50)
10-75%         Hybrid        Medium (n=100)
>75%           Brute-Force   High (n=200)

Automating this via query statistics maintained 95% recall while adapting to shifting data distributions.
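
A minimal sketch of that rule table as code; the thresholds mirror the table above, while the return values (index names and radii) are illustrative placeholders rather than a specific engine's configuration keys:

def select_index_config(filter_ratio: float) -> tuple[str, int]:
    # Map the observed filter ratio to (index type, search radius) per the table above.
    if filter_ratio < 0.10:
        return "hnsw", 50           # light filtering: graph traversal, small radius
    if filter_ratio <= 0.75:
        return "hybrid", 100        # moderate filtering: hybrid strategy, medium radius
    return "brute_force", 200       # heavy filtering: brute-force over the survivors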

Deployment Tradeoffs

Hardware implications:

  • Memory-optimized nodes (r7gd AWS instances) for graph indexes
  • Compute-optimized for brute-force fallbacks
  • SSD storage is mandatory beyond 20M vectors

Consistency compromises:

  • Eventual consistency suffices for recommendation systems
  • Strong consistency required for transaction systems
  • Hybrid: Session consistency for user-facing searches

Next Exploration Targets
I'm investigating three underutilized techniques:

  1. GPU-Accelerated Filtering: Offloading JSON filters to NVIDIA RAPIDS
  2. Cost-Based Optimizers: Machine learning for adaptive strategy switching
  3. Materialized Metadata Views: Precomputing common filter combinations

Filtered vector search requires architectural compromises, not magic bullets. Each solution trades memory, latency, or recall against the others. What my benchmarks showed: pragmatic, multi-strategy approaches can support production workloads at <100ms P99 latency.
