Ken W Alger

Posted on May 27 • Originally published at kenwalger.com

Vector Search at Scale: Why Your Index Isn't as Healthy as You Think

#ai #vectorsearch #rag #architecture

Vector search has become load-bearing infrastructure in modern AI systems remarkably fast. A year or two ago, it was primarily a research curiosity and a niche tool for semantic search. Today it sits at the center of RAG pipelines, recommendation engines, multimodal retrieval systems, and a growing class of applications that reason over unstructured data.

The operational patterns haven't kept pace with the adoption.

Most teams that deploy vector search in production treat it the way they treated relational databases before they understood indexing: as infrastructure that works until it doesn't, with failure modes that aren't well understood until they've been encountered firsthand. The problems that emerge at scale — degraded recall, unpredictable latency, ghost results from deleted records — are preventable. But preventing them requires understanding how vector indices actually work, and what happens to them under continuous change.

This post is about that.

What Vector Search Is Actually Doing

Before getting into failure modes, it's worth being precise about what an ANN (Approximate Nearest Neighbor) index does and what tradeoffs it makes.

When you store a vector embedding in a vector database, you're storing a point in a high-dimensional space — a location in a space that might have 768, 1536, or more dimensions, depending on the embedding model. A vector search query asks: given a query vector, which stored vectors are closest to it in this space?

Exact nearest neighbor search compares the query against every stored vector. It is always correct, but its cost grows linearly with the dataset: at 10 million vectors, every query requires 10 million distance computations. ANN indices avoid this by building a data structure that lets the search skip most of the space and find approximately nearest neighbors with high probability.
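
For reference, exact search is nothing more than a full scan. Here is a minimal sketch in plain Python, assuming vectors live in a dict keyed by ID and are compared by squared Euclidean distance. The names `squared_l2` and `exact_search` are illustrative and not taken from any particular library:

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance; ranks neighbors the same way as true L2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_search(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Return the k closest (id, distance) pairs by scanning every vector."""
    scored = ((vec_id, squared_l2(query, vec)) for vec_id, vec in vectors.items())
    # One distance computation per stored vector: O(N * d) work per query.
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    store = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
    print(exact_search([0.9, 0.9], store, k=2))
```

The loop touches every stored vector on every query, which is exactly the cost an ANN index is built to avoid. A scan like this is still useful as the ground truth when measuring recall.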

The key word is approximately. ANN search trades a small amount of correctness (recall) for a large improvement in query speed. A well-tuned index might return, on average, 95% of the true 10 nearest neighbors, which is a recall@10 of 0.95. That 5% gap is acceptable in most applications. What's not acceptable is when the gap grows unexpectedly in production, silently, because the index was built for a different data distribution than the one it's currently serving.

Recall is not a constant. It's a property of the relationship between your index structure and your data distribution. When the data changes, recall changes with it.

The Three Failure Modes at Scale

1. Index Degradation Under Continuous Updates

The most widely deployed ANN algorithm family is HNSW — Hierarchical Navigable Small World graphs. HNSW builds a layered graph structure where nodes (vectors) are connected to their approximate neighbors. Search traverses this graph, navigating from coarse layers to fine layers, to find approximate nearest neighbors efficiently.

HNSW was designed primarily for static datasets. Build the index once on your full dataset, and it performs extremely well. The problem is that production datasets aren't static. New embeddings are added continuously — new documents, new products, new user profiles. Existing embeddings are updated as the underlying content changes. Old embeddings are deleted when records are removed.

Each of these operations degrades the graph in a different way:

Insertions add new nodes but can't retroactively optimize the connections of existing nodes for the new additions. Over time, the graph's navigability — its ability to efficiently route search queries toward the right region of the space — erodes.

Updates in most implementations are deletions followed by insertions. The deletion leaves a gap in the graph; the insertion adds a new node without full integration into the surrounding neighborhood structure. Repeated updates accumulate structural debt.

Deletions are the most insidious. Most HNSW implementations handle deletion by marking vectors as deleted (a "tombstone") rather than fully removing them from the graph structure. Tombstoned vectors continue to participate in graph traversal — they're visited during search but filtered from results. As tombstones accumulate, search traversal becomes progressively slower and recall degrades as the graph structure increasingly reflects deleted nodes rather than live ones.

The result is an index that was fast and accurate at build time and becomes progressively slower and less accurate in production. The degradation is gradual enough that it often isn't noticed until performance crosses an obvious threshold — at which point the fix (a full index rebuild) requires downtime or careful traffic management.
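
The cost of tombstones can be illustrated with a toy store. The sketch below is not HNSW: it uses a linear scan and a hypothetical `SoftDeleteStore` class, purely to show that soft-deleted vectors still consume search work until a rebuild removes them, and that the tombstone ratio is cheap to track:

```python
from typing import Dict, List, Sequence, Set, Tuple


class SoftDeleteStore:
    """Toy vector store with tombstone-style deletes (linear scan, not HNSW)."""

    def __init__(self) -> None:
        self.vectors: Dict[str, Sequence[float]] = {}
        self.tombstones: Set[str] = set()
        self.visited_last_query = 0

    def add(self, vec_id: str, vec: Sequence[float]) -> None:
        self.vectors[vec_id] = vec
        self.tombstones.discard(vec_id)

    def delete(self, vec_id: str) -> None:
        # Mark only: the vector stays in the structure until rebuild().
        if vec_id in self.vectors:
            self.tombstones.add(vec_id)

    def tombstone_ratio(self) -> float:
        return len(self.tombstones) / len(self.vectors) if self.vectors else 0.0

    def search(self, query: Sequence[float], k: int) -> List[Tuple[str, float]]:
        self.visited_last_query = 0
        scored = []
        for vec_id, vec in self.vectors.items():
            # Distance is computed for deleted vectors too; they are only
            # dropped afterwards, when results are assembled.
            dist = sum((q - x) ** 2 for q, x in zip(query, vec))
            self.visited_last_query += 1
            if vec_id not in self.tombstones:
                scored.append((vec_id, dist))
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]

    def rebuild(self) -> None:
        """Physically drop tombstoned vectors, as a full rebuild would."""
        self.vectors = {
            i: v for i, v in self.vectors.items() if i not in self.tombstones
        }
        self.tombstones.clear()
```

Tracking something like `tombstone_ratio()` and rebuilding past a chosen threshold is one way to turn gradual degradation into a scheduled maintenance task. The right threshold is workload-specific and would need to be measured.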

2. Recall Degradation at Scale

A second failure mode is subtler: recall that was acceptable at your initial dataset size becomes unacceptable as the dataset grows.

ANN indices have tuning parameters that control the tradeoff between recall and query speed. For HNSW, the key search-time parameter is ef (the size of the dynamic candidate list during search): higher ef means more candidates considered, higher recall, slower queries. Index construction parameters like M (the maximum number of connections per node) similarly affect the recall-latency tradeoff.

These parameters are typically tuned once, around index build time, against the dataset size and query distribution at that moment. As the dataset grows from 1M to 10M to 100M vectors, the same parameter values produce worse recall. The index structure that was sufficient for navigating 1M vectors may miss relevant results regularly at 100M, because the candidate list that was large enough to catch most true neighbors at small scale isn't large enough to sample the same proportion of the space at large scale. Of the two, ef can usually be raised at query time without a rebuild, while M is fixed once the graph is built.

This is a capacity planning problem as much as a technical one. Teams that tune their indices once and treat those parameters as permanent settings will encounter recall degradation as a silent, gradual production issue.
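
One way to treat these parameters as living settings is to re-run a small sweep whenever the dataset grows significantly. The sketch below assumes a hypothetical `measure_recall` callback that runs sampled queries at a given ef and returns estimated recall. The recall figures in the usage example are made up for illustration and are not benchmark results:

```python
from typing import Callable, Iterable, Optional


def smallest_ef_meeting_target(
    measure_recall: Callable[[int], float],
    ef_candidates: Iterable[int],
    target_recall: float = 0.95,
) -> Optional[int]:
    """Return the smallest candidate ef whose measured recall meets the target.

    measure_recall is a hypothetical callback: given ef, it runs sampled
    queries against the live index and returns estimated recall in [0, 1].
    """
    for ef in sorted(ef_candidates):
        if measure_recall(ef) >= target_recall:
            return ef
    # No candidate was sufficient; a rebuild with different construction
    # parameters may be needed instead.
    return None


if __name__ == "__main__":
    # Made-up measurements for illustration only.
    fake_recall = {32: 0.88, 64: 0.93, 128: 0.96, 256: 0.98}
    print(smallest_ef_meeting_target(fake_recall.__getitem__, fake_recall))
```

Choosing the smallest ef that meets the target keeps latency as low as the recall requirement allows. A `None` result is the signal that search-time tuning alone is no longer enough.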

3. Distribution Shift Between Embedding Model Updates

A third failure mode occurs when the embedding model itself changes.

Embeddings are not portable across model versions. A vector produced by text-embedding-ada-002 exists in a completely different geometric space than a vector produced by text-embedding-3-large. Even minor version updates to the same embedding model can shift the geometry of the embedding space enough to invalidate an existing index.

When teams update their embedding model (to gain quality improvements, reduce cost, or switch providers), they face a migration problem: the stored vectors must be recomputed using the new model, and the index must be rebuilt from scratch against the new embeddings. Old and new vectors cannot share an index, so the index itself has no incremental upgrade path.

This migration is expensive at scale: recomputing embeddings for millions of records requires significant compute and elapsed time. During the migration window, the system is either serving results from a stale index (old embeddings, old model) or managing a complex dual-index serving strategy that returns results from both indices during the transition.

Teams that haven't planned for embedding model migration tend to discover the problem when they want to upgrade and realize they've built a dependency that makes upgrading very expensive.

Architectural Responses

Segment-Based Indexing

The most operationally mature response to continuous update problems is a segment-based architecture, modeled on how LSM-tree databases (like RocksDB and Cassandra) handle write-heavy workloads.

Instead of a single monolithic index, the vector store maintains multiple index segments:

Hot segments: Small, recently built segments containing new vectors. Quick to rebuild when they become stale.
Warm segments: Medium-aged segments, rebuilt periodically as updates accumulate.
Cold segments: Large, stable segments containing vectors that haven't changed recently. Rarely rebuilt.

New vectors land in a hot segment. Query execution searches across all segments and merges results. Background compaction merges smaller segments into larger ones, rebuilding and re-optimizing the graph structure in the process.

New Vectors ──► Hot Segment (small, fresh, fast rebuild)
                     │
              [compaction]
                     ▼
              Warm Segment (medium, periodic rebuild)
                     │
              [compaction]
                     ▼
              Cold Segment (large, stable, infrequent rebuild)

Query ──► Search All Segments ──► Merge Results ──► Return Top-K

This architecture has several advantages over a monolithic index:

Deletions and updates only invalidate the segment containing the affected vector, not the entire index
Hot segments are small enough to rebuild quickly, which limits the freshness penalty
Cold segments are stable enough to amortize the rebuild cost over long periods
The system can continue serving queries during segment rebuilds, because other segments remain available

The tradeoff is query complexity: searching multiple segments and merging results is more complex than searching a single index, and the merge step adds latency. The practical overhead is usually acceptable, but it requires explicit design.
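
The query path over segments can be sketched in a few lines. This is an illustration under stated assumptions rather than any database's actual implementation: each segment is modeled as a callable returning its local top-k as `(distance, id)` pairs, segments are passed newest first so the freshest copy of an updated vector wins, and `make_segment` is a hypothetical brute-force stand-in for a real per-segment ANN index:

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

Hit = Tuple[float, str]  # (distance, vector_id); smaller distance is better
SegmentSearch = Callable[[List[float], int], List[Hit]]


def make_segment(vectors: Dict[str, List[float]]) -> SegmentSearch:
    """Build a brute-force segment; a real one would wrap an ANN index."""

    def search(query: List[float], k: int) -> List[Hit]:
        hits = [
            (sum((q - x) ** 2 for q, x in zip(query, vec)), vec_id)
            for vec_id, vec in vectors.items()
        ]
        return heapq.nsmallest(k, hits)

    return search


def search_all_segments(
    segments: Sequence[SegmentSearch], query: List[float], k: int
) -> List[Hit]:
    """Query each segment for its local top-k, then merge to a global top-k."""
    seen = set()
    merged: List[Hit] = []
    for search in segments:  # assumed ordered newest first
        for dist, vec_id in search(query, k):
            if vec_id not in seen:  # ignore stale copies in older segments
                seen.add(vec_id)
                merged.append((dist, vec_id))
    return heapq.nsmallest(k, merged)


if __name__ == "__main__":
    hot = make_segment({"d": [0.1], "a": [5.0]})  # "a" was recently updated
    cold = make_segment({"a": [0.0], "b": [1.0], "c": [2.0]})
    print(search_all_segments([hot, cold], [0.0], k=2))
```

The merge step is cheap relative to the per-segment searches, but the number of segments multiplies the search work. That is the motivation for background compaction.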

Recall Monitoring as a Production Metric

The most important operational practice for vector search is one most teams skip: tracking recall as a runtime metric.

In offline evaluation, recall is a benchmark number computed against a ground-truth test set. In production, it's harder to measure — you don't always know the true nearest neighbors for live queries. But proxies are achievable:

Periodic ground-truth sampling: Run exact search (brute-force) on a sample of production queries and compare results to ANN results. The fraction of true nearest neighbors returned by ANN is your recall estimate.

Result set stability: If the same query returns significantly different results across consecutive executions with the same index, the index has structural inconsistencies worth investigating.

Latency as a leading indicator: For HNSW specifically, increasing query latency often precedes recall degradation as the graph becomes harder to navigate. A latency trend that diverges from query volume trend is worth investigating before recall drops.

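import random

def estimate_recall(index, exact_search, query_vectors, k=10, sample_size=100):
    """Estimate recall@k by comparing ANN results with brute-force results."""
    if not query_vectors:
        raise ValueError("query_vectors must not be empty")
    # Sample so the brute-force pass stays affordable; cap at the population size.
    sample = random.sample(query_vectors, min(sample_size, len(query_vectors)))
    recall_scores = []

    for query in sample:
        ann_results = index.search(query, k=k)
        exact_results = exact_search(query, k=k)  # brute force

        true_neighbors = set(exact_results.ids)
        ann_neighbors = set(ann_results.ids)
        # Divide by the true-neighbor count, which is below k on tiny datasets.
        recall = len(true_neighbors & ann_neighbors) / max(len(true_neighbors), 1)
        recall_scores.append(recall)

    return sum(recall_scores) / len(recall_scores)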

This is expensive to run continuously at full scale, which is why sampling is essential. But running it on a schedule — hourly, or triggered by index update volume thresholds — gives you early warning before recall degradation becomes user-visible.
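
The scheduling logic itself can stay small. The following sketch assumes a hypothetical `RecallCheckTrigger` that the write path notifies on each update and that a background job polls. The default thresholds are placeholders, not recommendations:

```python
import time
from typing import Callable


class RecallCheckTrigger:
    """Decides when to run a sampled recall check (thresholds are placeholders)."""

    def __init__(
        self,
        max_updates: int = 100_000,
        max_age_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_updates = max_updates
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._updates_since_check = 0
        self._last_check = clock()

    def record_updates(self, count: int = 1) -> None:
        """Call from the write path for each insert, update, or delete."""
        self._updates_since_check += count

    def should_run(self) -> bool:
        """True once enough updates or enough time has accumulated."""
        too_many_updates = self._updates_since_check >= self.max_updates
        too_old = self._clock() - self._last_check >= self.max_age_seconds
        return too_many_updates or too_old

    def mark_ran(self) -> None:
        """Reset both counters after a recall check completes."""
        self._updates_since_check = 0
        self._last_check = self._clock()
```

Combining a time limit with an update-count limit means a quiet index still gets checked periodically, while a burst of writes triggers a check early.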

Pre-filtering vs. Post-filtering for Hybrid Search

Production vector search is almost never pure semantic similarity. Real workloads apply metadata filters on top of vector search: most similar items in stock, most relevant documents in a user's language, most related customers above a revenue threshold.

There are three architectural patterns for combining metadata filtering with ANN search, each with different performance and correctness profiles:

Post-filtering: Run ANN search broadly across all vectors, then apply the metadata filter to the results. Simple to implement, but wasteful — if the filter is highly selective (only 1% of vectors pass), you'll need to retrieve far more than K candidates from ANN to end up with K results after filtering. Recall can collapse under selective filters.

Pre-filtering: Apply the metadata filter first to get a candidate set, then run exact or approximate search within that set. More correct under selective filters, but the candidate set must be small enough for efficient search. On large datasets, even a fairly selective filter can leave millions of vectors to materialize and search.

In-graph filtering: Build filter awareness into the index structure itself, so the graph traversal respects filter constraints without a separate pre- or post-filter step. More complex to implement, but avoids the recall collapse of post-filtering and the candidate materialization cost of pre-filtering. This is the approach emerging in more mature vector database implementations.

The right choice depends on your query distribution — specifically, how selective your filters are on average. If most queries filter to a large fraction of the dataset, post-filtering works well. If queries are frequently highly selective, you need in-graph filtering or a carefully designed pre-filtering strategy.
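
The difference between the first two patterns shows up clearly on a toy dataset. In this sketch, `top_k` is an exact-search stand-in for an ANN call, and the `overfetch` multiplier and record layout are assumptions made for illustration. In-graph filtering is omitted because it depends on index internals:

```python
import heapq
from typing import Callable, Dict, List, Tuple

Record = Tuple[List[float], dict]  # (vector, metadata)


def top_k(query: List[float], records: Dict[str, Record], k: int) -> List[str]:
    """Stand-in for an ANN call: IDs of the k closest records, nearest first."""

    def distance(rec_id: str) -> float:
        return sum((q - x) ** 2 for q, x in zip(query, records[rec_id][0]))

    return heapq.nsmallest(k, records, key=distance)


def post_filter_search(
    query: List[float],
    records: Dict[str, Record],
    predicate: Callable[[dict], bool],
    k: int,
    overfetch: int = 4,
) -> List[str]:
    """Search everything for k * overfetch candidates, then apply the filter."""
    candidates = top_k(query, records, k * overfetch)
    return [rec_id for rec_id in candidates if predicate(records[rec_id][1])][:k]


def pre_filter_search(
    query: List[float],
    records: Dict[str, Record],
    predicate: Callable[[dict], bool],
    k: int,
) -> List[str]:
    """Apply the filter first, then search only the records that pass."""
    allowed = {i: rec for i, rec in records.items() if predicate(rec[1])}
    return top_k(query, allowed, k)


if __name__ == "__main__":
    # 100 one-dimensional records; only the 10 farthest from the query pass.
    data = {str(i): ([float(i)], {"in_stock": i >= 90}) for i in range(100)}
    in_stock = lambda meta: meta["in_stock"]
    print("post:", post_filter_search([0.0], data, in_stock, k=5))
    print("pre: ", pre_filter_search([0.0], data, in_stock, k=5))
```

With a filter that passes only 10% of records, and those records far from the query, post-filtering with a 4x over-fetch returns nothing while pre-filtering returns five valid results. Raising `overfetch` helps, but the multiplier needed grows as the filter becomes more selective.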

This is a decision worth validating against your actual query distribution, not just the average case.

Embedding Model Migration: Planning for the Inevitable

Given that embedding model migration is expensive, the right time to plan for it is before you need it — during the initial architecture design.

A few practices that make migration significantly less painful:

Decouple embedding model version from index version. Maintain metadata alongside each stored vector that records which embedding model version produced it. This makes it possible to identify which records need recomputation during a migration and to validate that the new embeddings are consistent.

Build a recomputation pipeline from the start. The pipeline that computes embeddings for new records can also recompute embeddings for existing records. Building and testing this pipeline early means it's ready when you need it for a migration, rather than being built under time pressure.

Design for dual-index serving. A serving layer that can query two indices simultaneously — returning results from the new index where available and the old index for records not yet migrated — allows you to migrate incrementally rather than all-at-once. This is more complex to operate but dramatically reduces migration risk.

Test recall before committing to a new model. Before migrating production traffic to a new embedding model, build a test index on a representative sample of your data and measure recall against production queries. Embedding model quality improvements in benchmarks don't always translate to your specific domain and query distribution.
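
The first two practices can be sketched together. This example assumes each stored record keeps its source text alongside the vector and a `model_version` tag. `StoredVector`, `stale_batches`, `reembed_batch`, and the `embed` callback are hypothetical names for illustration, not part of any vector database API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass
class StoredVector:
    record_id: str
    text: str  # source content, kept so the record can be re-embedded
    vector: List[float]
    model_version: str  # which embedding model produced `vector`


def stale_batches(
    store: Dict[str, StoredVector], target_version: str, batch_size: int = 1000
) -> Iterator[List[StoredVector]]:
    """Yield batches of records not yet embedded with target_version."""
    batch: List[StoredVector] = []
    for record in store.values():
        if record.model_version != target_version:
            batch.append(record)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch


def reembed_batch(
    batch: List[StoredVector],
    embed: Callable[[str], List[float]],
    target_version: str,
) -> List[StoredVector]:
    """Return new records for the new index; the originals stay untouched."""
    return [
        StoredVector(r.record_id, r.text, embed(r.text), target_version)
        for r in batch
    ]
```

Returning new records instead of mutating the old ones keeps the old index servable while the new one fills up, which is what dual-index serving relies on.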

A Framework for Vector Search Operations

Before deploying vector search at scale — or before scaling a deployment that's already in production — validate against these questions:

On index architecture:

Do you have a plan for managing index degradation under continuous updates?
Is your architecture segment-based, or does it rely on periodic full rebuilds?
How do you handle the rebuild window without serving degraded results?

On monitoring:

Is recall tracked as a production metric, even via sampling?
Is latency per query monitored separately from overall system latency?
Do you have alerts for tombstone accumulation or index staleness?

On filtering:

Have you validated your filtering strategy against your actual query distribution?
Have you measured recall under your most selective filter combinations?

On embedding model management:

Are stored vectors tagged with the model version that produced them?
Do you have a recomputation pipeline for existing records?
Have you designed for dual-index serving during migrations?

Vector search infrastructure that's designed to answer these questions proactively is infrastructure that survives scale. Infrastructure that discovers the answers through production incidents is infrastructure that creates painful operational lessons.

In the final post, we pull all three pillars together and look at what it actually means to operate a real-time AI system at scale — latency budgets, observability, and knowing when your system is broken before your users tell you.