DEV Community

Akshay Rajinikanth

When Similarity Search Breaks: Why RAG Fails on Numerical Queries

I was building a chatbot using Retrieval-Augmented Generation (RAG) over a semi-structured insurance database. The system answered questions about policies, coverage, reviews, and claims history. During testing, I asked: “What health problems commonly result in insurance claims over $1,000?” The chatbot confidently listed diabetes complications ($847), minor surgeries ($654), and preventive care ($423), all below the requested threshold.

This was not an isolated mistake. To investigate, I created a minimal test case using 20 smartphones and built a basic RAG system. When asked “Phones released in 2024,” the system returned iPhone SE (2022), Nothing Phone (2) (2023), and Samsung Galaxy A54 (2023). The retrieved context looked relevant, yet it violated the numeric constraint.

If you build RAG applications, you’ve likely encountered this pattern: semantic questions work reliably, while queries involving numbers, dates, or quantities fail systematically. The issue is not the language model’s reasoning; it is the retrieval step. Embedding models encode numbers as tokens rather than ordered values, and vector search retrieves semantically similar documents instead of numerically valid ones. In embedding space, “$499” and “$999” can appear close despite representing very different quantities.

This behavior has practical implications. Systems used in finance, healthcare, compliance, or analytics dashboards often rely on thresholds and filters; retrieving the wrong evidence can produce confident but incorrect conclusions. The failure stems from similarity search optimizing semantic closeness rather than respecting structured constraints. In this article, I will examine why this happens and how to address it in practical RAG pipelines.

Why Embeddings Struggle with Hard Constraints

This failure is not a bug but a consequence of how similarity is computed in embedding space.

When text is embedded, the transformer maps it to a point in a high-dimensional vector space, positioning semantically similar chunks close together. Retrieval is then performed by ranking chunks using cosine similarity. In my project, the retrieved recommendations were topically relevant to the query, but similarity was dominated by shared product descriptions rather than the release year, producing results with correct context but incorrect dates. The model learns that “$999” and “$499” are related as prices, yet it does not encode their numerical relationship.

A simple observation explains this behavior: embedding models capture semantic distinctions such as “cheap” vs. “expensive,” categorical groupings like “budget” vs. “flagship,” and even recognize that “$499” and “$999” represent prices, but they do not represent ordered relationships such as magnitude ($299 < $500 < $799 < $999), temporal sequence (2022 < 2023 < 2024), or quantitative comparison (64GB < 128GB < 256GB).
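Embedding models do not literally compare strings, but a character-level similarity gives a rough intuition for why treating numbers as tokens misleads retrieval. In the sketch below (standard library only, purely illustrative), “$999” looks *more* similar to “$499” than “$501” does, even though $501 is numerically far closer:

```python
from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    """Character-overlap similarity — a crude stand-in for token-level closeness."""
    return SequenceMatcher(None, a, b).ratio()

# "$999" shares the characters "$" and "99" with "$499"; "$501" shares only "$".
print(round(string_sim("$499", "$999"), 2))  # 0.75
print(round(string_sim("$499", "$501"), 2))  # 0.25
```

A learned embedding is far more sophisticated than character overlap, but it inherits the same blind spot: similarity is computed over representations of the text, not over the quantities the text denotes.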

On a traditional number line, ordering is explicit: values have fixed relative positions.

Figure: RAG embedding space vs. ordered quantity comparison

The key point is that the model isn’t making a mistake; the search space itself is wrong for constraint-based queries. In my insurance project I tried prompt engineering, few-shot examples, switching models, even tuning temperature, but nothing improved the results. By the time the language model received the retrieved documents, the incorrect candidates had already been selected. The generation step was operating on flawed evidence. In practice, I was effectively asking the model to choose phones under $500 from a candidate set containing an iPhone 15 Pro ($999) and a Samsung S24 ($799). The failure occurred before reasoning began.

Hybrid Retrieval: Separating Constraints from Semantic Search

The failure reveals a design mismatch: semantic similarity and logical constraints are different problems. Vector search excels at understanding meaning, ranking relevance, and capturing semantics, but it does not enforce logical constraints, numerical comparisons, or exact number matching.

Instead of forcing embeddings and similarity search to handle both, we separate responsibilities: embeddings determine relevance, while structured filters enforce validity.

Modern vector databases support metadata filtering alongside vector search capabilities. Structured constraints are applied first to reduce the candidate set, after which semantic search is applied to find relevant results. This prevents numerically invalid documents from ever entering the retrieval stage and improves both accuracy and latency.

Let us walk through the implementation in three simple steps:

Step 1: Extract the metadata

```python
import re
from langchain_core.documents import Document

def extract_metadata(text: str) -> dict:
    """Parse structured fields from semi-structured text."""
    price = re.search(r'\$(\d+)', text)
    year = re.search(r'(\d{4})-', text)
    category = next((c for c in ['budget', 'flagship', 'premium']
                     if c in text.lower()), None)
    return {
        'price': float(price.group(1)) if price else None,  # guard against missing fields
        'release_year': int(year.group(1)) if year else None,
        'category': category
    }

# Store with both embedding and metadata
doc = Document(
    page_content=content,
    metadata=extract_metadata(content)  # ← Indexed separately
)
```

The metadata extraction can be implemented using regex (a rule-based method), which is fast and reliable on semi-structured data. Named entity recognition models, LLMs, or a hybrid of both can also be leveraged for purely unstructured data, at the cost of latency and occasional formatting errors.
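To make the rule-based path concrete, here is a minimal sketch of the regex extraction applied to one record. The record format (`name | $price | YYYY-MM release | category`) is an assumption for illustration, not the dataset used in the article:

```python
import re

# Hypothetical record format: "name | $price | released YYYY-MM | category"
record = "Samsung Galaxy A54 | $449 | released 2023-03 | budget phone"

metadata = {
    'price': float(re.search(r'\$(\d+)', record).group(1)),
    'release_year': int(re.search(r'(\d{4})-', record).group(1)),
    'category': next(c for c in ['budget', 'flagship', 'premium']
                     if c in record.lower()),
}
print(metadata)  # {'price': 449.0, 'release_year': 2023, 'category': 'budget'}
```

The `(\d{4})-` pattern relies on a `YYYY-MM` date appearing in the text; a different record layout would need its own patterns, which is exactly why regex extraction works best on consistently formatted, semi-structured data.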

Step 2: Parsing constraints from queries

```python
import re

def parse_constraints(query: str) -> dict:
    """
    Convert natural language constraints into database filters.

    Examples:
        "phones under $500"  -> {'price': {'$lte': 500}}
        "phones in 2024"     -> {'release_year': 2024}
    """
    filters = {}

    # Price: under / below / less than
    if match := re.search(r'(?:under|below|less than)\s*\$(\d+)', query, re.I):
        filters['price'] = {'$lte': int(match.group(1))}

    # Price: over / above / more than
    if match := re.search(r'(?:over|above|more than)\s*\$(\d+)', query, re.I):
        filters['price'] = {'$gte': int(match.group(1))}

    # Year constraint
    if match := re.search(r'(?:in|from)\s*(\d{4})', query):
        filters['release_year'] = int(match.group(1))

    return filters
```

Here the query is converted into structured filters; in the example above, constraint values are parsed with regex. In production systems this step may also be implemented using tool calling, allowing the model to output structured parameters instead of raw text.
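The parser above does not yet handle range queries like “between $700 and $900”, which appear later in the evaluation. A sketch of that extension is below; note that the combined `{'$gte': ..., '$lte': ...}` form is Mongo-style shorthand, and some vector stores (Chroma among them) expect the two bounds combined under an explicit `$and` instead:

```python
import re

def parse_range(query: str) -> dict:
    """Extend constraint parsing with 'between $X and $Y' ranges (sketch)."""
    filters = {}
    if m := re.search(r'between\s*\$(\d+)\s*and\s*\$(\d+)', query, re.I):
        lo, hi = int(m.group(1)), int(m.group(2))
        # Mongo-style combined bounds; rewrite as
        # {'$and': [{'price': {'$gte': lo}}, {'price': {'$lte': hi}}]}
        # for stores that require one operator per field.
        filters['price'] = {'$gte': lo, '$lte': hi}
    return filters

print(parse_range("Flagship phones between $700 and $900"))
# {'price': {'$gte': 700, '$lte': 900}}
```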

Step 3: Applying the filters before searching

```python
def search(query: str, k: int = 5):
    """
    Apply structured filters first, then rank remaining results by semantic similarity.
    """
    filters = parse_constraints(query)

    return vectorstore.similarity_search(
        query,
        k=k,
        filter=filters or None  # ChromaDB applies metadata filtering BEFORE vector search;
                                # pass None when no constraints were parsed
    )
```

The vector database first narrows the candidate set using metadata constraints, and only then performs semantic ranking on the filtered results. As a result, the language model receives only valid evidence, eliminating the earlier failure mode where reasoning operated on incorrect context.
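The filter-then-rank ordering can be made concrete without a vector database at all. The sketch below uses a toy in-memory corpus with made-up similarity scores (all names and numbers are hypothetical) purely to show the sequencing:

```python
# Toy corpus with precomputed, hypothetical similarity scores to a query
docs = [
    {'name': 'iPhone 15 Pro',      'price': 999, 'score': 0.92},
    {'name': 'Samsung Galaxy S24', 'price': 799, 'score': 0.90},
    {'name': 'Pixel 7a',           'price': 499, 'score': 0.85},
    {'name': 'Moto G Power',       'price': 299, 'score': 0.70},
]

def filtered_search(docs, max_price, k=2):
    """Filter-then-rank: drop invalid candidates, then rank by similarity."""
    valid = [d for d in docs if d['price'] <= max_price]              # structured filter first
    return sorted(valid, key=lambda d: d['score'], reverse=True)[:k]  # then semantic ranking

print([d['name'] for d in filtered_search(docs, max_price=500)])
# ['Pixel 7a', 'Moto G Power']
```

With pure similarity ranking, the two most "relevant" phones would have been the $999 and $799 flagships; filtering first means they never reach the ranking step.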

Evaluation: Constraint Satisfaction Before vs After Filtering

To evaluate the behavior, I used a synthetic semi-structured dataset of 20 smartphones and queried the system using top-k retrieval. A result was marked correct only if all returned items satisfied the numeric constraint. Both systems used the same embedding model, LLM, and prompts; only the retrieval strategy was modified.

| Query | Basic RAG | Metadata RAG | Improvement |
| --- | --- | --- | --- |
| Show me phones under $500 | 60% | 100% | +40% |
| Phones released in 2024 | 0% | 100% | +100% |
| Budget phones under $400 | 20% | 100% | +80% |
| Flagship phones between $700 and $900 | 20% | 100% | +80% |

The pattern is consistent: the table shows that basic RAG significantly underperforms on constraint-based queries, often returning correct answers only by chance. After adding metadata filtering, every query satisfies its numerical conditions, demonstrating that separating structured constraints from semantic retrieval turns RAG into a reliable system for quantitative questions.

In the implementation we did not change the embedding model, LLM, or the prompts. We just modified the retrieval objective.
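The all-items-must-satisfy criterion used above can be expressed as a small helper. This is a sketch of the scoring logic as I understand it from the methodology; the two retrieval outputs below are hypothetical examples, not the actual test data:

```python
def constraint_satisfied(results, predicate) -> bool:
    """Strict criterion: every retrieved item must satisfy the constraint."""
    return all(predicate(r) for r in results)

# e.g. "phones under $500" against two hypothetical retrieval outputs
basic    = [{'price': 499}, {'price': 999}]   # basic RAG leaks an invalid item
filtered = [{'price': 499}, {'price': 299}]   # metadata RAG stays in range

print(constraint_satisfied(basic,    lambda r: r['price'] < 500))   # False
print(constraint_satisfied(filtered, lambda r: r['price'] < 500))   # True
```

The strictness matters: a top-5 result with four valid items and one $999 outlier still counts as a failure, which is why the basic system's percentages are so low.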

Output: basic RAG

The above is the output of the basic RAG code. The standard semantic RAG retrieves contextually relevant smartphones but ignores the numerical constraint. Results include items outside the requested range because the embedding search treats “$500” as context, not a hard filter.

Output: metadata-aware RAG

The above is the output of metadata-aware RAG. The system first applies structured filters (price, year, category) and only then performs semantic ranking. Every returned result now satisfies the constraint, showing correct retrieval instead of approximate semantic matches.

Separating constraints from semantic retrieval enables:

  1. Composable multi-constraint queries
  2. Scalable filtering on large datasets
  3. Deterministic enforcement of business rules

Notably, the embedding model, language model, and prompts remained unchanged; only the retrieval objective was modified. The performance gain comes from correcting evidence selection rather than improving reasoning.

Other Ways to Handle Constraints in RAG

Metadata filtering is not the only strategy to handle numerical queries in RAG, but it differs from others in an important way: it enforces constraints before retrieval rather than correcting them afterward.

A common workaround is post-retrieval filtering: retrieve a larger candidate set (for example, top 20 or top 50) using pure vector search, then ask the LLM to remove invalid results. This helps, but it wastes retrieval budget on irrelevant documents, and the model still misjudges boundaries (“around $500” vs “under $500”). The behavior remains probabilistic rather than reliable.

Another approach is query rewriting. For example, transforming “phones under $500” into phrases like “cheap affordable budget low-cost phones below 500 dollars.” This shifts similarity toward cheaper items, yet semantic closeness is not equivalent to numerical correctness, and high-priced phones often remain in the candidate set.

You can also apply a programmatic filter after retrieval:
[r for r in results if r.metadata['price'] < 500]

This removes invalid outputs, but it requires retrieving several times more documents than necessary. When valid items are rare, the system may still fail simply because the correct documents were never retrieved.
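The recall problem with retrieve-then-filter can be illustrated with a toy simulation (hypothetical prices and scores): when the few valid documents rank low on pure semantic similarity, the top-k window never contains them, and post-filtering returns nothing, while filtering first keeps them all:

```python
# 20 hypothetical docs: only two satisfy price < 500, and both rank low on
# pure semantic similarity because similarity ignores the constraint.
docs = [{'price': 900 - i * 10, 'score': 0.9 - i * 0.01} for i in range(18)]
docs += [{'price': 450, 'score': 0.60}, {'price': 300, 'score': 0.55}]

top10 = sorted(docs, key=lambda d: d['score'], reverse=True)[:10]
post_filtered = [d for d in top10 if d['price'] < 500]   # filter AFTER retrieval

pre_filtered = [d for d in docs if d['price'] < 500]     # filter BEFORE ranking

print(len(post_filtered), len(pre_filtered))  # 0 2
```

No amount of widening k short of retrieving the entire corpus guarantees the valid items appear, which is the structural weakness of correcting retrieval after the fact.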

Hybrid dense-sparse search (BM25 combined with vectors) helps with exact keyword matching, yet numbers remain tokens rather than ordered quantities: “500” and “999” are matched lexically, not numerically.

Finally, embeddings can be fine-tuned on numerical data so the model learns ordering relationships. However, this increases training cost and still produces probabilistic improvements rather than guaranteed constraint satisfaction.

Why metadata filtering wins

Metadata filtering changes the retrieval objective itself. Instead of encouraging the model to respect numerical constraints, it enforces them before semantic ranking occurs. Modern vector databases support structured filtering alongside similarity search because applying constraints first reduces the candidate space and prevents invalid evidence from entering the reasoning stage.

In other words, most alternatives attempt to persuade the model to behave correctly, whereas metadata filtering ensures the system operates only on valid inputs.

The Real Problem Was Retrieval

When my insurance chatbot answered “claims over $1,000” with results under $1,000, I initially treated it as a prompting problem. I tried clearer instructions, added examples, and switched models; nothing changed.
The issue was not the language model but the retrieval stage. The system never retrieved numerically valid documents, so the model reasoned correctly over incorrect evidence. Embeddings place $499 and $999 near each other as “prices,” not as ordered quantities.

The resolution was architectural rather than model-based: metadata enforces constraints, while vector search determines relevance. Once those responsibilities were separated, the system became predictable instead of approximate.

The broader lesson extends beyond RAG. When building AI systems, improving reasoning often means improving evidence selection. Many apparent model failures are retrieval failures in disguise, and reliability comes from structuring the search space, not from making the model larger.

Any AI system that retrieves probabilistic context for deterministic requirements will appear to “reason badly” even when the reasoning is correct.
