🥽 Deep Dive: Understanding Contextual Recall 🎯 in RAG Systems

#llm #machinelearning #ai #rag

Definition & Calculation

Contextual Recall measures the semantic completeness of your retrieval system. It is calculated by first generating a list of claims or statements from the Expected Output (Ground Truth) of the RAG system.

An LLM (typically the same model used to generate the claims) then checks each statement in this list against the Actual Retrieved Chunks. The goal is to determine whether each statement generated from the expected output can be semantically supported by, or concluded from, the retrieved chunks.

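The metric is calculated as:

$$\text{Contextual Recall} = \frac{\text{Number of expected-output statements supported by the retrieved chunks}}{\text{Total number of statements in the expected output}}$$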

A perfect score implies that all statements in the expected output are supported by the retrieved context, regardless of whether a single chunk satisfies multiple statements or multiple chunks satisfy a single statement.
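The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `contextual_recall` and `keyword_judge` are hypothetical names, and `keyword_judge` is a deliberately crude stand-in for the LLM judge (it only checks word overlap, whereas a real evaluation would ask an LLM whether each statement is supported).

```python
from typing import Callable, List

# A judge decides whether one statement is supported by the retrieved chunks.
Judge = Callable[[str, List[str]], bool]


def contextual_recall(statements: List[str], chunks: List[str], is_supported: Judge) -> float:
    """Fraction of expected-output statements supported by the retrieved chunks."""
    if not statements:
        raise ValueError("at least one statement is required")
    supported = sum(1 for s in statements if is_supported(s, chunks))
    return supported / len(statements)


def keyword_judge(statement: str, chunks: List[str]) -> bool:
    """Toy stand-in for an LLM judge (assumption): a statement counts as
    supported if every one of its words appears somewhere in the chunks."""
    vocabulary = set(" ".join(chunks).lower().split())
    return all(word in vocabulary for word in statement.lower().split())


# Hypothetical statements extracted from an expected output.
statements = [
    "paris is the capital of france",
    "france uses the euro",
    "france borders spain",
]
# Hypothetical retrieved chunks; nothing here mentions borders.
chunks = [
    "Paris is the capital of France",
    "The euro is the currency France uses",
]

score = contextual_recall(statements, chunks, keyword_judge)
print(round(score, 3))  # 0.667 (two of three statements are supported)
```

Because the judge looks at all retrieved chunks together, it does not matter which chunk supports which statement, which matches the note above about single or multiple chunks satisfying a statement.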

The Problem with Typical Recall (Recall@K)

Contextual Recall addresses the limitations of typical recall metrics (like Recall@K), which rely on comparing a static list of Expected Document IDs against the Actual Retrieved IDs.

In typical recall, if a specific chunk ID from the expected list is missing from the retrieval results, the system is penalized, even if the chunks that were actually retrieved contain the same information semantically.
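To make the ID-matching behavior concrete, here is a minimal sketch of Recall@K. The function name `recall_at_k` and the chunk IDs are illustrative assumptions; the point is only that the metric compares identifiers and never looks at chunk content.

```python
from typing import List


def recall_at_k(expected_ids: List[str], retrieved_ids: List[str], k: int) -> float:
    """Share of expected chunk IDs that appear in the top-k retrieved IDs."""
    expected = set(expected_ids)
    if not expected:
        raise ValueError("at least one expected ID is required")
    hits = expected & set(retrieved_ids[:k])
    return len(hits) / len(expected)


# Hypothetical IDs: chunk_d carries the same information as chunk_c,
# but ID-based recall has no way to know that.
expected = ["chunk_a", "chunk_b", "chunk_c"]
retrieved = ["chunk_a", "chunk_b", "chunk_d"]

print(round(recall_at_k(expected, retrieved, k=3), 3))  # 0.667
```

The retrieval is penalized for the missing `chunk_c` even though, in this hypothetical setup, `chunk_d` would let the generator produce the same answer.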

This rigidity matters because of Vector Space Volatility:

The Scenario: Consider two semantically similar queries, User Query A and User Query B.
The Vector Space: When mapped into the vector database, their positions will differ slightly. This slight difference is significant and can greatly affect the ANN (Approximate Nearest Neighbor) results.

The Outcome: Query A might retrieve chunks {A, B, C}, while Query B might retrieve only {A, B} because its position shifted just enough to place Chunk C outside its nearest-neighbor range. In a typical recall metric, if Chunk C was "expected," Query B would be penalized. However, in Contextual Recall, if Chunks A and B provide sufficient information to answer the query (or if a different Chunk D provides the same info as Chunk C), the system is not penalized, resulting in a more accurate reflection of system performance.
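The scenario above can be reproduced with toy numbers. This sketch makes several simplifying assumptions: the vectors are hand-picked 2D values rather than real embeddings, the search is exact (brute-force cosine similarity) rather than ANN, and the "reach" of a query is modeled as a hypothetical similarity threshold of 0.85.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query: Vector, chunks: Dict[str, Vector], threshold: float) -> List[str]:
    """Return the names of all chunks whose similarity to the query meets the threshold."""
    return sorted(name for name, vec in chunks.items() if cosine(query, vec) >= threshold)


# Hand-picked toy vectors (assumption, not real embeddings).
chunks = {"A": (1.0, 0.0), "B": (0.9, 0.3), "C": (0.6, 0.8)}
query_a = (0.9, 0.5)
query_b = (1.0, 0.3)  # close to query_a, but shifted slightly

print(retrieve(query_a, chunks, threshold=0.85))  # ['A', 'B', 'C']
print(retrieve(query_b, chunks, threshold=0.85))  # ['A', 'B']
```

The two queries have a cosine similarity of roughly 0.98 to each other, yet Chunk C drops out for Query B because its similarity falls from about 0.91 to about 0.80.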

What It Measures: Diagnosis & Root Causes

Contextual Recall evaluates the alignment between your Indexing Strategy (Chunking + Embedding) and your Retrieval Strategy (Query Embedding). A low score typically implies one of the following system failures:

1. Poor Embedding Model Performance

The embedding model may perform poorly due to its training methodology or dimensionality:

Training Bias: If the model was trained on unrepresentative data or with a poorly suited objective, it may over-index on specific terms in a chunk, skewing the chunk's numeric representation.
Low Dimensionality: If the embedding dimensions are too low for the complexity of the data, the model lacks the granularity to represent the chunk's meaning accurately.

2. Domain Mismatch (Non-Specialized Models)

This occurs when a general-purpose embedding model is used for specialized domains (e.g., medical or legal) containing heavy jargon.

Example: If a user searches for "ADHD," the query implies related concepts like "hyperactivity," "impulsiveness," and "inattention."
The Failure: A general-purpose model might not recognize that a chunk describing "inattentive behavior" is closely related to "ADHD." Consequently, it fails to capture the relationship, places the chunk far from the query in the vector space, and degrades retrieval performance.

3. Chunking Strategy Failures

Improper chunking leads to improper positioning in the vector space, meaning the chunk's semantic value is misrepresented.

Under-Segmentation (Chunk Size Too Large): When a chunk is too large, it often includes multiple distinct claims or topics. The embedding model attempts to create a single vector that "balances" all these statements.
- The Consequence: The resulting vector might sit on the borderline between Topic A and Topic B. Even if the chunk contains significant information about Topic A, it ranks lower during retrieval because its position is diluted by Topic B. This "averaging effect" lowers the similarity score for specific queries.
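The "averaging effect" can be illustrated with toy vectors. This is a simplification under stated assumptions: real embedding models do not literally average topic vectors, and the 2D axes here are hypothetical stand-ins for "Topic A" and "Topic B". The sketch only shows how a vector pulled between two topics scores lower against a single-topic query.

```python
import math
from typing import Tuple

Vector = Tuple[float, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical topic directions (assumption): axis 0 is Topic A, axis 1 is Topic B.
topic_a = (1.0, 0.0)
topic_b = (0.0, 1.0)

# A focused chunk covers only Topic A; an oversized chunk mixes both topics,
# modeled here as the mean of the two topic vectors.
focused_chunk = topic_a
mixed_chunk = ((topic_a[0] + topic_b[0]) / 2, (topic_a[1] + topic_b[1]) / 2)

query_about_a = (1.0, 0.0)

print(round(cosine(query_about_a, focused_chunk), 3))  # 1.0
print(round(cosine(query_about_a, mixed_chunk), 3))    # 0.707
```

With a similarity cutoff or a small top-k, the drop from 1.0 to about 0.71 can be enough for the mixed chunk to lose its place to less relevant but more focused chunks.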

Over-Segmentation (Chunk Size Too Small): When a chunk is too small, it lacks sufficient context for the model to draw meaningful conclusions.
- The Consequence: The embedding model cannot accurately represent the information because the fragment is semantically incomplete, leading to a vector that effectively represents noise.
