Why Are Retrieval Strategies Important?
In the first six articles, we covered document chunking, embedding generation, and vector storage. Now suppose a user asks: "What are the best practices for Python asynchronous programming?"
Your vector database holds 100,000 documents. The naive approach: run a similarity search and return the Top-K most similar documents.
But problems arise:
- Problem 1: Result duplication. The 5 returned articles might all talk about asyncio, with none covering aiohttp or practical pitfalls.
- Problem 2: Low-quality results mixed in. The 5th article might be semantically somewhat related, but actually discusses Go's concurrency model — useless for Python users.
- Problem 3: Queries with explicit conditions. The user asked for "articles about Python in 2024", but pure vector retrieval completely ignores the "2024" time constraint.
This article compares 4 retrieval strategies to help you solve these problems.
Four Retrieval Strategies at a Glance
| Strategy | Core Idea | Problem Solved | Best For |
|---|---|---|---|
| Similarity Search | Sort by vector similarity | Basic retrieval | General use |
| MMR | Balance relevance & diversity | Result duplication | Multi-angle answers needed |
| Threshold Filtering | Only keep high-similarity results | Low-quality mixing in | Quality over quantity |
| Self-Query | Parse query to generate filters | Explicit conditions in query | Time/category constraints |
Experiment Environment
We use 10 technical blog articles as test data, each with metadata (year, category, tags); the full source code is linked at the end of the article:
```json
[
  {"title": "Python Async Programming: From asyncio to aiohttp", "year": 2024, "category": "Backend"},
  {"title": "2024 Python Performance Optimization Guide", "year": 2024, "category": "Backend"},
  {"title": "JavaScript Async Programming: Promise and async/await", "year": 2023, "category": "Frontend"},
  {"title": "2023 Frontend Framework Comparison: React vs Vue vs Angular", "year": 2023, "category": "Frontend"},
  {"title": "Go Microservices: gRPC and Kubernetes", "year": 2024, "category": "Backend"},
  {"title": "Rust Systems Programming: Memory Safety and Zero-Cost Abstractions", "year": 2023, "category": "Systems"},
  {"title": "Python Machine Learning: From NumPy to PyTorch", "year": 2024, "category": "AI"},
  {"title": "2024 Cloud Native Trends: Service Mesh and eBPF", "year": 2024, "category": "Cloud Native"},
  {"title": "Database Selection Guide: PostgreSQL vs MySQL vs MongoDB", "year": 2023, "category": "Database"},
  {"title": "Python Web Scraping: Scrapy vs Playwright", "year": 2024, "category": "Backend"}
]
```
Query: "Python asynchronous programming"
Strategy 1: Similarity Search
Principle
The most basic retrieval method. Convert the query text to a vector, find the K most similar documents in the vector store.
```python
results = vectorstore.similarity_search("Python async programming", k=4)
```
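Under the hood, similarity search is just nearest-neighbor ranking over embedding vectors. A minimal pure-Python sketch of the idea, using toy 2-D vectors in place of real embeddings (`cosine_sim` and `top_k` are illustrative helpers, not library APIs):

```python
import math

def cosine_sim(a, b):
    # Cosine similarity: dot product divided by the product of the norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    # Score every document against the query, keep the k highest
    scored = [(i, cosine_sim(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.0], doc_vecs, k=2))  # documents 0 and 1 rank highest
```

A real vector store does the same thing with approximate nearest-neighbor indexes so it scales past brute force.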
Experiment Results
Retrieved 4 documents, covering 3 categories:
| Rank | Year | Category | Title |
|---|---|---|---|
| 1 | 2024 | Cloud Native | 2024 Cloud Native Trends: Service Mesh and eBPF |
| 2 | 2023 | Frontend | 2023 Frontend Framework Comparison: React vs Vue vs Angular |
| 3 | 2024 | Backend | Python Web Scraping: Scrapy vs Playwright |
| 4 | 2024 | Backend | 2024 Python Performance Optimization Guide |
Analysis
- ✅ Simple and direct, one line of code
- ❌ Results concentrated in few categories (Backend appears twice)
- ❌ May miss content from other relevant angles
Note: The top result is a "Cloud Native" article, which seems counter-intuitive. This is because the BGE model considers this article semantically related (both involve "technology trends" and "services"), but for humans it's clearly not precise enough. This is exactly why multiple strategies should be combined.
Strategy 2: MMR (Maximum Marginal Relevance)
Principle
MMR selects results greedily — at each step it picks the candidate dᵢ with the highest score:

MMR(dᵢ) = λ × Sim(query, dᵢ) − (1−λ) × max over selected dⱼ of Sim(dᵢ, dⱼ)
- First term: Relevance between document di and the query (larger is better)
- Second term: Similarity between document di and already selected documents (smaller is better, ensures diversity)
- λ (lambda_mult): Balance parameter, 0.5 means equal weight for relevance and diversity
```python
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "lambda_mult": 0.5, "fetch_k": 20},
)
```
`fetch_k=20` means the retriever first fetches the 20 nearest candidates, then applies MMR to pick 4 of them. A larger candidate pool generally yields better diversity.
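The greedy selection can be sketched in a few lines of plain Python (toy 3-D vectors stand in for real embeddings; `mmr_select` is an illustrative helper, not the LangChain implementation):

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    # Greedily pick the candidate maximizing
    # λ·Sim(query, d) − (1−λ)·max Sim(d, already selected)
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, float("-inf")
        for i in candidates:
            relevance = cosine_sim(query_vec, doc_vecs[i])
            redundancy = max((cosine_sim(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected

docs = [[1.0, 0.0, 0.0],   # doc 0: highly relevant
        [0.9, 0.2, 0.0],   # doc 1: near-duplicate of doc 0
        [0.3, 0.3, 0.9]]   # doc 2: different angle
query = [1.0, 0.1, 0.1]
print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # [0, 2] — skips the duplicate
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1] — pure relevance
```

With λ=1.0 the redundancy term vanishes and the output matches plain similarity ranking, which is why the two extremes below behave as the comments say.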
Experiment Results
Retrieved 4 documents, covering 4 categories:
| Rank | Year | Category | Title |
|---|---|---|---|
| 1 | 2024 | Cloud Native | 2024 Cloud Native Trends: Service Mesh and eBPF |
| 2 | 2024 | Backend | Python Web Scraping: Scrapy vs Playwright |
| 3 | 2023 | Systems | Rust Systems Programming: Memory Safety and Zero-Cost Abstractions |
| 4 | 2023 | Database | Database Selection Guide: PostgreSQL vs MySQL vs MongoDB |
Comparison Analysis
| Metric | Similarity Search | MMR |
|---|---|---|
| Categories covered | 3 | 4 |
| Category list | Backend, Cloud Native, Frontend | Backend, Cloud Native, Systems, Database |
| Characteristic | Concentrated in few categories | More dispersed, more diverse |
MMR Parameter Tuning
```python
# Relevance only — equivalent to plain similarity search
search_kwargs={"k": 4, "lambda_mult": 1.0}

# Diversity only — results may not be very relevant
search_kwargs={"k": 4, "lambda_mult": 0.0}

# Balance both (recommended)
search_kwargs={"k": 4, "lambda_mult": 0.5, "fetch_k": 20}
```
Strategy 3: Similarity Threshold Filtering
Principle
Only keep results that are similar enough to the query; discard the rest.
Important: Chroma returns a distance, not a similarity score — smaller distance means more similar — so when filtering on distance you keep results at or below the threshold.
```python
# Inspect the distance distribution first
query = "Python async programming"
results_with_score = vectorstore.similarity_search_with_score(query, k=10)
for doc, score in results_with_score:
    print(f"distance={score:.4f} | {doc.metadata['title']}")
```
Distance Distribution (Measured)
```
distance=0.8652 | 2024 Cloud Native Trends: Service Mesh and eBPF
distance=0.8764 | 2023 Frontend Framework Comparison: React vs Vue vs Angular
distance=0.8833 | Python Web Scraping: Scrapy vs Playwright
distance=0.8857 | 2024 Python Performance Optimization Guide
distance=0.8906 | Python Machine Learning: From NumPy to PyTorch
distance=0.9019 | Rust Systems Programming: Memory Safety and Zero-Cost Abstractions
distance=0.9024 | Python Async Programming: From asyncio to aiohttp
distance=0.9145 | JavaScript Async Programming: Promise and async/await
distance=0.9147 | Database Selection Guide: PostgreSQL vs MySQL vs MongoDB
distance=0.9481 | Go Microservices: gRPC and Kubernetes
```
Manual Threshold Filtering
```python
threshold = 0.89
filtered = [(doc, score) for doc, score in results_with_score if score <= threshold]
# Result: 4 documents (the first 4, with distance <= 0.89)
```
Analysis
- ✅ Can filter out obviously irrelevant results (Go article at distance 0.9481)
- ⚠️ The threshold needs experimentation: too strict a distance cutoff returns nothing, too loose a cutoff filters nothing out
- 💡 Recommendation: Run a batch of queries to see distance distribution, then set the threshold
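One simple heuristic for that recommendation: collect distances from a batch of queries and take a percentile as the cutoff. `pick_threshold` is a hypothetical helper, shown here on the measured distances above:

```python
def pick_threshold(distances, keep_fraction=0.4):
    # Sort observed distances and keep roughly the closest keep_fraction
    ordered = sorted(distances)
    idx = max(0, int(len(ordered) * keep_fraction) - 1)
    return ordered[idx]

# The measured distances from the experiment above
distances = [0.8652, 0.8764, 0.8833, 0.8857, 0.8906,
             0.9019, 0.9024, 0.9145, 0.9147, 0.9481]

threshold = pick_threshold(distances)
kept = [d for d in distances if d <= threshold]
print(f"threshold={threshold}, kept={len(kept)}")  # threshold=0.8857, kept=4
```

The resulting cutoff (0.8857) matches the manual 0.89 choice above: both keep the top 4 documents.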
Strategy 4: Self-Query (Query Parsing + Metadata Filtering)
Principle
User queries often aren't pure semantic questions — they come with explicit conditions:
- "Articles about Python in 2024" → year=2024, tags=Python
- "Backend development category articles" → category=Backend
- "Frontend-related articles from 2023" → year=2023, category=Frontend
Self-Query core flow:
Natural language query → Parser → Structured filter conditions → Metadata filtering → Vector retrieval
Parser Implementation
Production environments can use an LLM (e.g. LangChain's SelfQueryRetriever) for parsing; here a rule-based parser demonstrates the core logic:
```python
import re

# Categories and tags as they appear in the sample data
CATEGORIES = ["Backend", "Frontend", "Systems", "AI", "Cloud Native", "Database"]
TAGS = ["Python", "JavaScript", "Go", "Rust"]

def parse_query(query: str) -> dict:
    filters = {}
    semantic = query
    # Extract a year such as "2024"
    if match := re.search(r"(20\d{2})", query):
        filters["year"] = int(match.group(1))
        semantic = semantic.replace(match.group(1), "").strip()
    # Extract a category
    for cat in CATEGORIES:
        if cat in query:
            filters["category"] = cat
            break
    # Extract a tag
    for tag in TAGS:
        if tag in query:
            filters["tags"] = tag
            break
    return {"semantic_query": semantic, "filters": filters}
```
Experiment Results
Query 1: "Articles about Python in 2024"
Parse result:
Semantic query: Python
Filter conditions: {'year': 2024, 'tags': 'Python'}
After metadata filtering: 4 documents remain:
- Python Async Programming: From asyncio to aiohttp
- 2024 Python Performance Optimization Guide
- Python Machine Learning: From NumPy to PyTorch
- Python Web Scraping: Scrapy vs Playwright
Query 2: "Backend development category articles"
Parse result:
Semantic query: Backend development
Filter conditions: {'category': 'Backend'}
After metadata filtering: 4 documents remain:
- Python Async Programming: From asyncio to aiohttp
- 2024 Python Performance Optimization Guide
- Go Microservices: gRPC and Kubernetes
- Python Web Scraping: Scrapy vs Playwright
Query 3: "Frontend-related articles from 2023"
Parse result:
Semantic query: Frontend
Filter conditions: {'year': 2023}
After metadata filtering: 4 documents remain:
- JavaScript Async Programming: Promise and async/await
- 2023 Frontend Framework Comparison: React vs Vue vs Angular
- Rust Systems Programming: Memory Safety and Zero-Cost Abstractions
- Database Selection Guide: PostgreSQL vs MySQL vs MongoDB

Note that only the year filter was extracted here, so non-Frontend articles slipped through — a concrete example of how the parser's coverage limits the strategy's effectiveness.
Analysis
- ✅ Precisely responds to users' explicit conditions (time, category, tags)
- ✅ Filter first, then retrieve — dramatically reduces the scope of vector comparisons
- ⚠️ Parser quality determines effectiveness (rule-based vs LLM parsing)
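The "filter first, then retrieve" step is easy to sketch: apply the parsed filters to document metadata before doing any vector work. A toy example with a hypothetical `matches` helper and a three-article subset of the test data:

```python
def matches(meta: dict, filters: dict) -> bool:
    # A document passes when every filter key/value agrees with its metadata
    return all(meta.get(key) == value for key, value in filters.items())

articles = [
    {"title": "Python Async Programming: From asyncio to aiohttp", "year": 2024, "category": "Backend"},
    {"title": "JavaScript Async Programming: Promise and async/await", "year": 2023, "category": "Frontend"},
    {"title": "Go Microservices: gRPC and Kubernetes", "year": 2024, "category": "Backend"},
]

filters = {"year": 2024, "category": "Backend"}
candidates = [a for a in articles if matches(a, filters)]
print(len(candidates))  # 2 — only the 2024 Backend articles survive
```

Vector similarity is then computed only over `candidates`, which is what shrinks the comparison scope.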
Using LLM Parser in Production
```python
from langchain.retrievers.self_query.base import SelfQueryRetriever

self_query_retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents="Technical blog articles",
    metadata_field_info=[...],  # Define the metadata fields
)
results = self_query_retriever.invoke("Articles about Python in 2024")
```
Note: In LangChain 1.2.16 community packages, the module location of SelfQueryRetriever may vary. Please adjust the import path according to your installed version.
Four Strategies Comparison Summary
| Strategy | Best For | Core Parameters | Caveats |
|---|---|---|---|
| Similarity Search | General use, highest relevance | k | Results may duplicate |
| MMR | Multi-angle answers needed | lambda_mult, fetch_k | Parameters need tuning |
| Threshold Filtering | Quality over quantity | score_threshold | Needs experimentation |
| Self-Query | Queries with explicit conditions | Parser quality | Rule-based or LLM parser |
Combined Usage Recommendation
In real production environments, combining strategies works best:
User query
↓
Self-Query parsing → Metadata filtering (narrow scope)
↓
Vector retrieval → MMR (ensure diversity)
↓
Threshold filtering (remove low-quality)
↓
Top-K results → LLM generates answer
```python
# Combined example
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,
        "lambda_mult": 0.5,
        "fetch_k": 50,
        # Self-Query parsed conditions; note that recent Chroma versions
        # require multiple conditions to be wrapped in {"$and": [...]}
        "filter": {"year": 2024, "category": "Backend"},
    },
)
```
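The whole pipeline can also be mocked end to end in plain Python. This is a toy sketch with made-up titles and 3-D vectors (`retrieve` and its parameters are illustrative, not a library API): metadata filtering drops the 2023 article, MMR drops the near-duplicate, and the distance threshold drops the off-topic one.

```python
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, docs, filters, k=2, lambda_mult=0.5, max_distance=0.6):
    # 1. Metadata filtering (Self-Query output) narrows the candidate pool
    pool = [d for d in docs
            if all(d["meta"].get(key) == v for key, v in filters.items())]
    # 2. Greedy MMR over the filtered pool ensures diversity
    selected = []
    while pool and len(selected) < k:
        def score(d):
            rel = cos(query_vec, d["vec"])
            red = max((cos(d["vec"], s["vec"]) for s in selected), default=0.0)
            return lambda_mult * rel - (1 - lambda_mult) * red
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    # 3. Threshold filtering removes low-quality hits (distance = 1 - cosine)
    return [d for d in selected if 1 - cos(query_vec, d["vec"]) <= max_distance]

docs = [
    {"title": "python-async",      "vec": [1.0, 0.0, 0.0],   "meta": {"year": 2024}},
    {"title": "python-async-copy", "vec": [0.95, 0.0, 0.05], "meta": {"year": 2024}},
    {"title": "databases",         "vec": [0.0, 1.0, 0.0],   "meta": {"year": 2024}},
    {"title": "old-python",        "vec": [1.0, 0.0, 0.0],   "meta": {"year": 2023}},
]

result = retrieve([1.0, 0.1, 0.0], docs, filters={"year": 2024})
print([d["title"] for d in result])  # ['python-async']
```

Each stage removes a different failure mode, which is the point of combining the strategies rather than picking one.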
Complete Code
The complete code for this article is open-sourced at:
https://github.com/chendongqi/llm-in-action/tree/main/07-retrieval-strategies
Core files:
- `retrieval_strategies.py` — complete comparison experiment of the four retrieval strategies
- `data/sample_articles.json` — the 10 test articles
Summary
This article compared 4 retrieval strategies through code experiments:
- Similarity Search — Simple and direct, good for general scenarios
- MMR — Uses λ parameter to balance relevance and diversity, solves result duplication
- Threshold Filtering — Sets thresholds through distance distribution, filters out low-quality results
- Self-Query — Parses natural language into structured filter conditions, precisely responds to constrained queries
Key Insight: There is no best retrieval strategy, only the one most suitable for the current query. Combining Self-Query + MMR + threshold filtering is how you build a retrieval system that is both precise and comprehensive.