**A deep dive into building production-ready Eternal Contextual RAG with hybrid search and automatic knowledge expansion**
The Breaking Point
A developer was building a RAG chatbot for students studying Indian civics. The goal was simple: answer questions from NCERT textbooks using retrieval-augmented generation.
Everything seemed perfect. The vector database was humming. The embeddings looked good. The LLM was responding fast.
Then a student asked: "What protects Indian citizens?"
The system replied: "No relevant information found."
But the answer was in the database:
"Article 21 guarantees protection of life and personal liberty."
The chunk existed. The similarity search worked. So why did retrieval fail?
This problem turned out to be far more common than expected. Of 13 test queries, 8 failed completely — a failure rate of over 60%.
Something was fundamentally broken.
The Root Cause: Context-Blind Embeddings
Traditional RAG systems embed chunks in complete isolation:
```python
# What gets embedded
chunk = "Article 21 guarantees protection of life and personal liberty."
embedding = embed_model.encode(chunk)
# → [0.234, -0.123, 0.456, ...]
```
When someone searches for "What protects Indian citizens?", the system compares:
- Query: [citizen, protection, rights, safeguards]
- Chunk: [Article 21, guarantees, protection, life, liberty]
The semantic overlap? Minimal.
Why? Because the chunk has no idea:
- It's from the Indian Constitution
- It's in the Fundamental Rights chapter
- It's explaining citizen protections
- It's a legal safeguard against state action
The chunk is context-blind. And according to research from Anthropic, this causes retrieval failures in approximately 40% of queries.
The Anthropic Insight: Contextual Retrieval
In their September 2024 research, Anthropic proposed a deceptively simple solution:
Before embedding a chunk, use an LLM to explain where it fits in the document.
Instead of embedding:
"Article 21 guarantees protection of life and personal liberty."
Embed:
"This chunk from the Fundamental Rights chapter of the Indian
Constitution explains Article 21, one of the most important
constitutional provisions. It guarantees citizens' right to
life and personal liberty, protecting them against arbitrary
state action. Courts have interpreted this broadly to include
rights to education, health, and privacy.
Article 21 guarantees protection of life and personal liberty."
The impact?
- 49% reduction in retrieval failures
- 67% reduction when combined with reranking
The developer decided to implement this approach—and go further.
Building the Three-Layer Architecture
Layer 1: Intelligent Context Generation
Every chunk is processed through an LLM before embedding:
```python
def generate_chunk_context(chunk, full_document, document_name):
    """
    Generate contextual description explaining where
    this chunk fits in the document.
    """
    prompt = f"""
<document>
{full_document}
</document>

<chunk>
{chunk}
</chunk>

Give a short context (2-3 sentences) to situate this
chunk within the overall document for search retrieval.
The context should:
- Explain what this chunk is about
- Mention the document it's from ({document_name})
- Help someone searching for this information find it

Answer only with the context."""

    response = llm.generate(prompt)
    return response.text.strip()
```
Example transformation:
Before:
"The movement gained momentum in 1920."
After:
"This chunk from the History of Indian Independence Movement
describes a turning point when Mahatma Gandhi's return from
South Africa in 1920 catalyzed the freedom struggle and marked
the beginning of mass civil disobedience campaigns.
The movement gained momentum in 1920."
Now the chunk is discoverable by searches like:
- "Gandhi's impact on independence"
- "Freedom struggle turning points"
- "1920s India civil disobedience"
- "Independence movement acceleration"
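The query pipeline later in this article calls `chunk_document` and `contextualize_chunks` helpers that aren't shown explicitly. Here is a minimal sketch of what they might look like, assuming chunks are stored as dictionaries with `original_chunk` and `contextualized_chunk` keys (the field names the search code below expects); the chunk sizes and splitting strategy are illustrative choices, not the author's exact setup.

```python
def chunk_document(text, chunk_size=800, overlap=100):
    """Naive fixed-size character chunking with overlap (illustrative)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def contextualize_chunks(chunks, full_document, document_name):
    """Prepend an LLM-generated context to every chunk before embedding."""
    contextualized = []
    for chunk in chunks:
        context = generate_chunk_context(chunk, full_document, document_name)
        contextualized.append({
            "original_chunk": chunk,
            # Context + original text is what gets embedded and indexed
            "contextualized_chunk": f"{context}\n\n{chunk}",
        })
    return contextualized
```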
Layer 2: Hybrid Search Strategy
Context makes chunks discoverable, but retrieval needs to be multi-dimensional.
The solution? Elasticsearch with simultaneous vector and keyword search.
```python
def hybrid_search(es_client, index, query, top_k=20):
    """
    Perform hybrid search combining kNN and BM25.
    """
    query_embedding = embed_model.encode(query)

    search_query = {
        "size": top_k,
        "query": {
            "bool": {
                "should": [
                    # BM25 keyword search
                    {
                        "multi_match": {
                            "query": query,
                            "fields": [
                                "contextualized_chunk^2",
                                "original_chunk"
                            ],
                            "boost": 0.4  # 40% weight
                        }
                    }
                ]
            }
        },
        "knn": {
            "field": "embedding",
            "query_vector": query_embedding,
            "k": top_k,
            "num_candidates": top_k * 10,
            "boost": 0.6  # 60% weight
        }
    }

    results = es_client.search(index=index, body=search_query)
    return process_results(results)
```
The scoring formula:
final_score = (0.6 × vector_similarity) + (0.4 × bm25_score)
This catches:
- Semantic matches: "What protects citizens?" → Article 21 (vector search)
- Exact terms: "Article 21 protections" → Article 21 (BM25 search)
- Synonym variations: "citizen safeguards" → protection clauses (hybrid)
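For the hybrid query above to work, the index needs a `dense_vector` field for kNN alongside analyzed text fields for BM25. Here is a minimal mapping sketch; the index name, embedding model, and 384-dimension vector size are assumptions made to keep the example concrete, not details from the original setup.

```python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

# Assumed local cluster and embedding model (not the author's exact choices)
es_client = Elasticsearch("http://localhost:9200")
embed_model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors

es_client.indices.create(
    index="contextual_rag",
    mappings={
        "properties": {
            "original_chunk": {"type": "text"},
            "contextualized_chunk": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 384,              # must match the embedding model
                "index": True,
                "similarity": "cosine"    # metric used for kNN scoring
            }
        }
    }
)
```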
Layer 3: Reranking + Dynamic Knowledge Expansion
Step 1: Precision Reranking
Elasticsearch returns 20 candidates. A reranking model evaluates each one:
```python
def rerank_results(query, results, top_n=5):
    """
    Rerank results using specialized reranking model.
    """
    documents = [r["contextualized_chunk"] for r in results]

    rerank_response = rerank_model.rerank(
        query=query,
        documents=documents,
        top_n=top_n
    )

    # Map back to original results with new scores
    reranked = []
    for item in rerank_response.results:
        original = results[item.index].copy()
        original["rerank_score"] = item.relevance_score
        original["score"] = item.relevance_score
        reranked.append(original)

    return reranked
```
This typically provides a 15-20% relevance boost.
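The article doesn't pin down a specific reranking model. As one concrete stand-in, a cross-encoder from sentence-transformers can be wrapped to match the `rerank(query, documents, top_n)` interface used above; the model name and wrapper classes below are assumptions for illustration, not the author's setup.

```python
from dataclasses import dataclass
from sentence_transformers import CrossEncoder

@dataclass
class RerankItem:
    index: int
    relevance_score: float

@dataclass
class RerankResponse:
    results: list

class LocalReranker:
    """Wraps a cross-encoder to mimic the rerank() interface used above."""

    def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query, documents, top_n=5):
        # Score every (query, document) pair, then keep the top_n by score
        scores = self.model.predict([(query, doc) for doc in documents])
        ranked = sorted(range(len(documents)),
                        key=lambda i: scores[i], reverse=True)
        return RerankResponse(
            results=[RerankItem(index=i, relevance_score=float(scores[i]))
                     for i in ranked[:top_n]]
        )

rerank_model = LocalReranker()
```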
Step 2: Automatic Knowledge Expansion
Here's where things get interesting:
```python
def query_pipeline(query, min_confidence=0.65):
    """
    Query with automatic web search fallback.
    """
    # Initial search
    results = hybrid_search(es_client, index, query, top_k=20)
    results = rerank_results(query, results, top_n=10)

    # Check confidence
    confidence = results[0]['rerank_score'] if results else 0.0

    # Low confidence? Search the web
    if confidence < min_confidence:
        print(f"Low confidence ({confidence:.2f}). Searching web...")

        # 1. Search web using LLM with grounding
        web_content = web_search(query)

        # 2. Chunk and contextualize new information
        new_chunks = chunk_document(web_content)
        contextualized = contextualize_chunks(
            new_chunks, web_content, f"web::{query}"
        )

        # 3. Embed and index
        for chunk in contextualized:
            chunk['embedding'] = embed_model.encode(
                chunk['contextualized_chunk']
            )
        index_documents(es_client, index, contextualized)

        # 4. Re-search with expanded knowledge
        results = hybrid_search(es_client, index, query, top_k=20)
        results = rerank_results(query, results, top_n=10)

    return generate_answer(query, results[:5])
```
The system never says "I don't know". It automatically:
- Detects low confidence
- Searches the web
- Contextualizes new findings
- Expands the knowledge base
- Re-searches with enhanced data
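The pipeline also leans on an `index_documents` helper to push newly contextualized web chunks into Elasticsearch. A minimal sketch using the official bulk helper is shown below; the field names follow the mapping assumed earlier, and the refresh call is an assumption to make the new chunks immediately searchable before the re-query.

```python
from elasticsearch.helpers import bulk

def index_documents(es_client, index, chunks):
    """Bulk-index contextualized chunks into Elasticsearch."""
    actions = [
        {
            "_index": index,
            "_source": {
                "original_chunk": c["original_chunk"],
                "contextualized_chunk": c["contextualized_chunk"],
                # Embeddings should be plain Python lists (e.g. .tolist())
                # before indexing, depending on the embedding model used.
                "embedding": c["embedding"],
            },
        }
        for c in chunks
    ]
    bulk(es_client, actions)
    # Make the new chunks visible to the follow-up search
    es_client.indices.refresh(index=index)
```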
When to Use This Approach
✅ Ideal Use Cases
Educational Content
- Textbooks, course materials, study guides
- Lecture notes, academic papers
- Training documentation
Enterprise Knowledge Bases
- Company wikis, internal documentation
- Policy documents, procedures
- Historical decisions, meeting notes
Research & Analysis
- Literature reviews, paper summaries
- Market research, competitor analysis
- Technical documentation
Customer Support
- Product manuals, FAQs
- Troubleshooting guides
- Knowledge base articles
Personal Knowledge Management
- Note-taking systems
- Journal entries, personal docs
- Curated article collections
The complete code, notebooks, and documentation are available on GitHub.
Key resources:
- Full Python implementation
- Runnable Colab notebook
- Architecture documentation
- Example datasets
Clone, customize, and deploy in under an hour.
Further Reading:
- Anthropic's Contextual Retrieval Research
- Elasticsearch Hybrid Search Guide
- Understanding Vector Embeddings
Tags: #rag #machinelearning #ai #llm #vectorsearch #nlp #elasticsearch #python #opensource #genai