I work on Microsoft Copilot's Search Infrastructure team, where I focus on semantic indexing and RAG (Retrieval-Augmented Generation). The challenges of building search at scale are fundamentally different from what you encounter in tutorials. Here's what building production RAG taught me — and how I applied those lessons to my open-source projects.
The Tutorial vs. Production Gap
Most RAG tutorials show this:
```python
# Tutorial RAG
chunks = split_document(document)
embeddings = embed(chunks)
store_in_vector_db(embeddings)

# At query time
query_embedding = embed(query)
results = vector_db.search(query_embedding, top_k=5)
answer = llm.generate(query + results)
```
This works for a demo. It fails spectacularly in production because:
- Chunking strategy matters enormously — naive splitting breaks mid-sentence, mid-paragraph, mid-concept
- Embedding quality varies by domain — a model trained on web text performs poorly on legal contracts or medical records
- Top-k retrieval isn't enough — you need re-ranking, filtering, and relevance scoring
- Context window management — stuffing 5 chunks into a prompt wastes tokens on irrelevant content
- Freshness — documents update, and your index needs to stay current without full re-indexing
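The freshness problem in particular can be handled incrementally rather than with full re-indexing. A minimal sketch, assuming a hypothetical `vector_db` with `upsert`/`delete` methods and an `embed` callable (the names are mine, not any specific library's): re-embed only documents whose content hash has changed since the last pass.

```python
import hashlib


def incremental_reindex(documents, index_state, embed, vector_db):
    """Re-embed only documents whose content changed since the last run.

    documents:   dict mapping doc_id -> current text
    index_state: dict mapping doc_id -> content hash from the previous pass
    embed, vector_db: stand-ins for your embedding model and vector store
    """
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index_state.get(doc_id) == digest:
            continue  # unchanged -- skip the expensive embed + upsert
        vector_db.upsert(doc_id, embed(text))
        index_state[doc_id] = digest

    # Drop index entries for documents that no longer exist
    for doc_id in list(index_state):
        if doc_id not in documents:
            vector_db.delete(doc_id)
            del index_state[doc_id]
    return index_state
```

On the second run, only changed or deleted documents touch the embedding model, which is usually the dominant cost.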
Lesson 1: Chunking Is an Art, Not a Split
The biggest mistake in RAG is treating chunking as `text.split(max_length)`. Good chunking preserves:
- Semantic boundaries — paragraphs, sections, logical units
- Context — each chunk should be understandable in isolation
- Overlap — some repetition between chunks prevents information loss at boundaries
- Metadata — source document, section header, page number
```python
class SemanticChunker:
    def chunk(self, document: str, max_tokens: int = 512) -> list:
        sections = self._split_by_headers(document)
        chunks = []
        for section in sections:
            if self._token_count(section) <= max_tokens:
                chunks.append(section)
            else:
                # Section too large: fall back to paragraph-level merging
                paragraphs = section.split('\n\n')
                chunks.extend(self._merge_paragraphs(paragraphs, max_tokens))
        return chunks

    def _merge_paragraphs(self, paragraphs, max_tokens):
        merged = []
        current = ""
        for para in paragraphs:
            # Include the separator in the budget check; avoid a leading one
            candidate = current + "\n\n" + para if current else para
            if self._token_count(candidate) <= max_tokens:
                current = candidate
            else:
                if current:
                    merged.append(current.strip())
                current = para
        if current:
            merged.append(current.strip())
        return merged
```
In my experience building production systems, domain-specific chunking strategies consistently outperform generic ones on retrieval relevance.
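The `_split_by_headers` and `_token_count` helpers are left abstract above. A minimal sketch of what they might look like, assuming markdown-style `#` headers and a cheap whitespace-based token approximation (a real system would use the tokenizer of its embedding model, e.g. tiktoken):

```python
import re


def split_by_headers(document: str) -> list:
    """Split a markdown document at header lines, keeping each header
    attached to the body that follows it."""
    parts = re.split(r"(?m)^(?=#{1,6} )", document)
    return [p for p in parts if p.strip()]


def token_count(text: str) -> int:
    """Cheap approximation: whitespace tokens. Swap in your embedding
    model's actual tokenizer for accurate budgeting."""
    return len(text.split())
```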
Lesson 2: Re-ranking Changes Everything
Vector similarity search returns "similar" results. Similar isn't the same as "relevant." A re-ranker bridges this gap:
```python
def search_with_rerank(self, query: str, top_k: int = 5) -> list:
    # Phase 1: broad retrieval (cast a wide net)
    candidates = self.vector_db.search(query, top_k=top_k * 3)

    # Phase 2: re-rank with a cross-encoder
    scored = []
    for candidate in candidates:
        score = self.reranker.score(query, candidate.text)
        scored.append((candidate, score))

    # Phase 3: return top-k after re-ranking
    scored.sort(key=lambda x: x[1], reverse=True)
    return [item[0] for item in scored[:top_k]]
```
Retrieve 3x what you need, re-rank, then take the top results. This consistently improves answer quality by 15-25% over pure vector search.
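The `reranker` above is any model that scores a (query, passage) pair jointly rather than embedding each side separately. Purely as an illustration of the interface, here is a toy lexical-overlap scorer with the same `.score(query, text)` shape; a real deployment would use a trained cross-encoder (for example, one of the MS MARCO cross-encoders available through the sentence-transformers library):

```python
class OverlapReranker:
    """Toy stand-in for a cross-encoder: scores a (query, passage) pair
    by the fraction of query terms that appear in the passage."""

    def score(self, query: str, text: str) -> float:
        query_terms = set(query.lower().split())
        text_terms = set(text.lower().split())
        if not query_terms:
            return 0.0
        return len(query_terms & text_terms) / len(query_terms)
```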
Lesson 3: Hybrid Search Beats Pure Semantic
Pure embedding-based search misses exact matches. If a user searches for "error code E4012", semantic search might return results about "error handling" instead of the specific error code.
The solution is hybrid search: combine semantic similarity with keyword/BM25 matching:
```python
def hybrid_search(self, query: str, alpha: float = 0.7) -> list:
    semantic_results = self.vector_search(query)
    keyword_results = self.bm25_search(query)

    # Weighted Reciprocal Rank Fusion (the constant 60 dampens rank impact)
    combined = {}
    for rank, result in enumerate(semantic_results):
        combined[result.id] = combined.get(result.id, 0) + alpha / (rank + 60)
    for rank, result in enumerate(keyword_results):
        combined[result.id] = combined.get(result.id, 0) + (1 - alpha) / (rank + 60)
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)
```
In production RAG systems, hybrid search with Reciprocal Rank Fusion is a proven approach that consistently outperforms either retriever on its own.
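The `bm25_search` side can run entirely in memory for modest corpora. A minimal sketch of BM25 scoring with the standard k1/b parameters (the function and its return shape are mine for illustration, not from any particular library):

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score each document in `docs` against `query` with BM25.
    Returns (doc_index, score) pairs sorted by score, highest first."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n_docs = len(docs)

    # Document frequency per term
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            )
            score += idf * norm
        scores.append((i, score))
    return sorted(scores, key=lambda x: x[1], reverse=True)
```

For the "error code E4012" example, this ranks the document containing the literal token highest, exactly where pure semantic search tends to fail.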
Lesson 4: Evaluation Is Non-Negotiable
You can't improve what you can't measure. Every RAG system needs:
- Retrieval metrics: Recall@K, MRR, NDCG
- Generation metrics: Faithfulness, relevance, completeness
- End-to-end metrics: User satisfaction, task completion rate
```python
def evaluate_retrieval(self, test_set: list) -> dict:
    recall_at_5 = []
    mrr = []
    for query, expected_docs in test_set:
        results = self.search(query, top_k=5)
        result_ids = [r.id for r in results]

        # Recall@5: fraction of expected documents retrieved
        hits = len(set(result_ids) & set(expected_docs))
        recall_at_5.append(hits / len(expected_docs))

        # MRR: reciprocal rank of the first relevant result
        for rank, rid in enumerate(result_ids):
            if rid in expected_docs:
                mrr.append(1.0 / (rank + 1))
                break
        else:
            mrr.append(0.0)  # no relevant result retrieved

    return {
        "recall@5": sum(recall_at_5) / len(recall_at_5),
        "mrr": sum(mrr) / len(mrr),
    }
```
Applying These Lessons to Open Source
I've applied lessons learned from working on production-scale search to my open-source projects:
- pdf-chat-assistant — RAG over PDF documents with semantic chunking and re-ranking
- personal-knowledge-base — Local RAG over your personal documents
- study-buddy-bot — RAG over textbook content for educational Q&A
The architecture is the same. The scale is different. But the patterns transfer perfectly.
Key Takeaways
- Don't skip chunking strategy — it's the most impactful optimization
- Always re-rank — the cost is minimal, the quality improvement is significant
- Use hybrid search — semantic + keyword catches what pure semantic misses
- Measure everything — build evaluation into your pipeline from day one
- Start local — you can build and test great RAG systems on a single machine
The full collection of RAG-powered tools is on GitHub. The patterns are open source. Build better search.
Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, where he builds semantic indexing and RAG systems at scale. He maintains 116+ open-source repositories. Read more on dev.to.