I work on Microsoft Copilot's Search Infrastructure team, where I focus on semantic indexing and RAG (Retrieval-Augmented Generation). The challenges of building search at scale are fundamentally different from what you encounter in tutorials. Here's what building production RAG taught me — and how I applied those lessons to my open-source projects.
The Tutorial vs. Production Gap
Most RAG tutorials show this:
```python
# Tutorial RAG
chunks = split_document(document)
embeddings = embed(chunks)
store_in_vector_db(embeddings)

# At query time
query_embedding = embed(query)
results = vector_db.search(query_embedding, top_k=5)
answer = llm.generate(query + results)
```
This works for a demo. It fails spectacularly in production because:
- Chunking strategy matters enormously — naive splitting breaks mid-sentence, mid-paragraph, mid-concept
- Embedding quality varies by domain — a model trained on web text performs poorly on legal contracts or medical records
- Top-k retrieval isn't enough — you need re-ranking, filtering, and relevance scoring
- Context window management — stuffing 5 chunks into a prompt wastes tokens on irrelevant content
- Freshness — documents update, and your index needs to stay current without full re-indexing
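The freshness problem in particular can be handled incrementally rather than with full re-indexing. A minimal sketch, assuming a hypothetical `vector_db` with `upsert`/`delete` methods and an `embed` callable (the names are mine, not any specific library's): re-embed only documents whose content hash has changed since the last pass.

```python
import hashlib


def incremental_reindex(documents, index_state, embed, vector_db):
    """Re-embed only documents whose content changed since the last run.

    documents:   dict mapping doc_id -> current text
    index_state: dict mapping doc_id -> content hash from the previous pass
    embed, vector_db: stand-ins for your embedding model and vector store
    """
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index_state.get(doc_id) == digest:
            continue  # unchanged -- skip the expensive embed + upsert
        vector_db.upsert(doc_id, embed(text))
        index_state[doc_id] = digest

    # Drop index entries for documents that no longer exist
    for doc_id in list(index_state):
        if doc_id not in documents:
            vector_db.delete(doc_id)
            del index_state[doc_id]
    return index_state
```

On the second run, only changed or deleted documents touch the embedding model, which is usually the dominant cost.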
Lesson 1: Chunking Is an Art, Not a Split
The biggest mistake in RAG is treating chunking as `text.split(max_length)`. Good chunking preserves:
- Semantic boundaries — paragraphs, sections, logical units
- Context — each chunk should be understandable in isolation
- Overlap — some repetition between chunks prevents information loss at boundaries
- Metadata — source document, section header, page number
```python
class SemanticChunker:
    def chunk(self, document: str, max_tokens: int = 512) -> list:
        sections = self._split_by_headers(document)
        chunks = []
        for section in sections:
            if self._token_count(section) <= max_tokens:
                chunks.append(section)
            else:
                # Section too large: fall back to paragraph-level merging
                paragraphs = section.split('\n\n')
                chunks.extend(self._merge_paragraphs(paragraphs, max_tokens))
        return chunks

    def _merge_paragraphs(self, paragraphs, max_tokens):
        merged = []
        current = ""
        for para in paragraphs:
            # Include the separator in the budget check; avoid a leading one
            candidate = current + "\n\n" + para if current else para
            if self._token_count(candidate) <= max_tokens:
                current = candidate
            else:
                if current:
                    merged.append(current.strip())
                current = para
        if current:
            merged.append(current.strip())
        return merged
```
In my experience building production systems, domain-specific chunking strategies consistently outperform generic ones on retrieval relevance.
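The `_split_by_headers` and `_token_count` helpers are left abstract above. A minimal sketch of what they might look like, assuming markdown-style `#` headers and a cheap whitespace-based token approximation (a real system would use the tokenizer of its embedding model, e.g. tiktoken):

```python
import re


def split_by_headers(document: str) -> list:
    """Split a markdown document at header lines, keeping each header
    attached to the body that follows it."""
    parts = re.split(r"(?m)^(?=#{1,6} )", document)
    return [p for p in parts if p.strip()]


def token_count(text: str) -> int:
    """Cheap approximation: whitespace tokens. Swap in your embedding
    model's actual tokenizer for accurate budgeting."""
    return len(text.split())
```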
Lesson 2: Re-ranking Changes Everything
Vector similarity search returns "similar" results. Similar isn't the same as "relevant." A re-ranker bridges this gap:
```python
def search_with_rerank(self, query: str, top_k: int = 5) -> list:
    # Phase 1: broad retrieval (cast a wide net)
    candidates = self.vector_db.search(query, top_k=top_k * 3)

    # Phase 2: re-rank with a cross-encoder
    scored = []
    for candidate in candidates:
        score = self.reranker.score(query, candidate.text)
        scored.append((candidate, score))

    # Phase 3: return top-k after re-ranking
    scored.sort(key=lambda x: x[1], reverse=True)
    return [item[0] for item in scored[:top_k]]
```
Retrieve 3x what you need, re-rank, then take the top results. This consistently improves answer quality by 15-25% over pure vector search.
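The `reranker` above is any model that scores a (query, passage) pair jointly rather than embedding each side separately. Purely as an illustration of the interface, here is a toy lexical-overlap scorer with the same `.score(query, text)` shape; a real deployment would use a trained cross-encoder (for example, one of the MS MARCO cross-encoders available through the sentence-transformers library):

```python
class OverlapReranker:
    """Toy stand-in for a cross-encoder: scores a (query, passage) pair
    by the fraction of query terms that appear in the passage."""

    def score(self, query: str, text: str) -> float:
        query_terms = set(query.lower().split())
        text_terms = set(text.lower().split())
        if not query_terms:
            return 0.0
        return len(query_terms & text_terms) / len(query_terms)
```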
Lesson 3: Hybrid Search Beats Pure Semantic
Pure embedding-based search misses exact matches. If a user searches for "error code E4012", semantic search might return results about "error handling" instead of the specific error code.
The solution is hybrid search: combine semantic similarity with keyword/BM25 matching:
```python
def hybrid_search(self, query: str, alpha: float = 0.7) -> list:
    semantic_results = self.vector_search(query)
    keyword_results = self.bm25_search(query)

    # Weighted Reciprocal Rank Fusion (the constant 60 dampens rank impact)
    combined = {}
    for rank, result in enumerate(semantic_results):
        combined[result.id] = combined.get(result.id, 0) + alpha / (rank + 60)
    for rank, result in enumerate(keyword_results):
        combined[result.id] = combined.get(result.id, 0) + (1 - alpha) / (rank + 60)
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)
```
In production RAG systems, hybrid search with Reciprocal Rank Fusion is a proven approach that consistently outperforms either retriever on its own.
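The `bm25_search` side can run entirely in memory for modest corpora. A minimal sketch of BM25 scoring with the standard k1/b parameters (the function and its return shape are mine for illustration, not from any particular library):

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score each document in `docs` against `query` with BM25.
    Returns (doc_index, score) pairs sorted by score, highest first."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n_docs = len(docs)

    # Document frequency per term
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            )
            score += idf * norm
        scores.append((i, score))
    return sorted(scores, key=lambda x: x[1], reverse=True)
```

For the "error code E4012" example, this ranks the document containing the literal token highest, exactly where pure semantic search tends to fail.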
Lesson 4: Evaluation Is Non-Negotiable
You can't improve what you can't measure. Every RAG system needs:
- Retrieval metrics: Recall@K, MRR, NDCG
- Generation metrics: Faithfulness, relevance, completeness
- End-to-end metrics: User satisfaction, task completion rate
```python
def evaluate_retrieval(self, test_set: list) -> dict:
    recall_at_5 = []
    mrr = []
    for query, expected_docs in test_set:
        results = self.search(query, top_k=5)
        result_ids = [r.id for r in results]

        # Recall@5: fraction of expected documents retrieved
        hits = len(set(result_ids) & set(expected_docs))
        recall_at_5.append(hits / len(expected_docs))

        # MRR: reciprocal rank of the first relevant result
        for rank, rid in enumerate(result_ids):
            if rid in expected_docs:
                mrr.append(1.0 / (rank + 1))
                break
        else:
            mrr.append(0.0)  # no relevant result retrieved

    return {
        "recall@5": sum(recall_at_5) / len(recall_at_5),
        "mrr": sum(mrr) / len(mrr),
    }
```
Applying These Lessons to Open Source
I've applied lessons learned from working on production-scale search to my open-source projects:
- pdf-chat-assistant — RAG over PDF documents with semantic chunking and re-ranking
- personal-knowledge-base — Local RAG over your personal documents
- study-buddy-bot — RAG over textbook content for educational Q&A
The architecture is the same. The scale is different. But the patterns transfer perfectly.
Key Takeaways
- Don't skip chunking strategy — it's the most impactful optimization
- Always re-rank — the cost is minimal, the quality improvement is significant
- Use hybrid search — semantic + keyword catches what pure semantic misses
- Measure everything — build evaluation into your pipeline from day one
- Start local — you can build and test great RAG systems on a single machine
The full collection of RAG-powered tools is on GitHub. The patterns are open source. Build better search.
Nrk Raju Guthikonda is a Senior Software Engineer at Microsoft on the Copilot Search Infrastructure team, where he builds semantic indexing and RAG systems at scale. He maintains 116+ open-source repositories. Read more on dev.to.