How I went from RAG beginner to production-ready, handling 50K+ queries per month
TL;DR
I spent a year building RAG (Retrieval-Augmented Generation) systems at TCS, going from experiments to production systems handling 50,000–80,000 HR queries monthly. Here are 3 hard-earned lessons about chunking, embedding models, and retrieval strategies that I wish I'd known on day one.
The Problem: HR Was Drowning in Queries
Picture this: HR teams fielding 50,000 to 80,000 queries every month. Most were repetitive: "How do I apply for leave?" "What's the laptop policy?" "When is the next pay cycle?"
The knowledge existed in documents, but finding it was painfully slow.
Our team decided to build a RAG-powered system that could:
- Understand natural language questions
- Retrieve relevant policy documents
- Generate accurate, contextual answers
Here's what I learned.
Lesson 1: Chunking Strategy Makes or Breaks Your RAG
The Mistake
My first attempt? Split documents every 500 characters with 50-character overlap. Simple, right?
Result: Terrible retrieval. We were cutting sentences mid-way and losing context.
The Fix
I learned to respect semantic boundaries:
- Preserve sentences: Don't split mid-sentence unless absolutely necessary
- Paragraph awareness: Keep related sentences together
- Context windows: Ensure each chunk has enough context to be understood independently
```python
# What I implemented
def smart_chunk(text, chunk_size=512, overlap=50):
    """Chunk text while preserving sentence boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Look for a sentence boundary near the end of the chunk
        if end < len(text):
            search_start = max(end - 100, start)
            sentence_end = text.rfind('. ', search_start, end)
            if sentence_end != -1:
                end = sentence_end + 1  # keep the period with its sentence
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break  # done; stepping back by `overlap` here would loop forever
        start = end - overlap
    return chunks
```
The Impact
40% reduction in manual HR queries after implementing semantic chunking.
Lesson 2: Not All Embedding Models Are Equal
The Mistake
I started with the first embedding model I found. It seemed to work in testing.
Result: In production, similar queries were retrieving completely wrong documents. "Leave policy" was returning "Laptop policy" because both contained words like "application" and "request."
The Fix
I learned to evaluate embeddings for my specific domain:
| Model | Size | Speed | Domain Fit |
|---|---|---|---|
| all-MiniLM-L6-v2 | 80MB | Fast | General purpose |
| all-mpnet-base-v2 | 400MB | Medium | Better semantic understanding |
| domain-specific | Varies | Varies | Requires fine-tuning |
What worked for us:
- Start with `all-MiniLM-L6-v2` for prototyping
- Upgrade to `all-mpnet-base-v2` for production
- Consider domain fine-tuning if you have 10K+ documents
Key Insight
Test your embeddings with real queries from actual users. Synthetic test data lies.
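One way to make that test concrete is a tiny retrieval-accuracy harness: feed it labelled (query, expected document) pairs collected from real users, and swap embedding functions in and out to compare them. This is a sketch with hypothetical names; `bow_embed` is a toy bag-of-words stand-in for a real embedding model, which you'd plug in via the `embed` parameter.

```python
import math
from collections import Counter

def bow_embed(text):
    """Toy bag-of-words 'embedding' -- a stand-in for a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieval_accuracy(labelled_queries, documents, embed=bow_embed):
    """Fraction of queries whose top-1 retrieved doc is the labelled one."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in documents.items()}
    hits = 0
    for query, expected_doc_id in labelled_queries:
        query_vec = embed(query)
        best = max(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]))
        hits += (best == expected_doc_id)
    return hits / len(labelled_queries)
```

Running the same labelled set through two candidate `embed` functions gives you a direct, domain-specific comparison instead of trusting benchmark leaderboards.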
Lesson 3: Hybrid Search Beats Pure Semantic Search
The Mistake
I relied solely on vector similarity search. It worked great for conceptual queries like "What's our remote work policy?"
Result: Failed miserably on specific queries like "Leave policy version 3.2" or acronyms like "PTO".
The Fix
I implemented hybrid search: combine semantic search with keyword matching.
```python
# Simplified hybrid search approach
def hybrid_search(query, top_k=5):
    # Semantic search
    semantic_results = vector_store.similarity_search(query, k=top_k)
    # Keyword search (BM25 or simple TF-IDF)
    keyword_results = keyword_search(query, k=top_k)
    # Combine and re-rank
    combined = merge_results(semantic_results, keyword_results)
    return combined[:top_k]
```
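The snippet above leaves `merge_results` undefined. One common way to fill it in (an assumption on my part, not necessarily what we shipped) is Reciprocal Rank Fusion, which needs only the two ranked lists of document IDs:

```python
def merge_results(semantic_results, keyword_results, k=60):
    """Fuse two ranked lists of doc IDs with Reciprocal Rank Fusion (RRF)."""
    scores = {}
    for results in (semantic_results, keyword_results):
        for rank, doc_id in enumerate(results):
            # Each list contributes 1 / (k + rank); k=60 is the usual default
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is attractive here because it never compares raw scores, so you don't have to normalise cosine similarities against BM25 scores before combining them.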
The Results
- Conceptual queries → Semantic search wins
- Specific queries → Keyword search wins
- Hybrid → Best of both worlds
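For the keyword side, the code comment above mentions "BM25 or simple TF-IDF". Here is a minimal TF-IDF scorer as a sketch; note it takes the document dict explicitly (unlike the `keyword_search(query, k=...)` call above) purely so the example is self-contained.

```python
import math
from collections import Counter

def keyword_search(query, documents, k=5):
    """Rank documents by a simple TF-IDF score for the query terms."""
    n = len(documents)
    term_freqs = {doc_id: Counter(text.lower().split())
                  for doc_id, text in documents.items()}
    # Document frequency: in how many docs does each term appear?
    df = Counter()
    for tf in term_freqs.values():
        df.update(tf.keys())

    def score(tf):
        # Rare terms (low df) get a larger IDF weight
        return sum(tf[t] * math.log(1 + n / df[t])
                   for t in query.lower().split() if t in tf)

    ranked = sorted(term_freqs, key=lambda d: score(term_freqs[d]), reverse=True)
    return ranked[:k]
```

This is exactly where queries like "Leave policy version 3.2" win: the literal token "3.2" matches even though no embedding model would treat it as semantically distinctive.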
Bonus: The Forgotten Piece - Evaluation
I almost forgot to mention: how do you know if your RAG is actually good?
I learned to track:
- Retrieval accuracy: Did we get the right documents?
- Answer relevance: Does the generated answer actually help?
- User feedback: Simple 👍/👎 buttons teach you more than metrics
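As a sketch of what tracking this can look like (the class and method names here are hypothetical, not from our production system), even a tiny in-memory tracker over per-query events gives you the first and third numbers:

```python
class RagMetrics:
    """Toy tracker for retrieval hits and thumbs-up/down feedback."""

    def __init__(self):
        self.events = []  # (query, retrieved_ok, feedback_or_None)

    def log(self, query, retrieved_ok, feedback=None):
        # feedback: True for thumbs-up, False for thumbs-down, None if unrated
        self.events.append((query, retrieved_ok, feedback))

    def retrieval_accuracy(self):
        """Fraction of queries where the right documents came back."""
        return sum(ok for _, ok, _ in self.events) / len(self.events)

    def thumbs_up_rate(self):
        """Share of rated answers that got a thumbs-up."""
        rated = [fb for _, _, fb in self.events if fb is not None]
        return sum(rated) / len(rated) if rated else None
```

In production you'd persist these events rather than keep them in memory, but even this much tells you whether a chunking or embedding change moved the needle.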
Where I Am Now
That first RAG system? It's now handling queries across 3 different business workflows at TCS. The HR team? They went from drowning to actually having time for strategic work.
I'm now working on LangGraph-powered agentic workflows and still learning every day.
Key Takeaways
- Chunk smart, not just small - Respect semantic boundaries
- Test embeddings with real data - Your domain matters
- Hybrid search > pure semantic - Combine approaches
- Measure everything - You can't improve what you don't track
Your Turn
Building RAG systems? I'd love to hear about your challenges.
What's your biggest pain point with RAG right now?
- Chunking strategies?
- Embedding model selection?
- Retrieval accuracy?
- Something else?
Drop a comment; let's learn together!
About Me
I'm Mahek, an AI Engineer at TCS with a passion for making complex AI accessible. When I'm not building RAG pipelines, I'm conducting workshops (upskilled 60+ engineers so far!) or writing about lessons from production systems.
Thanks for reading! If you found this helpful, consider following for more production AI insights.