How I went from RAG beginner to production-ready, handling 50K+ queries per month
TL;DR
I spent a year building RAG (Retrieval-Augmented Generation) systems at TCS, going from experiments to production systems handling 50,000–80,000 HR queries monthly. Here are 3 hard-earned lessons about chunking, embedding models, and retrieval strategies that I wish I'd known on day one.
The Problem: HR Was Drowning in Queries
Picture this: HR teams fielding 50,000 to 80,000 queries every month. Most were repetitive: "How do I apply for leave?" "What's the laptop policy?" "When is the next pay cycle?"
The knowledge existed in documents, but finding it was painfully slow.
Our team decided to build a RAG-powered system that could:
- Understand natural language questions
- Retrieve relevant policy documents
- Generate accurate, contextual answers
Here's what I learned.
Lesson 1: Chunking Strategy Makes or Breaks Your RAG
The Mistake
My first attempt? Split documents every 500 characters with 50-character overlap. Simple, right?
Result: Terrible retrieval. We were cutting sentences mid-way and losing context.
The Fix
I learned to respect semantic boundaries:
- Preserve sentences: Don't split mid-sentence unless absolutely necessary
- Paragraph awareness: Keep related sentences together
- Context windows: Ensure each chunk has enough context to be understood independently
```python
# What I implemented
def smart_chunk(text, chunk_size=512, overlap=50):
    """Chunk text while preserving sentence boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Look for a sentence boundary near the end of the chunk
        if end < len(text):
            search_start = max(end - 100, start)
            sentence_end = text.rfind('. ', search_start, end)
            if sentence_end != -1:
                end = sentence_end + 1  # keep the period with its sentence
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break  # done; stepping back by `overlap` here would loop forever
        start = end - overlap
    return chunks
```
The Impact
40% reduction in manual HR queries after implementing semantic chunking.
Lesson 2: Not All Embedding Models Are Equal
The Mistake
I started with the first embedding model I found. It seemed to work in testing.
Result: In production, similar queries were retrieving completely wrong documents. "Leave policy" was returning "Laptop policy" because both contained words like "application" and "request."
The Fix
I learned to evaluate embeddings for my specific domain:
| Model | Size | Speed | Domain Fit |
|---|---|---|---|
| all-MiniLM-L6-v2 | 80MB | Fast | General purpose |
| all-mpnet-base-v2 | 400MB | Medium | Better semantic understanding |
| domain-specific | Varies | Varies | Requires fine-tuning |
What worked for us:
- Start with `all-MiniLM-L6-v2` for prototyping
- Upgrade to `all-mpnet-base-v2` for production
- Consider domain fine-tuning if you have 10K+ documents
Key Insight
Test your embeddings with real queries from actual users. Synthetic test data lies.
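One way to make that test concrete is a tiny retrieval-accuracy harness: feed it labelled (query, expected document) pairs collected from real users, and swap embedding functions in and out to compare them. This is a sketch with hypothetical names; `bow_embed` is a toy bag-of-words stand-in for a real embedding model, which you'd plug in via the `embed` parameter.

```python
import math
from collections import Counter

def bow_embed(text):
    """Toy bag-of-words 'embedding' -- a stand-in for a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieval_accuracy(labelled_queries, documents, embed=bow_embed):
    """Fraction of queries whose top-1 retrieved doc is the labelled one."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in documents.items()}
    hits = 0
    for query, expected_doc_id in labelled_queries:
        query_vec = embed(query)
        best = max(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]))
        hits += (best == expected_doc_id)
    return hits / len(labelled_queries)
```

Running the same labelled set through two candidate `embed` functions gives you a direct, domain-specific comparison instead of trusting benchmark leaderboards.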
Lesson 3: Hybrid Search Beats Pure Semantic Search
The Mistake
I relied solely on vector similarity search. It worked great for conceptual queries like "What's our remote work policy?"
Result: Failed miserably on specific queries like "Leave policy version 3.2" or acronyms like "PTO".
The Fix
I implemented hybrid search: combine semantic search with keyword matching.
```python
# Simplified hybrid search approach
def hybrid_search(query, top_k=5):
    # Semantic search
    semantic_results = vector_store.similarity_search(query, k=top_k)
    # Keyword search (BM25 or simple TF-IDF)
    keyword_results = keyword_search(query, k=top_k)
    # Combine and re-rank
    combined = merge_results(semantic_results, keyword_results)
    return combined[:top_k]
```
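The snippet above leaves `merge_results` undefined. One common way to fill it in (an assumption on my part, not necessarily what we shipped) is Reciprocal Rank Fusion, which needs only the two ranked lists of document IDs:

```python
def merge_results(semantic_results, keyword_results, k=60):
    """Fuse two ranked lists of doc IDs with Reciprocal Rank Fusion (RRF)."""
    scores = {}
    for results in (semantic_results, keyword_results):
        for rank, doc_id in enumerate(results):
            # Each list contributes 1 / (k + rank); k=60 is the usual default
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF is attractive here because it never compares raw scores, so you don't have to normalise cosine similarities against BM25 scores before combining them.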
The Results
- Conceptual queries → Semantic search wins
- Specific queries → Keyword search wins
- Hybrid → Best of both worlds
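For the keyword side, the code comment above mentions "BM25 or simple TF-IDF". Here is a minimal TF-IDF scorer as a sketch; note it takes the document dict explicitly (unlike the `keyword_search(query, k=...)` call above) purely so the example is self-contained.

```python
import math
from collections import Counter

def keyword_search(query, documents, k=5):
    """Rank documents by a simple TF-IDF score for the query terms."""
    n = len(documents)
    term_freqs = {doc_id: Counter(text.lower().split())
                  for doc_id, text in documents.items()}
    # Document frequency: in how many docs does each term appear?
    df = Counter()
    for tf in term_freqs.values():
        df.update(tf.keys())

    def score(tf):
        # Rare terms (low df) get a larger IDF weight
        return sum(tf[t] * math.log(1 + n / df[t])
                   for t in query.lower().split() if t in tf)

    ranked = sorted(term_freqs, key=lambda d: score(term_freqs[d]), reverse=True)
    return ranked[:k]
```

This is exactly where queries like "Leave policy version 3.2" win: the literal token "3.2" matches even though no embedding model would treat it as semantically distinctive.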
Bonus: The Forgotten Piece - Evaluation
I almost forgot to mention: how do you know if your RAG is actually good?
I learned to track:
- Retrieval accuracy: Did we get the right documents?
- Answer relevance: Does the generated answer actually help?
- User feedback: Simple 👍/👎 buttons teach you more than metrics
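As a sketch of what tracking this can look like (the class and method names here are hypothetical, not from our production system), even a tiny in-memory tracker over per-query events gives you the first and third numbers:

```python
class RagMetrics:
    """Toy tracker for retrieval hits and thumbs-up/down feedback."""

    def __init__(self):
        self.events = []  # (query, retrieved_ok, feedback_or_None)

    def log(self, query, retrieved_ok, feedback=None):
        # feedback: True for thumbs-up, False for thumbs-down, None if unrated
        self.events.append((query, retrieved_ok, feedback))

    def retrieval_accuracy(self):
        """Fraction of queries where the right documents came back."""
        return sum(ok for _, ok, _ in self.events) / len(self.events)

    def thumbs_up_rate(self):
        """Share of rated answers that got a thumbs-up."""
        rated = [fb for _, _, fb in self.events if fb is not None]
        return sum(rated) / len(rated) if rated else None
```

In production you'd persist these events rather than keep them in memory, but even this much tells you whether a chunking or embedding change moved the needle.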
Where I Am Now
That first RAG system? It's now handling queries across 3 different business workflows at TCS. The HR team? They went from drowning to actually having time for strategic work.
I'm now working on LangGraph-powered agentic workflows and still learning every day.
Key Takeaways
- Chunk smart, not just small - Respect semantic boundaries
- Test embeddings with real data - Your domain matters
- Hybrid search > pure semantic - Combine approaches
- Measure everything - You can't improve what you don't track
Your Turn
Building RAG systems? I'd love to hear about your challenges.
What's your biggest pain point with RAG right now?
- Chunking strategies?
- Embedding model selection?
- Retrieval accuracy?
- Something else?
Drop a comment; let's learn together!
About Me
I'm Mahek, an AI Engineer at TCS with a passion for making complex AI accessible. When I'm not building RAG pipelines, I'm conducting workshops (upskilled 60+ engineers so far!) or writing about lessons from production systems.
Thanks for reading! If you found this helpful, consider following for more production AI insights.