DEV Community

Mahek Sayyad
Building Production RAG Pipelines: 3 Lessons from My First Year

How I went from RAG beginner to production-ready, handling 50K+ queries per month


TL;DR

I spent a year building RAG (Retrieval-Augmented Generation) systems at TCS, going from experiments to production systems handling 50,000–80,000 HR queries monthly. Here are 3 hard-earned lessons about chunking, embedding models, and retrieval strategies that I wish I'd known on day one.


The Problem: HR Was Drowning in Queries

Picture this: HR teams fielding 50,000 to 80,000 queries every month. Most were repetitive: "How do I apply for leave?" "What's the laptop policy?" "When is the next pay cycle?"

The knowledge existed in documents, but finding it was painfully slow.

Our team decided to build a RAG-powered system that could:

  1. Understand natural language questions
  2. Retrieve relevant policy documents
  3. Generate accurate, contextual answers

Here's what I learned.


Lesson 1: Chunking Strategy Makes or Breaks Your RAG

The Mistake

My first attempt? Split documents every 500 characters with 50-character overlap. Simple, right?

Result: Terrible retrieval. We were cutting sentences mid-way and losing context.

The Fix

I learned to respect semantic boundaries:

  • Preserve sentences: Don't split mid-sentence unless absolutely necessary
  • Paragraph awareness: Keep related sentences together
  • Context windows: Ensure each chunk has enough context to be understood independently
# What I implemented
def smart_chunk(text, chunk_size=512, overlap=50):
    """Chunk text while preserving sentence boundaries."""
    chunks = []
    start = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))

        # Look for a sentence boundary near the end of the window
        if end < len(text):
            search_start = max(end - 100, start)
            sentence_end = text.rfind('. ', search_start, end)
            if sentence_end != -1:
                end = sentence_end + 1

        chunks.append(text[start:end].strip())
        if end == len(text):
            break  # stop here, or `start = end - overlap` would loop on the tail forever
        start = end - overlap

    return chunks

The Impact

40% reduction in manual HR queries after implementing semantic chunking.


Lesson 2: Not All Embedding Models Are Equal

The Mistake

I started with the first embedding model I found. It seemed to work in testing.

Result: In production, similar queries were retrieving completely wrong documents. "Leave policy" was returning "Laptop policy" because both contained words like "application" and "request."

The Fix

I learned to evaluate embeddings for my specific domain:

| Model             | Size   | Speed  | Domain Fit                    |
| ----------------- | ------ | ------ | ----------------------------- |
| all-MiniLM-L6-v2  | 80 MB  | Fast   | General purpose               |
| all-mpnet-base-v2 | 400 MB | Medium | Better semantic understanding |
| Domain-specific   | Varies | Varies | Requires fine-tuning          |

What worked for us:

  • Start with all-MiniLM-L6-v2 for prototyping
  • Upgrade to all-mpnet-base-v2 for production
  • Consider domain fine-tuning if you have 10K+ documents

Key Insight

Test your embeddings with real queries from actual users. Synthetic test data lies.
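One way to act on this is a tiny evaluation harness: collect real (query, correct document) pairs from users and measure top-1 retrieval accuracy for each candidate model. The sketch below uses a toy bag-of-words embedder as a stand-in; in practice you would swap `embed` for a real model's encoder (e.g. from sentence-transformers) — the function names here are illustrative, not from the original system.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; replace with a real model's encode()."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hit_rate_at_1(labeled_queries, docs):
    """labeled_queries: list of (query, correct_doc_id); docs: {doc_id: text}."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, expected in labeled_queries:
        qv = embed(query)
        best = max(doc_vecs, key=lambda d: cosine(qv, doc_vecs[d]))
        hits += best == expected
    return hits / len(labeled_queries)
```

Run the same labeled set through each candidate embedder and compare the hit rates; even a few dozen real queries will expose failures like the "leave policy" vs "laptop policy" confusion above.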


Lesson 3: Hybrid Search Beats Pure Semantic Search

The Mistake

I relied solely on vector similarity search. It worked great for conceptual queries like "What's our remote work policy?"

Result: Failed miserably on specific queries like "Leave policy version 3.2" or acronyms like "PTO".

The Fix

I implemented hybrid search: combine semantic search with keyword matching.

# Simplified hybrid search approach; vector_store, keyword_search and
# merge_results stand in for your own components
def hybrid_search(query, top_k=5):
    # Semantic search over the vector index
    semantic_results = vector_store.similarity_search(query, k=top_k)

    # Keyword search (BM25 or simple TF-IDF)
    keyword_results = keyword_search(query, k=top_k)

    # Combine and re-rank the two candidate lists
    combined = merge_results(semantic_results, keyword_results)
    return combined[:top_k]
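The merge step is where hybrid search lives or dies. One common, simple choice is reciprocal rank fusion (RRF): score each document by the reciprocal of its rank in every list it appears in, so items ranked well by both searches float to the top. This is a sketch of what `merge_results` could look like, not the exact implementation from our system:

```python
def merge_results(semantic_results, keyword_results, k=60):
    """Fuse two ranked lists of doc IDs with reciprocal rank fusion.

    k dampens the influence of top ranks; 60 is a conventional default.
    """
    scores = {}
    for results in (semantic_results, keyword_results):
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization across the two searches, which is why it is a popular first choice: BM25 scores and cosine similarities live on incomparable scales, and rank-based fusion sidesteps that entirely.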

The Results

  • Conceptual queries β†’ Semantic search wins
  • Specific queries β†’ Keyword search wins
  • Hybrid β†’ Best of both worlds

Bonus: The Forgotten Piece - Evaluation

I almost forgot to mention: how do you know if your RAG is actually good?

I learned to track:

  • Retrieval accuracy: Did we get the right documents?
  • Answer relevance: Does the generated answer actually help?
  • User feedback: Simple πŸ‘/πŸ‘Ž buttons teach you more than metrics
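In practice, tracking these can be as simple as logging one record per query and aggregating. A minimal sketch, with illustrative field names (the real system's schema isn't shown in this post):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryLog:
    retrieved_correct_doc: bool          # did retrieval surface the right document?
    thumbs_up: Optional[bool] = None     # None if the user gave no feedback

def summarize(logs):
    """Aggregate per-query logs into the two headline metrics."""
    rated = [l for l in logs if l.thumbs_up is not None]
    return {
        "retrieval_accuracy": sum(l.retrieved_correct_doc for l in logs) / len(logs),
        "positive_feedback": (sum(l.thumbs_up for l in rated) / len(rated))
                             if rated else None,
    }
```

Even this much gives you a trend line, which is what you actually need to know whether a chunking or embedding change helped.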

Where I Am Now

That first RAG system? It's now handling queries across 3 different business workflows at TCS. The HR team? They went from drowning to actually having time for strategic work.

I'm now working on LangGraph-powered agentic workflows and still learning every day.


Key Takeaways

  1. Chunk smart, not just small - Respect semantic boundaries
  2. Test embeddings with real data - Your domain matters
  3. Hybrid search > pure semantic - Combine approaches
  4. Measure everything - You can't improve what you don't track

Your Turn

Building RAG systems? I'd love to hear about your challenges.

What's your biggest pain point with RAG right now?

  • Chunking strategies?
  • Embedding model selection?
  • Retrieval accuracy?
  • Something else?

Drop a commentβ€”let's learn together!


About Me

I'm Mahek, an AI Engineer at TCS with a passion for making complex AI accessible. When I'm not building RAG pipelines, I'm conducting workshops (upskilled 60+ engineers so far!) or writing about lessons from production systems.

πŸ”— LinkedIn | πŸ’» GitHub


Thanks for reading! If you found this helpful, consider following for more production AI insights.
