DEV Community

Atlas Whoff

RAG in production: the chunking and retrieval mistakes everyone makes

Most RAG tutorials get you to a prototype in 30 minutes. Most production RAG systems fail in ways those tutorials never prepare you for. After building several RAG pipelines, I've collected the problems that actually show up in production and how to fix them.

The demo problem

The basic RAG loop looks simple:

  1. Chunk documents → embed chunks → store in vector DB
  2. At query time: embed query → find similar chunks → stuff into prompt

This works great on the demo dataset. It fails in production because:

  • Chunk boundaries cut context in half
  • Retrieval returns semantically similar but contextually wrong chunks
  • The LLM hallucinates when retrieved context is insufficient
  • Performance degrades as the knowledge base grows

Problem 1: Naive chunking destroys context

The default CharacterTextSplitter(chunk_size=1000, chunk_overlap=200) pattern blindly splits on character count. It will cut a code example in half, split a numbered list between items 3 and 4, and separate a table header from its rows.
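To make the failure concrete, here's a toy fixed-width split in plain Python (no splitter library; the chunk size is picked only to show the cut):

```python
doc = "## Setup\n\n1. Install deps\n2. Configure env\n3. Run migrations\n4. Start server\n"

# Blind fixed-width split -- the moral equivalent of a character splitter
# with no structure awareness: slice every 40 characters, no matter what.
chunk_size = 40
chunks = [doc[i : i + chunk_size] for i in range(0, len(doc), chunk_size)]

# The first chunk ends mid-item ("...2. Configure e") and the second
# begins with the orphaned "nv" -- neither chunk embeds cleanly.
```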

Better: semantic chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=[
        "\n## ",      # Split on H2 headers first
        "\n### ",     # Then H3
        "\n\n",       # Then paragraphs
        "\n",         # Then lines
        ". ",         # Then sentences
        " ",          # Last resort: spaces
    ],
    length_function=len,
)

Better: preserve document structure as metadata

def chunk_with_metadata(document: str, source: str) -> list[dict]:
    chunks = splitter.split_text(document)
    return [
        {
            "content": chunk,
            "metadata": {
                "source": source,
                "chunk_index": i,
                "total_chunks": len(chunks),
                # extract_current_section: your own helper that maps a chunk
                # back to the nearest heading in the source document
                "section": extract_current_section(chunk, document),
            }
        }
        for i, chunk in enumerate(chunks)
    ]

Problem 2: Vector similarity isn't enough

Pure cosine similarity retrieval has a well-known failure mode: it finds chunks that are topically similar but not the ones that answer the question.

Query: "How do I cancel my subscription?"
Top vector match: "Subscription plans include monthly and annual billing options"
Actual answer: "To cancel, go to Settings → Billing → Cancel subscription"

The relevant chunk scores lower because it uses different vocabulary.

Fix: hybrid retrieval (vector + BM25)

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# docs: the list of Documents produced by your chunking step
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 4

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Ensemble with reciprocal rank fusion
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6],  # Tune these per your domain
)

BM25 finds exact keyword matches. Vector search finds semantic matches. The ensemble finds both.
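Under the hood, the ensemble merges the two ranked lists with weighted reciprocal rank fusion. A minimal sketch of the fusion step in plain Python — the document ids are hypothetical, and `k=60` is the usual RRF smoothing constant:

```python
def rrf_fuse(ranked_lists: list[list[str]], weights: list[float], k: int = 60) -> list[str]:
    """Weighted reciprocal rank fusion: score(d) = sum_i w_i / (k + rank_i(d))."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["cancel-steps", "billing-plans", "refund-policy"]       # keyword hits
vector_ranking = ["billing-plans", "cancel-steps", "account-settings"]  # semantic hits
fused = rrf_fuse([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
# Documents ranked well by either retriever float to the top of the fused list.
```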

Fix: re-ranking

After retrieval, re-rank chunks with a cross-encoder:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

This is cheap (small model, fast inference) and dramatically improves precision.

Problem 3: No feedback loop

Your RAG system is blind without measurement. At minimum, track:

from dataclasses import dataclass, field
from typing import Optional
import time

@dataclass
class RAGTrace:
    query: str
    retrieved_chunks: list[str]
    retrieval_scores: list[float]
    response: str
    latency_ms: int
    user_feedback: Optional[bool] = None  # thumbs up/down
    timestamp: float = field(default_factory=time.time)

# Analyze:
# - Average retrieval score for queries that got negative feedback
# - Chunks retrieved often but never produce positive feedback (stale/wrong)
# - Query types that consistently fail
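The analysis comments above take only a few lines over logged traces. A sketch with hypothetical toy data (plain dicts standing in for RAGTrace instances):

```python
from statistics import mean

# Toy traces: top retrieval score plus the user's thumbs up/down.
traces = [
    {"query": "how do I cancel?", "top_score": 0.54, "feedback": False},
    {"query": "reset my password", "top_score": 0.91, "feedback": True},
    {"query": "export invoices", "top_score": 0.48, "feedback": False},
]

negative = [t for t in traces if t["feedback"] is False]
avg_negative_top_score = mean(t["top_score"] for t in negative)

# A low average top retrieval score on thumbs-down queries suggests the
# failure is in retrieval, not generation -- fix chunking/ranking first.
```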

Problem 4: Embedding model drift

If you update your embedding model, old vectors are incompatible. Track the model used per chunk:

{
    "content": "...",
    "embedding": [...],
    "metadata": {
        "embedding_model": "text-embedding-3-small",
        "embedding_model_version": "1",
        "indexed_at": "2026-04-07T18:00:00Z",
    }
}

On model upgrade, re-embed in a new namespace and run both in parallel until confidence builds.
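One way to keep the parallel indexes straight is to key the namespace by embedding model, so a query is only ever matched against vectors produced by the same model. A sketch — the model names are real OpenAI models, but the namespace registry and helper are hypothetical:

```python
# Hypothetical registry for a side-by-side migration.
NAMESPACES = {
    "text-embedding-3-small": "kb_v1",  # current production index
    "text-embedding-3-large": "kb_v2",  # candidate, re-embedded in parallel
}

def namespace_for(model: str) -> str:
    # A query embedded with one model must never be searched against
    # vectors produced by another -- the embedding spaces are incompatible.
    if model not in NAMESPACES:
        raise ValueError(f"No index for embedding model {model!r}")
    return NAMESPACES[model]
```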

Problem 5: Context stuffing without ordering

Naively concatenating top-k chunks fails when chunks contradict each other or are from different sections. Use map-reduce for large retrievals:

# Map: extract relevant info from each chunk independently
# Reduce: synthesize the extracted pieces into a coherent answer

map_prompt = """Given this excerpt, extract information relevant to: {question}

Excerpt: {docs}

Relevant info (or "NOT RELEVANT"):"""

reduce_prompt = """Answer this question: {question}

Relevant excerpts:
{doc_summaries}

Comprehensive answer (say "insufficient information" if needed):"""
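Wiring the two prompts together is a short loop. A sketch where `llm` is any callable from prompt string to completion string — a stand-in for your actual client, not a specific library API:

```python
def map_reduce_answer(question: str, chunks: list[str], llm) -> str:
    map_prompt = (
        "Given this excerpt, extract information relevant to: {question}\n\n"
        "Excerpt: {docs}\n\n"
        'Relevant info (or "NOT RELEVANT"):'
    )
    reduce_prompt = (
        "Answer this question: {question}\n\n"
        "Relevant excerpts:\n{doc_summaries}\n\n"
        'Comprehensive answer (say "insufficient information" if needed):'
    )
    # Map: one independent extraction call per chunk.
    summaries = [llm(map_prompt.format(question=question, docs=c)) for c in chunks]
    kept = [s for s in summaries if "NOT RELEVANT" not in s]
    if not kept:
        return "insufficient information"
    # Reduce: one synthesis call over the surviving extractions.
    return llm(reduce_prompt.format(question=question, doc_summaries="\n---\n".join(kept)))
```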

The production RAG checklist

  • [ ] Semantic/structural chunking, not naive character split
  • [ ] Overlap preserves sentence boundaries
  • [ ] Hybrid retrieval (vector + BM25)
  • [ ] Re-ranking with cross-encoder
  • [ ] Chunk metadata: source, section, timestamp
  • [ ] Retrieval tracing logged per query
  • [ ] User feedback loop (even just thumbs up/down)
  • [ ] Embedding model versioned per chunk
  • [ ] Stale document removal pipeline
  • [ ] Evaluation set with ground-truth Q&A pairs

Building AI features into your SaaS? The AI SaaS Starter Kit at whoffagents.com includes a pre-built RAG pattern with pgvector, Next.js API routes, and streaming responses — so you skip the infrastructure and ship the feature.
