DEV Community

saurabh naik

Chunking for RAG: stop tuning the wrong knob

Every other week a new "smart" chunking strategy lands on AI Twitter — semantic, agentic, propositional, late chunking. Meanwhile the two boring knobs that actually move retrieval quality (chunk size and overlap) sit at whatever default a tutorial picked in 2023.

This post is for engineers shipping RAG who want a defensible chunking choice instead of a vibes-based one. By the end you'll have: a clear picture of what the recent research says, a working Python eval harness that compares chunking strategies on your own data, and a concrete production default to start from.

The chunking strategies, very briefly

There are basically four families in the wild:

  • Fixed-size: split every N tokens. Fastest, dumbest, cuts mid-sentence.
  • Recursive character splitting (LangChain's RecursiveCharacterTextSplitter): tries paragraph → sentence → word until chunks fit. The pragmatic default for prose.
  • Document-structure-aware: split on Markdown headers, HTML tags, or code AST nodes. Keeps logical sections intact.
  • Semantic chunking (LlamaIndex's SemanticSplitterNodeParser and friends): embed each sentence, cut where adjacent-embedding distance spikes past a percentile. Topically coherent, much more expensive.

The intuition for the last one is seductive — "let embeddings decide where ideas end." That's also the one that doesn't reliably pay off.
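To make the semantic family concrete, here is a minimal sketch of percentile-based breakpoint detection. It assumes sentence embeddings are already computed and passed in as plain lists of floats. The function names, the nearest-rank percentile, and the 90th-percentile default are illustrative choices, not LlamaIndex's actual implementation.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0
    return 1.0 - dot / norm

def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def semantic_chunks(sentences, vectors, pct=90):
    """Group sentences, cutting wherever the distance between adjacent
    sentence embeddings exceeds the pct-th percentile of all such distances."""
    if len(sentences) < 2:
        return [list(sentences)]
    dists = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(dists, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], dists):
        if dist > threshold:  # distance spike: start a new chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks
```

Note the cost structure this implies: one embedding call per sentence before you have indexed anything. On short inputs a high percentile can also yield no cuts at all, which is one reason these splitters need their own tuning.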

What recent research actually shows

Two independent results are worth knowing before you pick a strategy.

Chroma's chunking eval (Brandon Smith and Anton Troynikov) tested embedding-similarity splitters and LLM cluster chunkers against naive recursive and fixed-size chunking, scored with Intersection-over-Union and Recall on multiple corpora. The headline: semantic methods showed inconsistent, often negligible gains. Sometimes they lost. The dominant variables were chunk size and overlap, not the splitter. Default RecursiveCharacterTextSplitter at ~200–400 tokens was a strong baseline.

Databricks Mosaic AI's FinanceBench sweep went the other direction — fix the splitter (recursive), vary chunk size, measure answer correctness end-to-end:

  • 512-token chunks → ~36% correctness
  • 1024 → ~42%
  • 2048 → ~45%
  • 4096 → ~47%

Bumping overlap from 20% to 50% added less than a point and roughly doubled the index. In other words, larger chunks helped more than fancier splitting — and overlap mostly bought you a bigger index.
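As a back-of-envelope check on the index-size cost, the sketch below estimates chunk counts for a plain sliding window. The helper name and the one-million-token corpus are hypothetical, and a recursive splitter will not produce exactly these counts because it respects separators.

```python
import math

def approx_chunk_count(total_tokens, chunk_size, overlap_pct):
    """Approximate number of chunks from a sliding window with fractional overlap."""
    overlap = int(chunk_size * overlap_pct)
    stride = chunk_size - overlap  # tokens of new text each chunk contributes
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk size")
    if total_tokens <= chunk_size:
        return 1
    return 1 + math.ceil((total_tokens - chunk_size) / stride)

corpus_tokens = 1_000_000  # hypothetical corpus size
baseline = approx_chunk_count(corpus_tokens, 1024, 0.0)
for pct in (0.0, 0.2, 0.5):
    n = approx_chunk_count(corpus_tokens, 1024, pct)
    print(f"overlap={int(pct * 100):>2}% chunks={n} ({n / baseline:.2f}x no-overlap)")
```

The count scales roughly as 1 / (1 - overlap): about 1.25x at 20% and about 2x at 50%, both relative to no overlap.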

Anthropic's Contextual Retrieval is the one place "smart" preprocessing clearly paid off. Their move wasn't splitting cleverly; it was augmenting each chunk with ~50–100 tokens of LLM-generated context before embedding:

  • Contextual Embeddings alone: 35% fewer failed retrievals (5.7% → 3.7%)
  • Add Contextual BM25: 49% reduction
  • Add a reranker: 67% reduction
  • Indexing cost: ~$1.02 per million document tokens with Claude Haiku + prompt caching
  • Their cost estimate assumes 800-token chunks with about 100 tokens of generated context per chunk

The pattern across all three: optimize the cheap knobs first (size, overlap), then augment chunks if you need more, and treat semantic splitting as a last resort.

A small eval harness you can actually run

You don't need a benchmark suite to make this call on your own corpus. Forty labeled (question, expected_snippet) pairs and an afternoon will do it. Here's the minimal harness.

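# pip install langchain langchain-community sentence-transformers faiss-cpu tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# Note: all-MiniLM-L6-v2 truncates inputs past 256 word pieces, so swap in a
# longer-context embedding model before trusting results for larger chunks.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

def build_index(docs, chunk_size, chunk_overlap):
    # The plain constructor measures chunk_size in characters;
    # from_tiktoken_encoder measures it in tokens, matching the sizes in this post.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.create_documents(docs)
    return FAISS.from_documents(chunks, embeddings)

def recall_at_k(index, eval_set, k=5):
    # Fraction of questions whose expected snippet appears in any top-k chunk.
    hits = 0
    for q, expected in eval_set:
        results = index.similarity_search(q, k=k)
        if any(expected.lower() in r.page_content.lower() for r in results):
            hits += 1
    return hits / len(eval_set)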

Now sweep:

with open("corpus.txt", encoding="utf-8") as f:
    docs = [f.read()]  # your real corpus
eval_set = [
    ("What's the refund policy?", "refunds are issued within 14 days"),
    # ... 40 of these
]

for size in [256, 512, 1024, 2048]:
    for overlap_pct in [0.0, 0.1, 0.2]:
        idx = build_index(docs, size, int(size * overlap_pct))
        score = recall_at_k(idx, eval_set, k=5)
        print(f"size={size:>4} overlap={int(overlap_pct*100):>2}% recall@5={score:.3f}")

The first time you run this on real data it's deflating in a useful way. The gap between your current setup and the best cell in this grid is often bigger than the gap between any two splitter algorithms.

Note: "Expected snippet appears in any top-k chunk" is a coarse metric. It's fine for picking between configs; for production-grade evals you want a proper retrieval IoU or a downstream answer-correctness score, ideally with an LLM-as-judge over (question, retrieved_chunks, ground_truth_answer).
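For a sense of what a span-level metric looks like, here is a rough sketch of IoU and recall over character offsets. It assumes you know where each expected snippet and each retrieved chunk sits in the source document as (start, end) offsets. The function names are hypothetical, and this is not necessarily the exact definition the Chroma study used.

```python
def _covered(spans):
    """Set of character positions covered by a list of (start, end) spans."""
    positions = set()
    for start, end in spans:
        positions.update(range(start, end))
    return positions

def span_iou(relevant_spans, retrieved_spans):
    """Intersection-over-union between relevant and retrieved character spans."""
    relevant = _covered(relevant_spans)
    retrieved = _covered(retrieved_spans)
    union = relevant | retrieved
    if not union:
        return 0.0
    return len(relevant & retrieved) / len(union)

def span_recall(relevant_spans, retrieved_spans):
    """Share of the relevant characters that the retrieved spans cover."""
    relevant = _covered(relevant_spans)
    if not relevant:
        return 0.0
    return len(relevant & _covered(retrieved_spans)) / len(relevant)
```

Unlike the substring check, IoU penalizes configs that only hit the answer by retrieving huge chunks, so it tends to separate "found it" from "found it efficiently."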

When semantic chunking is worth the bill

The Chroma study isn't a blanket "never use it." Semantic splitting helps when:

  • Your corpus is very heterogeneous in topic density — long technical docs that mix narrative explanation with dense reference tables, for example.
  • Your chunks need to be smaller than recursive splitting can keep coherent (e.g., 200-token chunks where every cut on a paragraph boundary truncates an idea).
  • You're already running cheap embeddings on every sentence for another reason.

If none of those apply, you're paying 10–100× the preprocessing cost to lose to a tuned recursive splitter.

Add context to chunks, not cleverness to splits

Once your size sweep stops moving the needle, the next lever isn't a fancier splitter — it's giving each chunk more context. Anthropic's Contextual Retrieval is the cleanest version of this idea:

CONTEXT_PROMPT = """<document>
{whole_document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short (50-100 token) context that situates this chunk in the
overall document. Answer only with the context, nothing else."""

def contextualize(client, whole_doc, chunk):
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(
                whole_document=whole_doc, chunk=chunk
            ),
        }],
    )
    return msg.content[0].text

# Then embed: f"{context}\n\n{chunk}" instead of just chunk

In production you almost certainly want prompt caching on whole_document — that's what gets the per-token indexing cost down to roughly $1 per million document tokens. Without caching, this approach is too expensive to be a default; with it, it's a reasonable line item.
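One way to set that up, as a sketch: put the document in its own content block and mark it with cache_control, so every chunk from the same document reuses the cached prefix. The helper name build_cached_messages is hypothetical, and the prompt wording simply mirrors the template above.

```python
def build_cached_messages(whole_doc, chunk):
    """Build a messages payload whose document block is marked cacheable."""
    return [{
        "role": "user",
        "content": [
            {
                # Identical across all chunks of this document, so it can be cached.
                "type": "text",
                "text": f"<document>\n{whole_doc}\n</document>",
                "cache_control": {"type": "ephemeral"},
            },
            {
                # Changes per chunk, so it stays outside the cached prefix.
                "type": "text",
                "text": (
                    "Here is the chunk we want to situate within the whole document:\n"
                    f"<chunk>\n{chunk}\n</chunk>\n\n"
                    "Give a short (50-100 token) context that situates this chunk in the "
                    "overall document. Answer only with the context, nothing else."
                ),
            },
        ],
    }]

# Usage (needs an API client, so it is left as a comment):
# msg = client.messages.create(
#     model="claude-haiku-4-5",
#     max_tokens=200,
#     messages=build_cached_messages(whole_doc, chunk),
# )
```

Process all chunks of one document back to back so they hit the cache while it is still warm. Very short documents may fall under the model's minimum cacheable length, in which case nothing is cached.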

Combine that with BM25 over the same contextualized text and a reranker on top of the union of dense and sparse hits, and you have the full pipeline behind the 67% retrieval-failure reduction Anthropic reported, all without leaving recursive chunking.
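The merge-then-rerank step can be sketched in a few lines. Everything here is illustrative: merge_hits and rerank are hypothetical helpers, and keyword_overlap is a toy scoring function standing in for a real reranker model.

```python
def merge_hits(dense_hits, sparse_hits):
    """Union of two ranked lists of chunk texts, deduplicated, order preserved."""
    seen, merged = set(), []
    for hit in [*dense_hits, *sparse_hits]:
        if hit not in seen:
            seen.add(hit)
            merged.append(hit)
    return merged

def rerank(query, candidates, score_fn, top_k=5):
    """Sort candidates by score_fn(query, chunk), highest first, and keep top_k."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_k]

def keyword_overlap(query, chunk):
    """Toy scorer: number of query words present in the chunk.
    Swap in a real reranker (e.g. a cross-encoder) in practice."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

dense = ["refunds are issued within 14 days", "shipping takes a week"]
sparse = ["shipping takes a week", "refunds policy applies to all orders"]
candidates = merge_hits(dense, sparse)
top = rerank("refunds within 14 days", candidates, keyword_overlap, top_k=1)
```

Over-fetch from each retriever and let the reranker cut the merged set down to your final top-k. The reranker only sees the candidates, so its cost stays bounded regardless of corpus size.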

Honest tradeoffs

A few things this post is not claiming:

  • That recursive chunking is optimal. It's a strong default. Your corpus might beat it with structure-aware splitting (Markdown headers, code AST) — that's worth trying before semantic chunking and is usually cheaper at index time too.
  • That bigger chunks are always better. The Mosaic AI sweep showed monotonic gains to 4096, but they were also running a long-context model. With an 8k-context generator, dumping 4k-token chunks limits how many you can stuff into the prompt. The right answer depends on your generator and your top-k.
  • That contextual retrieval is free. It costs an LLM call per chunk at index time. Worth it for high-value, slow-churn corpora (product docs, legal). Probably not for a corpus you re-index hourly.
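The chunk-size versus top-k tension in the second point is simple arithmetic. In this sketch the 1,500-token reserve for instructions, question, and answer is a made-up figure, and the helper name is hypothetical.

```python
def max_chunks_in_prompt(context_window, chunk_tokens, reserved_tokens):
    """How many whole chunks fit once room is reserved for
    instructions, the question, and the generated answer."""
    budget = context_window - reserved_tokens
    return max(0, budget // chunk_tokens)

# Hypothetical 8k-context generator with 1,500 tokens reserved.
for size in (512, 1024, 2048, 4096):
    fits = max_chunks_in_prompt(8192, size, 1500)
    print(f"chunk_size={size:>4} -> top-k ceiling = {fits}")
```

With those numbers a 4096-token chunk leaves room for exactly one retrieved chunk, so any retrieval miss is unrecoverable. Smaller chunks buy you more shots at including the right passage.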

Wrapping up

If you've never tuned chunking, the play is:

  1. Start with RecursiveCharacterTextSplitter, 1024 tokens, 10–20% overlap.
  2. Build a small (40–100) labeled eval set on your real corpus.
  3. Sweep chunk size and overlap. Pick the best cell.
  4. If retrieval is still the bottleneck, add contextual retrieval + BM25 + a reranker before you reach for semantic splitting.

The boring knob beats the smart algorithm in most real systems. Tune it.

What chunk size are you running in production — and is it a tuned number, or the default from a LangChain tutorial? Curious how many teams have actually swept this.