Chunking for RAG: stop tuning the wrong knob

#ai #python #rag #llm

Every other week a new "smart" chunking strategy lands on AI Twitter — semantic, agentic, propositional, late chunking. Meanwhile the two boring knobs that actually move retrieval quality (chunk size and overlap) sit at whatever default a tutorial picked in 2023.

This post is for engineers shipping RAG who want a defensible chunking choice instead of a vibes-based one. By the end you'll have: a clear picture of what the recent research says, a working Python eval harness that compares chunking strategies on your own data, and a concrete production default to start from.

The chunking strategies, very briefly

There are basically four families in the wild:

Fixed-size: split every N tokens. Fastest, dumbest, cuts mid-sentence.
Recursive character splitting (LangChain's RecursiveCharacterTextSplitter): tries paragraph → sentence → word until chunks fit. The pragmatic default for prose.
Document-structure-aware: split on Markdown headers, HTML tags, or code AST nodes. Keeps logical sections intact.
Semantic chunking (LlamaIndex's SemanticSplitterNodeParser and friends): embed each sentence, cut where adjacent-embedding distance spikes past a percentile. Topically coherent, much more expensive.

The intuition for the last one is seductive — "let embeddings decide where ideas end." That's also the one that doesn't reliably pay off.
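To make the mechanism concrete, here's a rough sketch of the breakpoint logic behind semantic chunking. It's a simplified illustration, not LlamaIndex's implementation: the function names and the 95th-percentile default are assumptions of mine, and the vectors are expected to come from whatever sentence embedding model you already use.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def semantic_chunks(sentences, vectors, breakpoint_pct=95):
    """Group sentences, cutting where adjacent-embedding distance exceeds the percentile."""
    if len(sentences) < 2:
        return [list(sentences)]
    # Distance between each sentence and the one that follows it.
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, breakpoint_pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # topic shift: close the current chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks
```

Note where the cost comes from: one embedding call per sentence before you've indexed anything, versus zero for a recursive splitter.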

What recent research actually shows

Three independent results are worth knowing before you pick a strategy.

Chroma's chunking eval (Brandon Smith and Anton Troynikov) tested embedding-similarity splitters and LLM cluster chunkers against naive recursive and fixed-size chunking, scored with Intersection-over-Union and Recall on multiple corpora. The headline: semantic methods showed inconsistent, often negligible gains. Sometimes they lost. The dominant variables were chunk size and overlap, not the splitter. Default RecursiveCharacterTextSplitter at ~200–400 tokens was a strong baseline.

Databricks Mosaic AI's FinanceBench sweep went the other direction — fix the splitter (recursive), vary chunk size, measure answer correctness end-to-end:

512-token chunks → ~36% correctness
1024 → ~42%
2048 → ~45%
4096 → ~47%

Bumping overlap from 20% to 50% added less than a point and roughly doubled the index. In other words, chunk size moved end-to-end correctness by about 11 points across the sweep, while extra overlap mostly bought you a bigger index.

Anthropic's Contextual Retrieval is the one place "smart" preprocessing clearly paid off. Their move wasn't splitting cleverly; it was augmenting each chunk with ~50–100 tokens of LLM-generated context before embedding:

Contextual Embeddings alone: 35% fewer failed retrievals (5.7% → 3.7%)
Add Contextual BM25: 49% reduction
Add a reranker: 67% reduction
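Indexing cost: ~$1.02 per million document tokens with Claude Haiku + prompt caching
Assumptions behind that cost figure: 800-token chunks, 8k-token documents, and ~100 tokens of generated context per chunk (a costing assumption, not a tuned chunk size)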

The pattern across all three: optimize the cheap knobs first (size, overlap), then augment chunks if you need more, and treat semantic splitting as a last resort.

A small eval harness you can actually run

You don't need a benchmark suite to make this call on your own corpus. Forty labeled (question, expected_snippet) pairs and an afternoon will do it. Here's the minimal harness.

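# pip install langchain-text-splitters langchain-community sentence-transformers faiss-cpu
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# all-MiniLM-L6-v2 truncates inputs longer than 256 word pieces by default, so
# swap in a longer-context embedding model before trusting the large-chunk cells.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)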

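def build_index(docs, chunk_size, chunk_overlap):
    # chunk_size and chunk_overlap count characters here (the splitter's
    # default length function is len), not tokens.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    # create_documents takes a list of raw strings and returns Document objects.
    chunks = splitter.create_documents(docs)
    return FAISS.from_documents(chunks, embeddings)

def recall_at_k(index, eval_set, k=5):
    """Fraction of questions whose expected snippet appears in a top-k chunk."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for q, expected in eval_set:
        results = index.similarity_search(q, k=k)
        # Case-insensitive substring match; a snippet split across two chunks counts as a miss.
        if any(expected.lower() in r.page_content.lower() for r in results):
            hits += 1
    return hits / len(eval_set)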

Now sweep:

# Read the corpus explicitly as UTF-8 and close the file when done.
with open("corpus.txt", encoding="utf-8") as f:
    docs = [f.read()]  # your real corpus
eval_set = [
    ("What's the refund policy?", "refunds are issued within 14 days"),
    # ... 40 of these
]

# Sizes here are characters (the splitter's default unit), not tokens.
for size in [256, 512, 1024, 2048]:
    for overlap_pct in [0.0, 0.1, 0.2]:
        idx = build_index(docs, size, int(size * overlap_pct))
        score = recall_at_k(idx, eval_set, k=5)
        print(f"size={size:>4} overlap={int(overlap_pct*100):>2}% recall@5={score:.3f}")

The first time you run this on real data it's deflating in a useful way. It's common to find that the gap between your current setup and the best cell in this grid is bigger than the gap between any two splitter algorithms.

Note: "Expected snippet appears in any top-k chunk" is a coarse metric. It's fine for picking between configs; for production-grade evals you want a proper retrieval IoU or a downstream answer-correctness score, ideally with an LLM-as-judge over (question, retrieved_chunks, ground_truth_answer).
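If you want something a step finer than substring matching, a span-level recall and IoU can be sketched like this. It's a simplified character-level version, not the exact metric from the Chroma report, and it assumes you've recorded (start, end) character offsets for both the expected snippet and each retrieved chunk (the function name and span format are mine).

```python
def retrieval_scores(relevant_span, retrieved_spans):
    """Return (recall, iou) over character positions.

    relevant_span: (start, end) of the expected snippet in the source text.
    retrieved_spans: list of (start, end) for each retrieved chunk.
    End offsets are exclusive. Overlapping chunks are counted once.
    """
    relevant = set(range(*relevant_span))
    retrieved = set()
    for start, end in retrieved_spans:
        retrieved.update(range(start, end))
    overlap = relevant & retrieved
    union = relevant | retrieved
    recall = len(overlap) / len(relevant) if relevant else 0.0
    iou = len(overlap) / len(union) if union else 0.0
    return recall, iou

# Half the snippet is covered, plus one irrelevant chunk dragging IoU down.
recall, iou = retrieval_scores((100, 200), [(50, 150), (300, 400)])
print(f"recall={recall:.2f} iou={iou:.2f}")
```

Recall rewards finding the answer at all; IoU also penalizes dragging in irrelevant text, which is what a larger chunk size costs you.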

When semantic chunking is worth the bill

The Chroma study isn't a blanket "never use it." Semantic splitting helps when:

Your corpus is very heterogeneous in topic density — long technical docs that mix narrative explanation with dense reference tables, for example.
Your chunks need to be smaller than recursive splitting can keep coherent (e.g., 200-token chunks where every cut on a paragraph boundary truncates an idea).
You're already running cheap embeddings on every sentence for another reason.

If none of those apply, you're paying 10–100× the preprocessing cost to lose to a tuned recursive splitter.

Add context to chunks, not cleverness to splits

Once your size sweep stops moving the needle, the next lever isn't a fancier splitter — it's giving each chunk more context. Anthropic's Contextual Retrieval is the cleanest version of this idea:

CONTEXT_PROMPT = """<document>
{whole_document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short (50-100 token) context that situates this chunk in the
overall document. Answer only with the context, nothing else."""

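# client = anthropic.Anthropic()  # needs `import anthropic`; reads ANTHROPIC_API_KEY from the environment
def contextualize(client, whole_doc, chunk):
    """Ask the model for a short context blurb that situates chunk within whole_doc."""
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,  # headroom above the 50-100 tokens the prompt asks for
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(
                whole_document=whole_doc, chunk=chunk
            ),
        }],
    )
    # The reply is a list of content blocks; the first one holds the text.
    return msg.content[0].text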

# Then embed: f"{context}\n\n{chunk}" instead of just chunk

In production you almost certainly want prompt caching on whole_document — that's what gets the per-token indexing cost down to roughly $1 per million document tokens. Without caching, this approach is too expensive to be a default; with it, it's a reasonable line item.
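As a rough sketch of what that looks like: split the prompt into two content blocks and mark the document block with cache_control, so repeated calls for chunks of the same document can reuse the cached prefix. The helper names here are mine, the model name is carried over from the snippet above, and you should check the current prompt caching docs for minimum cacheable lengths and pricing before relying on this.

```python
def build_cached_messages(whole_doc, chunk):
    """Build a messages payload whose document block is marked cacheable."""
    return [{
        "role": "user",
        "content": [
            {
                # Identical across all chunks of one document, so it can be cached.
                "type": "text",
                "text": f"<document>\n{whole_doc}\n</document>",
                "cache_control": {"type": "ephemeral"},
            },
            {
                # Changes on every call; kept after the cached prefix.
                "type": "text",
                "text": (
                    "Here is the chunk we want to situate within the whole document:\n"
                    f"<chunk>\n{chunk}\n</chunk>\n\n"
                    "Give a short (50-100 token) context that situates this chunk in the "
                    "overall document. Answer only with the context, nothing else."
                ),
            },
        ],
    }]

def contextualize_cached(client, whole_doc, chunk, model="claude-haiku-4-5"):
    """Same idea as contextualize(), but with the document prefix marked for caching."""
    msg = client.messages.create(
        model=model,
        max_tokens=200,
        messages=build_cached_messages(whole_doc, chunk),
    )
    return msg.content[0].text
```

Process all chunks of one document back to back. The cache is short-lived, so interleaving documents throws away most of the savings.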

Combine that with BM25 on the same contextualized text and a reranker on top of the union of dense + sparse hits, and you have the same pipeline behind the 67% retrieval-failure reduction Anthropic reported, all without leaving recursive chunking. Your own numbers will depend on your corpus, so measure them with the harness above.
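The "union, then rerank" step is simpler than it sounds. Here's a minimal sketch, assuming you already have ranked chunk ids from a dense search and a BM25 search. The function names are hypothetical, and toy_overlap_score is a stand-in for illustration only; in a real pipeline the scorer would be a cross-encoder or a reranking API.

```python
def hybrid_candidates(dense_hits, sparse_hits):
    """Union two ranked lists of chunk ids, keeping first-seen order without duplicates."""
    seen, merged = set(), []
    for chunk_id in [*dense_hits, *sparse_hits]:
        if chunk_id not in seen:
            seen.add(chunk_id)
            merged.append(chunk_id)
    return merged

def rerank(query, candidates, texts, score_fn, top_n=3):
    """Sort candidate ids by score_fn(query, text), best first, and keep top_n."""
    ranked = sorted(
        candidates,
        key=lambda cid: score_fn(query, texts[cid]),
        reverse=True,
    )
    return ranked[:top_n]

def toy_overlap_score(query, text):
    """Placeholder scorer: number of lowercase words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

texts = {
    "a": "refunds are issued within 14 days",
    "b": "shipping takes 3 days",
    "c": "contact support for refunds",
}
candidates = hybrid_candidates(["b", "a"], ["a", "c"])
print(rerank("when are refunds issued", candidates, texts, toy_overlap_score, top_n=2))
```

The design point is that the reranker sees candidates from both retrievers, so a chunk only BM25 found (an exact error code, a product SKU) still gets a chance at the final top-k.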

Honest tradeoffs

A few things this post is not claiming:

That recursive chunking is optimal. It's a strong default. Your corpus might beat it with structure-aware splitting (Markdown headers, code AST) — that's worth trying before semantic chunking and is usually cheaper at index time too.
That bigger chunks are always better. The Mosaic AI sweep showed monotonic gains to 4096, but they were also running a long-context model. With an 8k-context generator, a single 4k-token chunk eats half the window, so once you reserve room for the prompt and the answer you can fit only one. The right answer depends on your generator and your top-k.
That contextual retrieval is free. It costs an LLM call per chunk at index time. Worth it for high-value, slow-churn corpora (product docs, legal). Probably not for a corpus you re-index hourly.
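The chunk-size versus top-k tradeoff is easy to put numbers on. Here's a back-of-the-envelope sketch; the 500-token prompt reserve and 1000-token answer reserve are assumptions you should replace with your own.

```python
def max_chunks_in_context(context_window, chunk_tokens,
                          prompt_tokens=500, answer_tokens=1000):
    """How many whole chunks fit after reserving room for the prompt and the answer."""
    budget = context_window - prompt_tokens - answer_tokens
    return max(budget // chunk_tokens, 0)

# An 8k-context generator with the reserves above.
for chunk_tokens in [512, 1024, 2048, 4096]:
    fits = max_chunks_in_context(8192, chunk_tokens)
    print(f"chunk={chunk_tokens:>4} tokens -> top-k of at most {fits}")
```

If the best cell in your sweep needs a top-k your generator can't hold, the next-best cell is the real winner.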

Wrapping up

If you've never tuned chunking, the play is:

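Start with RecursiveCharacterTextSplitter, ~1024 tokens, 10–20% overlap. Its chunk_size counts characters by default, so either pass a token-based length function or scale the number to characters.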
Build a small (40–100) labeled eval set on your real corpus.
Sweep chunk size and overlap. Pick the best cell.
If retrieval is still the bottleneck, add contextual retrieval + BM25 + a reranker before you reach for semantic splitting.

The boring knob beats the smart algorithm in most real systems. Tune it.

What chunk size are you running in production — and is it a tuned number, or the default from a LangChain tutorial? Curious how many teams have actually swept this.