Every other week a new "smart" chunking strategy lands on AI Twitter — semantic, agentic, propositional, late chunking. Meanwhile the two boring knobs that actually move retrieval quality (chunk size and overlap) sit at whatever default a tutorial picked in 2023.
This post is for engineers shipping RAG who want a defensible chunking choice instead of a vibes-based one. By the end you'll have: a clear picture of what the recent research says, a working Python eval harness that compares chunking strategies on your own data, and a concrete production default to start from.
The chunking strategies, very briefly
There are basically four families in the wild:
- Fixed-size: split every N tokens. Fastest, dumbest, cuts mid-sentence.
- Recursive character splitting (LangChain's RecursiveCharacterTextSplitter): tries paragraph → sentence → word until chunks fit. The pragmatic default for prose.
- Document-structure-aware: split on Markdown headers, HTML tags, or code AST nodes. Keeps logical sections intact.
- Semantic chunking (LlamaIndex's SemanticSplitterNodeParser and friends): embed each sentence, cut where adjacent-embedding distance spikes past a percentile. Topically coherent, much more expensive.
The intuition for the last one is seductive — "let embeddings decide where ideas end." That's also the one that doesn't reliably pay off.
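To make the semantic-chunking mechanism concrete, here is a minimal sketch of the "cut where adjacent distance spikes past a percentile" idea. It is not LlamaIndex's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the function names and the `pct` default are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for a real embedding model: word-count vector."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values, pct):
    """Percentile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences, pct=90, embed=toy_embed):
    """Group sentences, cutting where adjacent distance exceeds the percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vecs = [embed(s) for s in sentences]
    dists = [cosine_distance(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    threshold = percentile(dists, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], dists):
        if dist > threshold:  # distance spike: start a new chunk here
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Note where the cost comes from: every sentence needs its own embedding before a single chunk exists, which is why this family is so much more expensive than recursive splitting.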
What recent research actually shows
Three results are worth knowing before you pick a strategy.
Chroma's chunking eval (Brandon Smith and Anton Troynikov) tested embedding-similarity splitters and LLM cluster chunkers against naive recursive and fixed-size chunking, scored with Intersection-over-Union and Recall on multiple corpora. The headline: semantic methods showed inconsistent, often negligible gains. Sometimes they lost. The dominant variables were chunk size and overlap, not the splitter. Default RecursiveCharacterTextSplitter at ~200–400 tokens was a strong baseline.
Databricks Mosaic AI's FinanceBench sweep went the other direction — fix the splitter (recursive), vary chunk size, measure answer correctness end-to-end:
- 512-token chunks → ~36% correctness
- 1024 → ~42%
- 2048 → ~45%
- 4096 → ~47%
Bumping overlap from 20% to 50% added less than a point and roughly doubled the index. In other words, larger chunks helped more than fancier splitting — and overlap mostly bought you a bigger index.
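The index growth follows from simple stride arithmetic. The sketch below assumes an idealized fixed-size sliding window (real recursive splitters land on separator boundaries, so actual counts will differ); `chunk_count` and its parameters are hypothetical names for illustration.

```python
import math


def chunk_count(total_tokens, chunk_size, overlap_pct):
    """Number of sliding-window chunks needed to cover total_tokens."""
    overlap = int(chunk_size * overlap_pct)
    stride = chunk_size - overlap  # new tokens contributed by each chunk
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


# One million tokens, 1000-token chunks, three overlap settings.
for pct in (0.0, 0.2, 0.5):
    print(pct, chunk_count(1_000_000, 1000, pct))
```

In this idealized model, 50% overlap halves the stride, so you store about twice as many chunks as with no overlap and about 1.6× as many as with 20% overlap. Every extra chunk is an extra embedding call and an extra vector in the index.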
Anthropic's Contextual Retrieval is the one place "smart" preprocessing clearly paid off. Their move wasn't splitting cleverly; it was augmenting each chunk with ~50–100 tokens of LLM-generated context before embedding:
- Contextual Embeddings alone: 35% fewer failed retrievals (5.7% → 3.7%)
- Add Contextual BM25: 49% reduction
- Add a reranker: 67% reduction
- Indexing cost: ~$1.02 per million document tokens with Claude Haiku + prompt caching
- Their sweet spot: 800 tokens, 100-token overlap
The pattern across all three: optimize the cheap knobs first (size, overlap), then augment chunks if you need more, and treat semantic splitting as a last resort.
A small eval harness you can actually run
You don't need a benchmark suite to make this call on your own corpus. Forty labeled (question, expected_snippet) pairs and an afternoon will do it. Here's the minimal harness.
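# pip install langchain langchain-community langchain-text-splitters sentence-transformers faiss-cpu tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# Caveat: all-MiniLM-L6-v2 truncates inputs past 256 word pieces, so swap in a
# longer-context embedding model before trusting results for larger chunks.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

def build_index(docs, chunk_size, chunk_overlap):
    # from_tiktoken_encoder measures chunk_size and chunk_overlap in tokens;
    # the plain constructor would count characters instead.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.create_documents(docs)
    return FAISS.from_documents(chunks, embeddings)

def recall_at_k(index, eval_set, k=5):
    # Fraction of questions whose expected snippet appears in any top-k chunk.
    hits = 0
    for q, expected in eval_set:
        results = index.similarity_search(q, k=k)
        if any(expected.lower() in r.page_content.lower() for r in results):
            hits += 1
    return hits / len(eval_set)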
Now sweep:
with open("corpus.txt", encoding="utf-8") as f:
    docs = [f.read()]  # your real corpus

eval_set = [
    ("What's the refund policy?", "refunds are issued within 14 days"),
    # ... 40 of these
]

for size in [256, 512, 1024, 2048]:
    for overlap_pct in [0.0, 0.1, 0.2]:
        idx = build_index(docs, size, int(size * overlap_pct))
        score = recall_at_k(idx, eval_set, k=5)
        print(f"size={size:>4} overlap={int(overlap_pct*100):>2}% recall@5={score:.3f}")
The first time you run this on real data it's deflating in a useful way: the gap between your current setup and the best cell in this grid is often bigger than the gap between any two splitter algorithms.
Note: "Expected snippet appears in any top-k chunk" is a coarse metric. It's fine for picking between configs; for production-grade evals you want a proper retrieval IoU or a downstream answer-correctness score, ideally with an LLM-as-judge over (question, retrieved_chunks, ground_truth_answer).
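As a step up from substring matching, here is a rough sketch of a span-level IoU and recall. It works on character offsets rather than tokens, and it assumes you have labeled each question with the (start, end) offsets of the relevant passage and kept each retrieved chunk's offsets in the source document; both are assumptions about your data, and the function names are hypothetical.

```python
def covered_positions(spans):
    """Expand (start, end) character spans into a set of covered positions."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def retrieval_iou(relevant_spans, retrieved_spans):
    """Overlap between relevant and retrieved text, divided by their union."""
    relevant = covered_positions(relevant_spans)
    retrieved = covered_positions(retrieved_spans)
    union = relevant | retrieved
    return len(relevant & retrieved) / len(union) if union else 0.0


def span_recall(relevant_spans, retrieved_spans):
    """Share of the relevant text that made it into the retrieved chunks."""
    relevant = covered_positions(relevant_spans)
    retrieved = covered_positions(retrieved_spans)
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0
```

Recall alone rewards huge chunks, since one giant chunk covers everything. IoU penalizes the irrelevant text you dragged along, which is why the two are worth reading together.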
When semantic chunking is worth the bill
The Chroma study isn't a blanket "never use it." Semantic splitting helps when:
- Your corpus is very heterogeneous in topic density — long technical docs that mix narrative explanation with dense reference tables, for example.
- Your chunks need to be smaller than recursive splitting can keep coherent (e.g., 200-token chunks where every cut on a paragraph boundary truncates an idea).
- You're already running cheap embeddings on every sentence for another reason.
If none of those apply, you're paying 10–100× the preprocessing cost to lose to a tuned recursive splitter.
Add context to chunks, not cleverness to splits
Once your size sweep stops moving the needle, the next lever isn't a fancier splitter — it's giving each chunk more context. Anthropic's Contextual Retrieval is the cleanest version of this idea:
CONTEXT_PROMPT = """<document>
{whole_document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Give a short (50-100 token) context that situates this chunk in the
overall document. Answer only with the context, nothing else."""
def contextualize(client, whole_doc, chunk):
    # client is an anthropic.Anthropic() instance
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(
                whole_document=whole_doc, chunk=chunk
            ),
        }],
    )
    return msg.content[0].text

# Then embed: f"{context}\n\n{chunk}" instead of just chunk
In production you almost certainly want prompt caching on whole_document — that's what gets the per-token indexing cost down to roughly $1 per million document tokens. Without caching, this approach is too expensive to be a default; with it, it's a reasonable line item.
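One way to set that up, sketched under assumptions: split the prompt into two content blocks so the document block can carry a `cache_control` marker, and keep the per-chunk text in a second, uncached block. The helper name `build_cached_messages` is hypothetical, and the minimum cacheable prompt length varies by model, so check the current Anthropic docs before relying on it.

```python
DOC_TEMPLATE = "<document>\n{whole_document}\n</document>"

CHUNK_TEMPLATE = (
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Give a short (50-100 token) context that situates this chunk in the\n"
    "overall document. Answer only with the context, nothing else."
)


def build_cached_messages(whole_doc, chunk):
    """Build a messages payload whose document block is marked cacheable."""
    return [{
        "role": "user",
        "content": [
            {
                # Identical across every chunk of the same document,
                # so it is the part worth caching.
                "type": "text",
                "text": DOC_TEMPLATE.format(whole_document=whole_doc),
                "cache_control": {"type": "ephemeral"},
            },
            {
                # Changes on every call, so it stays outside the cache.
                "type": "text",
                "text": CHUNK_TEMPLATE.format(chunk=chunk),
            },
        ],
    }]


# Usage (not executed here):
# client.messages.create(
#     model="claude-haiku-4-5",
#     max_tokens=200,
#     messages=build_cached_messages(whole_doc, chunk),
# )
```

Process all chunks of one document back to back so the cached prefix is still warm. The response's `usage` object reports `cache_creation_input_tokens` and `cache_read_input_tokens`, which is a quick way to confirm the cache is actually being hit.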
Combine that with BM25 on the same contextualized text and a reranker on top of the union of dense + sparse hits, and you have the same pipeline behind the 67% retrieval-failure reduction Anthropic reported, without ever leaving recursive chunking.
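The "union, then rerank" step can be sketched in a few lines. Everything here is illustrative: the chunk ids, `merge_candidates`, and `rerank` are hypothetical names, and `toy_score` is a word-overlap stand-in for a real reranker model, not a recommendation.

```python
import re


def merge_candidates(dense_hits, sparse_hits):
    """Union of two ranked lists of chunk ids, deduplicated, order preserved."""
    seen, merged = set(), []
    for chunk_id in [*dense_hits, *sparse_hits]:
        if chunk_id not in seen:
            seen.add(chunk_id)
            merged.append(chunk_id)
    return merged


def toy_score(query, text):
    """Hypothetical stand-in for a reranker: count of shared words."""
    query_words = set(re.findall(r"\w+", query.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    return len(query_words & text_words)


def rerank(query, candidate_ids, chunk_texts, score_fn=toy_score, top_k=5):
    """Score every candidate against the query and keep the best top_k."""
    ranked = sorted(
        candidate_ids,
        key=lambda cid: score_fn(query, chunk_texts[cid]),
        reverse=True,
    )
    return ranked[:top_k]
```

The design point is that the dense and sparse retrievers only have to get the right chunk into the candidate pool. Final ordering is the reranker's job, so their individual rankings matter less than their combined coverage.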
Honest tradeoffs
A few things this post is not claiming:
- That recursive chunking is optimal. It's a strong default. Your corpus might beat it with structure-aware splitting (Markdown headers, code AST) — that's worth trying before semantic chunking and is usually cheaper at index time too.
- That bigger chunks are always better. The Mosaic AI sweep showed monotonic gains to 4096, but they were also running a long-context model. With an 8k-context generator, dumping 4k-token chunks limits how many you can stuff into the prompt. The right answer depends on your generator and your top-k.
- That contextual retrieval is free. It costs an LLM call per chunk at index time. Worth it for high-value, slow-churn corpora (product docs, legal). Probably not for a corpus you re-index hourly.
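The chunk-size versus top-k tension in the second point is easy to put numbers on. This is a back-of-the-envelope sketch: `reserved_tokens` is a hypothetical allowance for the system prompt, the question, and the generated answer, and its default is an assumption you should replace with your own measurement.

```python
def max_chunks_in_prompt(context_window, chunk_tokens, reserved_tokens=1500):
    """How many whole chunks fit once the reserved budget is set aside."""
    budget = context_window - reserved_tokens
    return max(0, budget // chunk_tokens)


# An 8k-context generator at a few chunk sizes.
for size in (512, 1024, 2048, 4096):
    print(size, max_chunks_in_prompt(8192, size))
```

Under these assumptions an 8k window holds a single 4096-token chunk, so top-k collapses to 1 and retrieval has to be right on the first hit. At 1024 tokens you can pass six chunks and let the generator sort out which one matters.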
Wrapping up
If you've never tuned chunking, the play is:
- Start with RecursiveCharacterTextSplitter, 1024 tokens, 10–20% overlap.
- Build a small (40–100) labeled eval set on your real corpus.
- Sweep chunk size and overlap. Pick the best cell.
- If retrieval is still the bottleneck, add contextual retrieval + BM25 + a reranker before you reach for semantic splitting.
The boring knob beats the smart algorithm in most real systems. Tune it.
What chunk size are you running in production — and is it a tuned number, or the default from a LangChain tutorial? Curious how many teams have actually swept this.