saurabh naik

Posted on May 18

Chunking in RAG: why your splitter matters more than your embedding model

#ai #python #rag #llm

Most RAG retrieval problems I've debugged came down to the same thing: someone swapped the embedding model three times, added a reranker, then gave up — and never once changed the chunker.

This is backwards. The chunker decides what your embedding model is allowed to see. A great embedding on a bad chunk is still a bad retrieval. And the published research since 2024 keeps pointing at the same conclusion: the "smart" chunking strategies don't reliably beat a tuned dumb one. What does beat them is augmenting each chunk with context.

This post walks through the four chunking strategies you'll actually run into, why semantic chunking disappoints on benchmarks, and a working contextual retrieval implementation with the numbers from Anthropic's report. By the end you should have a default chunking recipe you can defend with data, not vibes.

The four chunking strategies

Almost every chunker in the wild is a variation of one of these.

1. Fixed-size

Split every N tokens (or characters) with some overlap.

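def fixed_chunks(text: str, size: int = 512, overlap: int = 50):
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()  # whitespace words approximate tokens; swap in a tokenizer for exact counts
    step = size - overlap
    # Stop early so the last chunk is never just the previous chunk's overlap.
    stop = max(len(words) - overlap, 1) if words else 0
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]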

Fast and reproducible. Cuts mid-sentence. Useful as a baseline so you have something to beat.

2. Recursive character splitting

The LangChain default. Tries paragraph breaks first, then sentences, then words — recursing until each chunk fits.

from langchain_text_splitters import RecursiveCharacterTextSplitter

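splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # counted in characters by default, not tokens
    chunk_overlap=100,   # same unit as chunk_size
    separators=["\n\n", "\n", ". ", " ", ""],
)
# For token-based sizes, build the splitter with
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) instead.
chunks = splitter.split_text(document)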

This is the pragmatic default for prose. It respects natural breaks when it can, falls back to character splits when it can't.
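To make the "coarse separators first, then recurse" idea concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation: the function name `recursive_split` is made up for illustration, it has no overlap, it measures length in characters, and it drops the separator at each split point.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse into pieces that are still too long."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, buf = [], ""
    for part in text.split(sep):
        candidate = part if not buf else buf + sep + part
        if len(candidate) <= max_len:
            buf = candidate  # keep packing neighbours while they fit
            continue
        if buf.strip():
            chunks.append(buf)
        buf = ""
        if len(part) <= max_len:
            buf = part
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, max_len, finer))
    if buf.strip():
        chunks.append(buf)
    return chunks


text = ("Para one is short.\n\n"
        "Para two is a bit longer and has two sentences. Here is the second one.")
for chunk in recursive_split(text, max_len=40):
    print(repr(chunk))
```

The short paragraph survives whole, while the long one is broken at the sentence boundary and then at a word boundary, which is the "respects natural breaks when it can" behaviour in miniature.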

3. Document-structure-aware

Uses the document's own structure as the split signal — Markdown headers, HTML tags, code AST nodes. The chunks carry the section path as metadata, which is gold for filtering at retrieval time.

from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")],
)
chunks = splitter.split_text(markdown_doc)
# each chunk's metadata: {"h1": "...", "h2": "...", "h3": "..."}

Use this whenever your source has structure. Throwing it away to run recursive character splitting is a self-inflicted wound.
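If you want to see what "section path as metadata" means without pulling in a library, here is a rough sketch. The function `split_by_headers` and its output shape are my own illustration, not LangChain's API, and it assumes ATX-style headers (`#`, `##`, `###`) with no `#` lines hiding inside code fences.

```python
import re

HEADER = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown: str):
    """Return one chunk per section body, tagged with the header path above it."""
    chunks, path, lines = [], {}, []

    def flush():
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"text": body, "metadata": dict(path)})
        lines.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if not match:
            lines.append(line)
            continue
        flush()
        level = len(match.group(1))
        # A new header closes every header at the same or a deeper level.
        for key in [k for k in path if int(k[1:]) >= level]:
            del path[key]
        path[f"h{level}"] = match.group(2).strip()
    flush()
    return chunks


doc = "# Guide\nintro\n## Install\npip install x\n## Usage\nrun it\n# FAQ\nq1"
for chunk in split_by_headers(doc):
    print(chunk["metadata"], "->", chunk["text"])
```

At retrieval time that metadata lets you filter or boost by section, for example restricting a query to chunks whose `h1` is "FAQ".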

4. Semantic chunking

Embed each sentence, walk through the document, and start a new chunk every time the cosine distance between adjacent sentences exceeds a percentile threshold.

from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding(),
)
nodes = splitter.get_nodes_from_documents(documents)
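Stripped of the library, the breakpoint logic looks roughly like the sketch below. It is a simplification, not LlamaIndex's code: it assumes you already have one embedding vector per sentence, it compares single sentences rather than buffered groups, and the helper names are invented for the example.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Linearly interpolated percentile, pct in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(sentences, vectors, pct=95):
    """Start a new chunk wherever the adjacent-sentence distance exceeds the percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy 2-d "embeddings": two sentences about cats, two about markets.
sentences = ["Cats purr.", "Cats nap.", "Stocks fell.", "Bonds rose."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_chunks(sentences, vectors))
```

Note that the threshold is relative to the document itself, so every document gets split somewhere, whether or not it actually changes topic.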

Intuitively appealing. Topically coherent chunks should retrieve better. But it costs you an embedding call per sentence at index time.

The intuition is wrong often enough to matter.

Why semantic chunking often disappoints

Chroma Research published a careful evaluation in 2024 (Brandon Smith and Anton Troynikov, the latter a Chroma co-founder). They tested embedding-similarity splitters, cluster-based chunkers, and LLM-based chunkers against plain recursive and fixed-size chunking across multiple corpora, scoring token-level Intersection-over-Union, precision, and recall.

The headline result: semantic methods produced inconsistent, often negligible gains. Sometimes they lost. Meanwhile they cost orders of magnitude more in embedding and LLM calls at index time.

The dominant variables across every experiment were chunk size and overlap, not the splitting strategy. A RecursiveCharacterTextSplitter at the right size was a hard-to-beat baseline.

If you're going to spend engineering hours, spend them on a chunk-size sweep, not on a smarter splitter.

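import numpy as np

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # no gold labels for this query; avoid dividing by zero
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

# chunk_corpus, embed_and_index, documents, and eval_set are placeholders for your own pipeline.
# Gold labels must stay valid across chunk sizes (e.g. source passage IDs, not chunk indices),
# and index.search(q) must return IDs in that same space.

# Sweep chunk_size with everything else held constant
for size in [256, 400, 600, 800, 1000, 1200]:
    chunks = chunk_corpus(documents, size=size, overlap=size // 8)
    index = embed_and_index(chunks)
    scores = [recall_at_k(index.search(q), gold) for q, gold in eval_set]
    print(f"size={size}  recall@5={np.mean(scores):.3f}")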

You will usually see a curve with a peak, not a flat line. Pick the peak. Don't ship a default you never measured.
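One wrinkle with any sweep: chunk IDs change every time the chunk size changes, so gold labels keyed on chunk IDs don't carry over between runs. A way around it, sketched here under the assumption that you store gold answers as `(start, end)` character spans in the source document, is to score overlap at the span level. This is in the spirit of the token-level recall and IoU metrics mentioned above, not a reproduction of Chroma's exact definitions, and the function names are mine.

```python
def span_offsets(spans):
    """Expand (start, end) character spans into a set of offsets (end exclusive)."""
    return {i for start, end in spans for i in range(start, end)}


def span_recall(retrieved_spans, gold_spans):
    """Fraction of gold characters covered by the retrieved chunks."""
    gold = span_offsets(gold_spans)
    if not gold:
        return 0.0
    return len(span_offsets(retrieved_spans) & gold) / len(gold)


def span_iou(retrieved_spans, gold_spans):
    """Intersection over union: penalises retrieving lots of irrelevant text."""
    retrieved, gold = span_offsets(retrieved_spans), span_offsets(gold_spans)
    union = retrieved | gold
    return len(retrieved & gold) / len(union) if union else 0.0


gold = [(100, 200)]                  # the passage that answers the query
retrieved = [(0, 150), (300, 400)]   # spans of the chunks the retriever returned
print(f"recall={span_recall(retrieved, gold):.2f}  iou={span_iou(retrieved, gold):.2f}")
```

Recall alone rewards huge chunks, since they cover everything. IoU pushes back by counting the irrelevant characters you dragged along, which is why it helps to look at both when comparing sizes.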

What actually moves the needle: contextual retrieval

The interesting move isn't a smarter splitter. It's keeping the splitter dumb and giving each chunk back the context it lost when you split it.

This is Anthropic's contextual retrieval recipe. For every chunk, prompt a cheap model with the full document and the chunk, and ask for 50-100 tokens of situating context. Prepend that context to the chunk before embedding.

import anthropic

client = anthropic.Anthropic()

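# The document lives in its own content block. Prompt caching matches on an
# exact prefix up to the cache breakpoint, so the per-chunk text has to come
# after it; otherwise every call has a different prefix and nothing is reused.
DOC_PROMPT = """<document>
{doc}
</document>"""

CHUNK_PROMPT = """Here is a chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Give a short (50-100 token) context that situates this chunk
within the overall document for retrieval. Answer only with the
succinct context."""

def contextualize(doc: str, chunk: str) -> str:
    msg = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": [
                # Cache breakpoint: identical for every chunk of the same document.
                {"type": "text", "text": DOC_PROMPT.format(doc=doc),
                 "cache_control": {"type": "ephemeral"}},
                # Varies per chunk, so it sits after the cached prefix.
                {"type": "text", "text": CHUNK_PROMPT.format(chunk=chunk)},
            ],
        }],
    )
    return msg.content[0].text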

augmented_chunks = [
    f"{contextualize(doc, c)}\n\n{c}" for c in chunks
]

The cache_control block matters. Without prompt caching you pay the full document token cost per chunk. With it, the document is cached once and reused across every chunk call — Anthropic reports roughly a 90% cost reduction on the context-generation step.
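Where does a figure like 90% come from? A back-of-the-envelope model helps. The sketch below assumes cache writes are billed at 1.25x the base input price and cache reads at 0.1x, which is my understanding of Anthropic's published ratios for the default short-lived cache; check current pricing before relying on it. It only counts the document portion of each call and ignores the chunk, the instructions, and output tokens.

```python
def doc_tokens_billed(doc_tokens, n_chunks, cached, write_mult=1.25, read_mult=0.10):
    """Input-token equivalents billed for the document across n_chunks calls."""
    if n_chunks <= 0:
        return 0.0
    if not cached:
        return float(doc_tokens * n_chunks)
    # First call writes the cache; the remaining calls read from it.
    return doc_tokens * (write_mult + read_mult * (n_chunks - 1))


def caching_savings(doc_tokens, n_chunks):
    """Fractional saving on document tokens from caching (negative means it costs more)."""
    full = doc_tokens_billed(doc_tokens, n_chunks, cached=False)
    return 1 - doc_tokens_billed(doc_tokens, n_chunks, cached=True) / full


for n in (1, 10, 100):
    print(f"{n:>3} chunks: {caching_savings(8_000, n):+.1%}")
```

Under these assumptions the saving approaches 90% as the number of chunks per document grows, and caching is a small net cost for a document that yields only one chunk. It also only works if the calls for one document happen close together, before the cache entry expires.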

The reported numbers on their evaluation corpus (codebases, papers, fiction; top-20 retrieval failure rate):

Contextual Embeddings alone: 35% fewer failed retrievals (5.7% → 3.7%)
+ Contextual BM25 (the same context augmentation applied to a BM25 index): 49% fewer (5.7% → 2.9%)
+ a reranker on top of both: 67% fewer (5.7% → 1.9%)
One-time indexing cost: ~$1.02 per million document tokens with Haiku + prompt caching
Chunk size assumed in that cost estimate: 800 tokens. The writeup flags chunk size, boundaries, and overlap as things to tune rather than reporting a single optimum.

The 800 number is worth pausing on. It's not "256 because that's what the tutorial said." It's not "1024 because the context window is big." It's the size Anthropic used when costing a real pipeline, which makes it a sensible starting point rather than a universal optimum. Yours may land somewhere similar but not identical, so run the sweep.
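As a sanity check on the retrieval numbers above, the relative reductions follow directly from the failure rates. The sketch also computes the step-over-step gain, which is my own addition and not a figure from the report.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of baseline failures eliminated."""
    return (baseline - improved) / baseline


# Top-20 retrieval failure rates in percent, as quoted above.
stages = [
    ("baseline", 5.7),
    ("contextual embeddings", 3.7),
    ("+ contextual BM25", 2.9),
    ("+ reranker", 1.9),
]

baseline = stages[0][1]
for (_, prev), (label, rate) in zip(stages, stages[1:]):
    vs_baseline = relative_reduction(baseline, rate)
    vs_previous = relative_reduction(prev, rate)
    print(f"{label}: {vs_baseline:.0%} vs baseline, {vs_previous:.0%} vs previous step")
```

Each layer removes a meaningful share of the failures left over by the one before it, so the gains are stacking rather than one component doing all the work.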

When contextual retrieval pays for itself

Indexing cost goes up. Query-time cost is unchanged. So the math is:

How often do you re-index? If you re-index weekly on a 100M-token corpus, that's ~$100/week. Trivial for most production systems.
What's a retrieval miss worth? In support automation a single wrong answer can be measured in minutes of human time. The math is usually obvious.

Where it doesn't pay: tiny corpora (< 1M tokens) where you can fit everything in context anyway, or extremely high-churn corpora where you re-embed many times a day. Everything else, run it.
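If you want to plug in your own numbers, the arithmetic is small enough to sketch. The $1.02 rate is the figure quoted above; the cost per retrieval miss is a hypothetical input you would have to estimate for your own product.

```python
def weekly_indexing_cost(corpus_tokens, reindexes_per_week=1.0, usd_per_million=1.02):
    """Contextualization cost per week in USD, assuming a full re-index each time."""
    return corpus_tokens / 1_000_000 * usd_per_million * reindexes_per_week


def misses_to_break_even(weekly_cost, usd_per_miss):
    """How many retrieval misses per week you must prevent to cover the cost."""
    return weekly_cost / usd_per_miss


cost = weekly_indexing_cost(100_000_000)            # 100M-token corpus, weekly re-index
print(f"weekly cost: ${cost:.2f}")
print(f"break-even at $5 per miss: {misses_to_break_even(cost, 5.0):.1f} misses/week")
```

The high-churn case shows up as soon as you raise `reindexes_per_week`: re-embedding the same corpus several times a day multiplies the bill without changing the benefit.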

Note: Contextual retrieval is additive with everything else. Recursive splitter, document-structure-aware metadata, BM25 hybrid, reranker: they all stack. The 67% number is for the full stack (contextual embeddings, contextual BM25, and a reranker), against 49% without the reranker. Don't read the contextual results as "the reranker is doing nothing."

A default recipe to start from

If you're staring at a blank file, this is a reasonable first pass:

1. Recursive character splitter at 800 tokens, 100 overlap.
2. Preserve any structural metadata (Markdown headers, file paths) as chunk metadata.
3. Add 50-100 tokens of LLM-generated context per chunk with Haiku + prompt caching.
4. Hybrid: vector index + BM25 over the same augmented chunks.
5. Rerank top-20 down to top-5 with a cross-encoder.
6. Build a 100-query eval set from real user logs and run a chunk-size sweep against your corpus before treating any of this as settled.

Step 6 is the one most teams skip. Don't.
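The hybrid step in the recipe leaves open how to merge the vector and BM25 result lists. One common option is reciprocal rank fusion (RRF), sketched below. This is my choice for illustration, not something the recipe or either writeup prescribes, and `k=60` is just the conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=20):
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_n]


vector_hits = ["a", "b", "c"]   # chunk IDs from the vector index, best first
bm25_hits = ["b", "d", "a"]     # chunk IDs from BM25, best first
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

RRF only looks at ranks, so you don't have to normalize cosine similarities against BM25 scores. The fused top-20 is then what you hand to the reranker.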

Wrapping up

Chunking is one of the highest-leverage things in a RAG pipeline and one of the least-measured. The cheap experiments — sweeping chunk size, adding contextual augmentation — usually beat the expensive ones (a fancier embedding model, a third reranker).

Two links worth your time next:

Chroma's chunking evaluation: https://research.trychroma.com/evaluating-chunking
Anthropic's contextual retrieval writeup: https://www.anthropic.com/news/contextual-retrieval

What chunk size do you run in production — and have you actually benchmarked it against alternatives, or is it still the framework default? I'm curious how often teams have a measured answer here.

DEV Community