Every RAG tutorial shows you `RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)`. Every team that ships one discovers the failure mode nobody talks about.
The one where your customer's 30-page PSA contract splits across six chunks, the LLM retrieves three of them, and the answer confidently omits the indemnity clause. The one where a product-doc QA bot cites two paragraphs that look relevant and misses the table two pages down that actually answered the question. The one where you swap the embedding model, re-chunk, and watch your eval score fall 12 points.
This post walks through the six chunking strategies teams actually reach for in 2026, scores them on a shared corpus, and lands on the pick that keeps winning — even when a flashier approach gets the blog post.
## The evaluation we'll use
Before strategies, the ruler. Two retrieval metrics carry the conversation:
- Context recall — of all the facts needed to answer the question, what fraction were in the retrieved chunks? Low recall means the model is answering without the information; hallucination incoming.
- Context precision — of the chunks you retrieved, what fraction were actually relevant? Low precision burns context window and drags signal under noise.
The numbers in this post come from a 1,200-question corpus over 2,300 technical-product-doc pages (SaaS changelogs, API references, contract PDFs). Top-5 retrieval, text-embedding-3-large, gpt-4o-2024-11-20 as the generator, Ragas for scoring. Same corpus, same questions, same retriever — only the chunking strategy changes.
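Ragas uses an LLM judge to decide whether a fact is covered, but the shape of both metrics is simpler than the tooling suggests. A toy scorer, with substring matching standing in for the judge (the facts and chunks below are invented):

```python
def context_recall(gold_facts: list[str], chunks: list[str]) -> float:
    """Fraction of gold facts that appear in at least one retrieved chunk."""
    if not gold_facts:
        return 1.0
    hits = sum(1 for fact in gold_facts if any(fact in c for c in chunks))
    return hits / len(gold_facts)

def context_precision(gold_facts: list[str], chunks: list[str]) -> float:
    """Fraction of retrieved chunks that contain at least one gold fact."""
    if not chunks:
        return 0.0
    relevant = sum(1 for c in chunks if any(fact in c for fact in gold_facts))
    return relevant / len(chunks)

facts = ["net-30 payment terms", "termination for convenience"]
retrieved = [
    "The agreement includes net-30 payment terms for all invoices.",
    "Either party may exercise termination for convenience with notice.",
    "Logo usage guidelines are described in Appendix C.",
]
print(context_recall(facts, retrieved))     # both facts found -> 1.0
print(context_precision(facts, retrieved))  # 2 of 3 chunks relevant -> ~0.67
```

Low recall and high precision fail differently: the first answers without the information, the second buries it.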
## 1. Fixed-size chunks

```python
def fixed_chunks(text: str, size: int = 800, overlap: int = 0) -> list[str]:
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]
```
How it works in 3 sentences. Slice the text into equal character windows with optional overlap. No respect for sentence, paragraph, or section boundaries. The baseline every other strategy exists to improve on.
When it wins. Homogeneous text with no structure — chat logs, transcripts, single-author essays. Cheapest to compute. Predictable chunk sizes make batch-embedding trivial.
When it loses. The moment a document has headings, tables, or code blocks. Splits mid-sentence, mid-clause, mid-function. Entities are scattered across two chunks the retriever never brings back together.
Scores on the corpus: recall 0.61, precision 0.54. The floor.
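The mid-clause failure is easy to reproduce with the splitter from this section (no overlap; the contract sentence is invented):

```python
def fixed_chunks(text: str, size: int = 800) -> list[str]:
    return [text[i : i + size] for i in range(0, len(text), size)]

text = "Section 9.2: The Indemnifying Party shall hold harmless the Indemnified Party."
for chunk in fixed_chunks(text, size=40):
    print(repr(chunk))
# The word "shall" is cut across the boundary; neither chunk alone
# states who holds whom harmless.
```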
## 2. Recursive character splitting

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(doc)
```
How it works. Try the largest separator first (blank line), fall back to the next (newline, sentence, word) until the chunk fits chunk_size. Preserves paragraph and sentence boundaries when it can. The default in every LangChain tutorial.
When it wins. Most prose documents. Gives you paragraph-aware splits with one line of config. Hard to beat on engineering effort per point of recall.
When it loses. Tables and structured content get flattened. Headings end up orphaned from the section they describe — the model retrieves "Pricing" without the three paragraphs beneath it. The 200-character overlap hides the damage on easy questions and compounds it on hard ones.
Scores: recall 0.74, precision 0.68. The honest default. Most teams stop here and ship.
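The fallback cascade is simple enough to sketch in pure Python. This toy version is not LangChain's implementation — it drops the separator when it splits and skips the merge-back step the real splitter performs — but the shape is the same:

```python
def recursive_split(text: str, size: int, seps: list[str]) -> list[str]:
    """Toy recursive splitter: try coarse separators first, fall back to finer."""
    if len(text) <= size:
        return [text]
    if not seps:  # out of separators: hard character slice
        return [text[i : i + size] for i in range(0, len(text), size)]
    sep, rest = seps[0], seps[1:]
    chunks: list[str] = []
    for piece in text.split(sep):
        if len(piece) <= size:
            chunks.append(piece)
        else:  # piece still too big: retry with the next, finer separator
            chunks.extend(recursive_split(piece, size, rest))
    return chunks

text = "Para one.\n\nPara two is quite a bit longer. It has two sentences."
print(recursive_split(text, 40, ["\n\n", ". ", " "]))
# The short paragraph survives intact; the long one falls back to
# sentence-level splits.
```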
## 3. Semantic chunking

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-large"),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)
chunks = chunker.split_text(doc)
```
How it works. Embed every sentence, walk the document, cut when the cosine distance between adjacent sentences spikes past a threshold. Chunks align with topic shifts rather than character counts.
When it wins. Long-form narrative with clear topic changes — research papers, blog posts, interview transcripts. When you see a 40% recall jump on a semantic-chunker demo, it's usually this kind of corpus.
When it loses. Dense reference docs where every sentence is on-topic. The embedding-distance signal is noisy on technical writing; you get chunks that are either huge (no distance spikes detected) or fragmented (distance spikes on formatting quirks). Also 10–100× more expensive to compute than recursive splitting, and you re-pay every time the corpus changes.
Scores on the product-doc corpus: recall 0.72, precision 0.65. Slightly worse than recursive. Worth trying on prose-heavy corpora. Not worth the compute on dense reference material.
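A minimal sketch of the breakpoint mechanics, assuming you already have one embedding per sentence (the 4×2 vectors below are synthetic — two sentences about one topic, then two about another):

```python
import numpy as np

def semantic_breakpoints(sent_embs: np.ndarray, percentile: float = 95) -> list[int]:
    """Indices after which to cut: adjacent cosine distance above the percentile."""
    a = sent_embs[:-1] / np.linalg.norm(sent_embs[:-1], axis=1, keepdims=True)
    b = sent_embs[1:] / np.linalg.norm(sent_embs[1:], axis=1, keepdims=True)
    dist = 1.0 - (a * b).sum(axis=1)  # cosine distance between neighbours
    threshold = np.percentile(dist, percentile)
    return [i for i, d in enumerate(dist) if d > threshold]

def chunk_by_breakpoints(sentences: list[str], cuts: list[int]) -> list[str]:
    chunks, start = [], 0
    for c in cuts:
        chunks.append(" ".join(sentences[start : c + 1]))
        start = c + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks

embs = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
cuts = semantic_breakpoints(embs)  # the topic shift sits between rows 1 and 2
print(chunk_by_breakpoints(["s0", "s1", "s2", "s3"], cuts))
```

The "noisy on technical writing" failure is visible here too: if every distance is similar, the percentile threshold cuts on noise or not at all.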
## 4. Hierarchical / parent-document retrieval

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

vectorstore = Chroma(
    collection_name="children",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-large"),
)
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)
```
How it works. Split the document twice: small child chunks for retrieval accuracy, larger parent chunks for context. You embed children, but the retriever returns the parent that contains the matching child. Small enough to match precisely, large enough to answer.
When it wins. Almost every real document-QA workload. Contracts, product docs, knowledge bases, runbooks. The small-child embedding finds the exact clause; the parent returns the surrounding section, so the generator sees the defined terms and cross-references.
When it loses. Short documents where a "parent" is the whole thing (you're just retrieving documents). Extremely token-constrained budgets, where five 2,000-character parents are more context than you can afford. Also adds operational weight: two stores to keep consistent, two splitters to tune.
Scores: recall 0.86, precision 0.79. The highest on the corpus. More on why below.
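The core idea fits in a few lines. A toy sketch, with keyword overlap standing in for vector search and hard character slices standing in for real splitters:

```python
def build_index(doc: str, parent_size: int = 2000, child_size: int = 400):
    """Split twice; every child remembers which parent it came from."""
    parents = [doc[i : i + parent_size] for i in range(0, len(doc), parent_size)]
    children = []  # (child_text, parent_id)
    for pid, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append((parent[j : j + child_size], pid))
    return parents, children

def retrieve(query: str, parents, children, k: int = 2) -> list[str]:
    q = set(query.lower().split())
    # Score the CHILDREN (small units match precisely)...
    scored = sorted(children,
                    key=lambda c: len(q & set(c[0].lower().split())),
                    reverse=True)
    # ...but return the PARENTS (large units answer well), deduplicated.
    seen, out = set(), []
    for _text, pid in scored[: k * 4]:
        if pid not in seen:
            seen.add(pid)
            out.append(parents[pid])
        if len(out) == k:
            break
    return out
```

The asymmetry is the whole trick: the unit you embed and the unit you return are different objects, linked by an id.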
## 5. Propositional chunking

```python
# Sketch — a real proposition extractor is an LLM call per passage.
import json

from openai import OpenAI

client = OpenAI()

PROMPT = """Extract the atomic, standalone factual propositions from
the passage. Each proposition must be true on its own without the
rest of the passage. Return JSON: {"propositions": ["..."]}"""

def propositions(passage: str) -> list[str]:
    r = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": passage},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(r.choices[0].message.content)["propositions"]
```
How it works. Use an LLM to decompose each passage into atomic, self-contained propositions. Embed the propositions. At retrieval time, match against propositions and optionally return the originating passage. Research pedigree: Chen et al., Dense X Retrieval (2023).
When it wins. Fact-dense corpora where questions map to single claims — medical guidelines, regulatory text, encyclopedias. Precision tends to be excellent because each proposition is a clean unit.
When it loses. Cost. You pay an LLM call per passage at ingest and re-pay on every corpus update. A 10k-document corpus can run $200–$800 to propositionalize, and that's before you discover your extractor dropped the context a clause needed. Also sensitive to the extractor's prompt: two engineers running the same code get different proposition sets.
Scores: recall 0.81, precision 0.84. Best precision on the corpus. Second-best recall. Expensive to maintain.
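The retrieval-time half — match propositions, return passages — reduces to a mapping from each proposition back to its source. A toy sketch with a stubbed extractor and keyword overlap in place of vector search:

```python
def build_prop_index(passages, extract):
    """extract(passage) -> list of propositions (the LLM call above)."""
    return [(prop, pid)
            for pid, passage in enumerate(passages)
            for prop in extract(passage)]

def retrieve_passages(query, index, passages, k=2):
    q = set(query.lower().split())
    # Match against the small, clean propositions...
    ranked = sorted(index,
                    key=lambda e: len(q & set(e[0].lower().split())),
                    reverse=True)
    # ...then return the originating passages, deduplicated.
    seen, out = set(), []
    for _prop, pid in ranked:
        if pid not in seen:
            seen.add(pid)
            out.append(passages[pid])
        if len(out) == k:
            break
    return out
```

Note that the index quality is exactly the extractor quality: a proposition that loses its context ("the fee is waived" — which fee?) poisons retrieval silently.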
## 6. Late chunking

```python
# Sketch of late chunking with a long-context embedder.
# Real implementation: jinaai/late-chunking on GitHub.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v3", trust_remote_code=True
)

def late_chunk_embeddings(doc: str, boundaries: list[tuple[int, int]]):
    # 1. Tokenize and embed the whole doc; keep per-token embeddings.
    inputs = tok(doc, return_tensors="pt", truncation=False)
    with torch.no_grad():
        out = model(**inputs)
    token_emb = out.last_hidden_state[0]
    # 2. Pool per chunk AFTER the long-context pass.
    return [token_emb[start:end].mean(dim=0) for start, end in boundaries]
```
How it works. Feed the whole document to a long-context embedder. Keep the per-token embeddings. Only then apply chunk boundaries, averaging the tokens inside each boundary into a chunk vector. Every chunk vector carries contextual information from the rest of the document — the pronoun "it" in chunk 7 was embedded next to its antecedent in chunk 2.
When it wins. Documents with heavy anaphora and implicit references: legal contracts, academic papers, narrative reports. Solves the "who does 'the Licensee' refer to in this chunk" problem at embed time.
When it loses. Requires a long-context embedder (Jina v3, Voyage-3, Cohere Embed 4, all with 8k–32k context). Harder to cache incrementally — changing one paragraph forces a re-embed of the whole doc. SDK support is thin outside Jina. Still early; few teams have production mileage on it past the 2024 paper.
Scores: recall 0.79, precision 0.76. Beats recursive, not parent-document. Worth watching as the tooling matures.
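One practical gap in the sketch above: `boundaries` is in token indices, but chunkers emit character offsets. A tokenizer's offset mapping (what Hugging Face fast tokenizers return with `return_offsets_mapping=True`) bridges the two. A pure-Python conversion, assuming every character span overlaps at least one token:

```python
def char_to_token_spans(char_spans, offsets):
    """Map character-level chunk boundaries to token-level ones.

    offsets: per-token (start_char, end_char) pairs, as returned by a
    Hugging Face fast tokenizer with return_offsets_mapping=True.
    """
    token_spans = []
    for c_start, c_end in char_spans:
        # first token that ends after the span starts
        t_start = next(i for i, (s, e) in enumerate(offsets) if e > c_start)
        # last token that starts before the span ends (end index is exclusive)
        t_end = max(i for i, (s, e) in enumerate(offsets) if s < c_end) + 1
        token_spans.append((t_start, t_end))
    return token_spans
```

Special tokens with `(0, 0)` offsets would need filtering first; that detail is omitted here.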
## The scorecard
| Strategy | Recall | Precision | Ingest cost (relative) | Ops weight |
|---|---|---|---|---|
| Fixed | 0.61 | 0.54 | 1× | trivial |
| Recursive | 0.74 | 0.68 | 1× | trivial |
| Semantic | 0.72 | 0.65 | 50× | medium |
| Parent-document | 0.86 | 0.79 | 1.2× | medium |
| Propositional | 0.81 | 0.84 | 200× | heavy |
| Late chunking | 0.79 | 0.76 | 3× | medium |
Single corpus. One retriever. One generator. Your mileage varies — but the shape is real, and matches numbers reported by teams who've done the same exercise on contracts, runbooks, and product docs.
## Why parent-document keeps winning
Look at where real questions fail. The retriever finds the right clause, but the generator needs two paragraphs of surrounding definitions to answer. Or it finds a row in a table, but needs the header two pages up to know what the row means. Or it finds a function, but needs the class docstring to know what the function does.
All three are the same failure: the matching unit is smaller than the answering unit. Parent-document retrieval splits those concerns. Embed at the size that matches well. Return at the size that answers well. Every other strategy forces a single chunk size to do both jobs, and every other strategy pays for it at one end or the other.
Semantic chunking tries to solve this by making chunks bigger when the topic is coherent. Propositional tries by making retrieval units tiny and hoping an LLM stitches them back. Late chunking tries by letting context bleed into embeddings. Parent-document says: stop. The problem isn't "find the perfect chunk." The problem is two separate optimizations.
The other reason parent-document wins in production is boring and undersold: it degrades gracefully. Bad chunks on a recursive splitter produce bad retrieval, which produces bad answers. Bad child chunks on parent-document retrieval still return a reasonable parent, because the parent is big enough to absorb child-level miscuts. When a new document type shows up in your corpus and breaks your child splitter, the parent still holds.
## The hype tells you otherwise
Semantic chunking gets the blog posts because the demo is visual — watch topics shift, watch chunks align. Propositional gets the papers because the precision numbers are beautiful. Late chunking gets the Twitter threads because the technical idea is genuinely clever.
Parent-document retrieval has been sitting in the LangChain codebase since 2023 under the unglamorous name ParentDocumentRetriever. It does not make a good demo. Nobody writes a Medium post titled "How We 10x'd Recall With A Hierarchical Retriever." And yet team after team, after running the matrix above on their own corpus, ends up shipping it.
## Picking for your corpus
The short version.
- Start with recursive. `chunk_size=800`, `chunk_overlap=100`, a decent separator list. Ship it. Measure on real questions.
- If recall is lagging and documents are structured, move to parent-document. Child size 400, parent size 2,000. Expect the jump you see in the table above.
- If your corpus is fact-dense and small, try propositional. Budget the ingest cost before you start — it is easy to underestimate.
- If your documents have heavy cross-reference (contracts, academic PDFs), pilot late chunking with a long-context embedder. Rerun your evals.
- Only move to semantic chunking if your corpus is narrative prose with clear topic shifts. Benchmark before committing — it's the strategy where demo results generalize the worst.
And the shortest version, for the team that's going to skim this to the last paragraph: if you're doing document QA, evaluate parent-document retrieval first. Do not let the conference circuit talk you out of it.
## If this was useful
Chapter 9 of Observability for LLM Applications covers retrieval instrumentation end-to-end — what to put on a retrieval span, how to catch silent recall regressions, and the RAG-specific eval rig that produced the numbers above. If you're shipping a RAG feature and the debugging feels like staring at a wall of chunk IDs, it's for you.
- Book: Observability for LLM Applications — paperback and hardcover now; ebook April 22.
- Hermes IDE: hermes-ide.com — the IDE for developers shipping with Claude Code and other AI tools.
- Me: xgabriel.com · github.com/gabrielanhaia.
