RAG Chunk Size in 2026: The Decision Table Most Teams Skip

#rag #ai #llm #howto

Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You wired up your RAG pipeline. The tutorial said chunk_size=1000, chunk_overlap=200, so that's what you set. It worked on the demo corpus. Then you pointed it at the real documents: a mix of API reference, support tickets, legal contracts, and a 400-page product manual. Recall dropped. Some answers cite half a sentence. Others dump three pages of context into the prompt and the model loses the thread.

The default did not break. It was never the right number for four different document types at once. Chunk size is one knob that everyone copies from a starter repo and almost nobody measures against their own corpus.

The two things chunk size trades against

Every chunk size is a bet between two failure modes.

Make chunks too small and you win on retrieval precision but lose context. The embedding for a 150-token chunk is tight, so it matches a query closely. But the matched chunk is a fragment. "The Licensee shall indemnify…" retrieves cleanly and answers nothing, because the scope clause sat in the next chunk you did not retrieve.

Make chunks too large and you win on context but lose precision and pay more. A 2000-token chunk carries the full clause, but its embedding is an average over many topics. A narrow query matches it weakly, so it ranks lower than a smaller chunk that is only loosely related. And every large chunk you do retrieve burns more of your context budget, which is real money per query and real latency at scale.

So the question is never "what is the best chunk size." It is "what is the best chunk size for this kind of document, given the queries my users actually ask."
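
To put a number on the cost side of that bet, here is a back-of-the-envelope sketch. The function name and the price per million input tokens are placeholders I made up for illustration, not figures from any provider, so plug in your own rate.

```python
def context_cost_per_query(chunk_tokens: int, k: int,
                           usd_per_million_input_tokens: float) -> float:
    """Estimated prompt cost of the retrieved context for one query.

    Assumes k chunks of roughly chunk_tokens each are pasted into the prompt.
    The price argument is a placeholder; use your provider's actual rate.
    """
    context_tokens = chunk_tokens * k
    return context_tokens * usd_per_million_input_tokens / 1_000_000


# Hypothetical rate of 3.00 USD per million input tokens, top-5 retrieval.
small = context_cost_per_query(150, k=5, usd_per_million_input_tokens=3.0)
large = context_cost_per_query(2000, k=5, usd_per_million_input_tokens=3.0)
```

Whatever the real rate is, the ratio holds: at the same k, the 2000-token chunks put about 13 times as many context tokens into every prompt as the 150-token ones.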

Why one number cannot win

The reason a single chunk size fails is that document types have different natural units of meaning.

An API reference is dense and self-contained per entry. A function signature plus its description is a complete idea in 120 tokens. Chunk that at 1000 and you glue roughly eight unrelated functions together; the embedding smears across all of them.

A legal contract is the opposite. The unit of meaning is the clause, and clauses reference definitions pages away. Chunk too small and you sever the references. You want larger chunks with generous overlap, or a parent-child split.

Support tickets are short and conversational. One ticket is often one chunk. Splitting a 300-token ticket into three pieces gives you three fragments of a single complaint.

Narrative prose (a manual, a runbook) sits in the middle. Sections are the unit. Split on headings, not on a fixed token count.

Same pipeline, four natural chunk sizes. A fixed chunk_size=1000 is wrong for at least three of them.
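
For the prose case, splitting on headings can be sketched in a few lines. This assumes the manual is Markdown with `#`-style headings; the function name is mine, and a `#` line inside a code sample would be mistaken for a heading, so treat it as a starting point.

```python
import re

# A Markdown heading: one to six '#' characters, a space, then the title.
HEADING = re.compile(r"^#{1,6} .+$", re.MULTILINE)


def split_on_headings(text: str) -> list[str]:
    """Split text into sections, each starting at a Markdown heading.

    Text before the first heading, if any, becomes its own section.
    """
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    sections = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if s]
```

A section that still exceeds your target size needs a second pass with a sentence-level splitter; this only finds the natural boundaries.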

The decision table

Start here, then tune with an eval. These are starting points by document type, not laws.

Document type	Chunk size (tokens)	Overlap	Split strategy	Why
API / code reference	100–250	0–10%	Per-entry (function, endpoint)	Each entry is self-contained; small chunks keep embeddings sharp
Support tickets / chat	200–400	0%	One message or thread per chunk	Conversational unit is already small
Narrative prose / manuals	400–800	10–15%	Recursive on headings, then sentences	Section is the unit; overlap saves cross-paragraph context
Legal / contracts	600–1000 + parent doc	15–20%	Parent-child (clause → section)	Clauses reference distant definitions
Tables / structured data	One row group per chunk	0%	Row-aware, keep header with rows	Structure carries the meaning, not prose flow
Q&A / FAQ	1 pair per chunk	0%	Per question-answer pair	The pair is the atomic unit

The numbers assume a modern embedding model with an 8K input window, so the ceiling is not your constraint; precision is. If you run an older model with a 512-token limit, your upper bound drops and the legal row needs parent-child retrieval, not big chunks.
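
Parent-child retrieval means you embed and search small child chunks, then hand the model the larger parent section each hit came from. A minimal sketch with plain data structures follows. Every name here is hypothetical, whitespace words stand in for real tokens, and the vector search that produces the hits is left to your own stack.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    chunk_id: str
    parent_id: str   # the section this clause-sized piece came from
    text: str        # the small text that gets embedded and searched


def make_children(parent_id: str, parent_text: str,
                  child_words: int = 120) -> list[ChildChunk]:
    """Cut one parent section into small children that remember their parent.

    Splits on whitespace words as a stand-in for a token-based splitter.
    """
    words = parent_text.split()
    children = []
    for n, start in enumerate(range(0, len(words), child_words)):
        children.append(ChildChunk(
            chunk_id=f"{parent_id}#{n}",
            parent_id=parent_id,
            text=" ".join(words[start:start + child_words]),
        ))
    return children


def expand_to_parents(hits: list[ChildChunk],
                      parents: dict[str, str]) -> list[str]:
    """Map child hits back to parent texts, deduplicated, rank order kept."""
    seen: set[str] = set()
    out = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            out.append(parents[hit.parent_id])
    return out
```

The small child keeps the embedding sharp; the parent restores the definitions and scope the clause depends on. Two hits from the same section cost you one parent, not two.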

Overlap is not free padding

Overlap exists to stop an idea from being cut in half at a boundary. A sentence that starts in chunk 4 and finishes in chunk 5 should appear whole in at least one of them. Overlap buys that.

But overlap is duplication. Each chunk advances by only its non-overlapping part, so at 20% overlap a corpus that would otherwise produce 100K chunks produces roughly 125K: about 25% more vectors to embed and store. At query time you also retrieve near-duplicate chunks that crowd out diverse results. Two of your top five come back nearly identical because they share an overlapping span.
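
You can check the storage cost of overlap by counting chunks directly. Each new chunk starts one stride (chunk size minus overlap) after the previous one, so the count grows by about 1 / (1 - overlap fraction). The helper below is a sketch with a made-up name, and it treats the corpus as one continuous token stream, which per-document splitting only approximates.

```python
def chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of fixed-size chunks a sliding window produces over total_tokens.

    Each new chunk starts (chunk_size - overlap) tokens after the previous one.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if total_tokens <= chunk_size:
        return 1 if total_tokens > 0 else 0
    stride = chunk_size - overlap
    remaining = total_tokens - chunk_size
    # Integer ceiling division: chunks needed to cover what the first one left.
    return 1 + (remaining + stride - 1) // stride


# 50M tokens in 500-token chunks, without overlap and with 20% overlap.
no_overlap = chunk_count(50_000_000, chunk_size=500, overlap=0)
with_overlap = chunk_count(50_000_000, chunk_size=500, overlap=100)
```

Here `no_overlap` is 100,000 and `with_overlap` is 125,000, so a 20% overlap means a quarter more vectors to embed, store, and search.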

The interplay with chunk size is the part teams miss. Small chunks need little or no overlap, because the split lands often enough that no single idea spans many chunks. Large chunks need more, because a boundary between 1000-token chunks can sever a multi-sentence argument. Scale overlap with chunk size; do not pick one global percentage.

def overlap_for(chunk_size: int) -> int:
    # Smaller chunks split often; less overlap needed.
    if chunk_size <= 250:
        return 0
    if chunk_size <= 600:
        return int(chunk_size * 0.10)
    return int(chunk_size * 0.18)

A recursive splitter that respects structure beats a blind character window every time. Split on the biggest separator that fits, fall back to smaller ones. The example below uses LangChain's RecursiveCharacterTextSplitter.

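from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
)

def splitter_for(doc_type: str):
    # Target chunk sizes in tokens, taken from the decision table.
    size = {
        "api": 200,
        "ticket": 300,
        "prose": 600,
        "legal": 900,
    }[doc_type]
    # The plain constructor measures chunk_size in characters. Counting
    # with a tiktoken encoding keeps it in tokens (needs the tiktoken
    # package; cl100k_base is an assumed encoding, match your model).
    return RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=size,
        chunk_overlap=overlap_for(size),
        separators=["\n\n", "\n", ". ", " ", ""],
    )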

The separators order matters. It tries paragraph breaks first, then lines, then sentences, then words. A blind splitter that cuts at exactly N characters will slice mid-word and produce a chunk that starts with " tion shall apply".
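
To see the fallback idea without any library, here is a stripped-down sketch. It is not LangChain's algorithm: it adds no overlap, measures length in characters, and drops the separator at each cut point. The function name is mine.

```python
def recursive_split(text: str, max_chars: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")
                    ) -> list[str]:
    """Split text into pieces of at most max_chars, coarsest separator first."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: hard-cut as a last resort.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    pieces: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_chars:
            current = candidate          # keep packing parts into this piece
            continue
        if current:
            pieces.append(current)
        if len(part) <= max_chars:
            current = part               # start a new piece with this part
        else:
            # This part alone is too big: retry it with finer separators.
            pieces.extend(recursive_split(part, max_chars, finer))
            current = ""
    if current:
        pieces.append(current)
    return [p for p in pieces if p.strip()]
```

Short paragraphs survive whole. Only a paragraph that exceeds the limit gets broken further, and even then the cut lands between words, not inside one.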

Route by document type at ingestion

The table only pays off if your pipeline knows what kind of document it is chunking. Tag the source at ingestion and pick the splitter from the tag.

def chunk_document(text: str, doc_type: str) -> list[dict]:
    splitter = splitter_for(doc_type)
    pieces = splitter.split_text(text)
    return [
        {
            "text": piece,
            "doc_type": doc_type,
            "chunk_index": i,
        }
        for i, piece in enumerate(pieces)
    ]

If you cannot tag by hand, a cheap classifier on the first 500 characters gets you most of the way: code fences and signatures mean API reference, "WHEREAS" and clause numbering mean legal, a heading tree means prose. Store the type as chunk metadata. It earns its keep later for filtering and for debugging which chunk size produced a bad answer.
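
A cheap classifier along those lines might look like the sketch below. The patterns are rough guesses to tune against your own corpus, the function name is hypothetical, and falling back to "ticket" when nothing matches is my assumption. The labels match the keys used by splitter_for.

```python
import re

FENCE = "`" * 3  # a Markdown code fence marker


def guess_doc_type(text: str, sample_chars: int = 500) -> str:
    """Guess a document type from the first sample_chars characters."""
    head = text[:sample_chars]
    # Legal: recital keyword, or clause numbering such as "1.1 " or "2.3.4 ".
    if "WHEREAS" in head or re.search(r"^\s*\d+(\.\d+)+\s", head, re.MULTILINE):
        return "legal"
    # API reference: code fences, function signatures, or HTTP verbs.
    signature = r"^\s*(def|func|class|GET|POST|PUT|DELETE)\s"
    if FENCE in head or re.search(signature, head, re.MULTILINE):
        return "api"
    # Prose: at least one Markdown heading.
    if re.search(r"^#{1,6} ", head, re.MULTILINE):
        return "prose"
    # Assumed fallback: short unstructured text is treated as a ticket.
    return "ticket"
```

Spot-check the guesses on a few hundred documents before trusting them. A wrong tag silently picks the wrong chunk size.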

Measure before you guess

Here is the part that separates a pipeline that ships from one that stalls. You do not pick chunk size by reading a blog post. You sweep it against your own eval set.

Build a gold set: 100–200 real queries from your domain, each labelled with the document IDs that contain the answer. Then run a sweep over candidate chunk sizes and read two numbers per size.

def context_recall(retrieved_ids, gold_ids) -> float:
    if not gold_ids:
        return 0.0
    hits = len(set(retrieved_ids) & set(gold_ids))
    return hits / len(gold_ids)

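def sweep_chunk_sizes(corpus, gold, sizes):
    # build_index and index.search are placeholders for your own
    # embedding and vector-store code.
    results = {}
    for size in sizes:
        index = build_index(corpus, chunk_size=size,
                            overlap=overlap_for(size))
        recalls, tokens = [], []
        for query, gold_ids in gold:
            hits = index.search(query, k=5)
            # h.id must be the ID of the source document, not of the
            # chunk: gold labels are per document, and chunk IDs change
            # with every chunk size.
            recalls.append(
                context_recall([h.id for h in hits], gold_ids)
            )
            tokens.append(sum(h.token_count for h in hits))
        results[size] = {
            "recall_at_5": sum(recalls) / len(recalls),
            "avg_context_tokens": sum(tokens) / len(tokens),
        }
    return results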

Two columns tell the story: recall@5 and average context tokens per query. Recall climbs as chunks get bigger, up to a point, then flattens. Context tokens climb the whole way. The right size is where recall stops improving meaningfully but tokens are still climbing. That knee is your answer, and it lands in a different place for each document type.
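
If you would rather not eyeball the knee, a simple rule is to take the smallest size after which recall gains drop below a threshold. The sketch below assumes the result shape returned by sweep_chunk_sizes. The function name and the one-point default threshold are arbitrary choices of mine, and the numbers in the example are made up for illustration, not measurements.

```python
def pick_knee(results: dict[int, dict[str, float]],
              min_gain: float = 0.01) -> int:
    """Return the smallest chunk size after which recall gains fall below min_gain.

    results maps size -> {"recall_at_5": ..., "avg_context_tokens": ...}.
    """
    sizes = sorted(results)
    if not sizes:
        raise ValueError("results is empty")
    for current, following in zip(sizes, sizes[1:]):
        gain = results[following]["recall_at_5"] - results[current]["recall_at_5"]
        if gain < min_gain:
            return current
    return sizes[-1]  # recall was still improving at the largest size tested


# Made-up sweep output, for illustration only.
example = {
    200: {"recall_at_5": 0.61, "avg_context_tokens": 950.0},
    400: {"recall_at_5": 0.74, "avg_context_tokens": 1900.0},
    600: {"recall_at_5": 0.81, "avg_context_tokens": 2850.0},
    800: {"recall_at_5": 0.815, "avg_context_tokens": 3800.0},
    1000: {"recall_at_5": 0.82, "avg_context_tokens": 4700.0},
}
knee = pick_knee(example)
```

With these example numbers the knee is 600: going to 800 buys half a recall point and roughly a third more context tokens. On a noisy eval a single flat step can be a fluke, so look at the whole curve before committing.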

Re-run the sweep on every embedding-model change and every corpus shift. A model swap moves the knee. A new document source moves it again. The eval is what makes the move cheap.

The short version

Stop shipping one chunk size for every document. Tag documents by type at ingestion, pick the starting size from the table, scale overlap with chunk size instead of hard-coding 200, and confirm with a recall-vs-tokens sweep on your own gold set. The default chunk_size=1000 from the starter repo is a guess. Your corpus deserves a measurement.

If this was useful

Chunk size is one knob in a longer chain, and getting it right depends on the retriever, the reranker, and the eval rig around it. The RAG Pocket Guide walks through chunking strategy, the recall-vs-context trade-offs, and how to build the eval set that tells you which numbers to trust. If your retrieval layer is the weak link, that's the place to start.