Chunk Overlap: The RAG Parameter Most Teams Pick Wrong


Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You ship a RAG pipeline. You picked chunk_size=1000, chunk_overlap=200, because that's what every LangChain tutorial does. Recall is middling, the answers are vague, and you start blaming the embedding model.

The model isn't the problem. The overlap is. And almost nobody runs the ablation that proves it.

What chunk overlap actually does

Pretend you're chunking a SaaS contract. Section 7 reads:

"The Service Provider shall maintain uptime of 99.95% measured monthly. Failure to meet this threshold for two consecutive months entitles the Customer to a service credit of 10% of the monthly fee, applied against the following billing cycle."

If your chunker splits around character 170 with zero overlap, chunk A ends at "...two consecutive months entitles the Customer to a service credit" and chunk B starts with "of 10% of the monthly fee...". A user asks "what's the SLA credit?". The retriever pulls chunk B. The answer reads "of 10% of the monthly fee, applied against the following billing cycle" with no anchor to what triggered the credit, what the SLA threshold is, or what "this threshold" means.

The model hallucinates the missing context. You ship a wrong answer about a customer's contract.

Overlap is the duplication you keep at the boundary so neither chunk is orphaned from its meaning. If chunks A and B share the last 150 characters of A, the "two consecutive months" clause appears in both. Chunk B now carries its own trigger condition, so it can answer the question without chunk A.
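To see the boundary effect on the clause above, here is a minimal sketch. The `chunk_text` helper, the forced split point, and the 80-character overlap are illustrative choices, not defaults from any library.

```python
def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size character windows; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

CLAUSE = (
    "The Service Provider shall maintain uptime of 99.95% measured monthly. "
    "Failure to meet this threshold for two consecutive months entitles the "
    "Customer to a service credit of 10% of the monthly fee, applied against "
    "the following billing cycle."
)

TRIGGER = "two consecutive months"
REMEDY = "10% of the monthly fee"

def self_contained(chunks: list[str]) -> list[str]:
    """Chunks that carry both the trigger condition and the remedy."""
    return [c for c in chunks if TRIGGER in c and REMEDY in c]

# Assumption: force the split to land right before "of 10%", as in the example.
size = CLAUSE.index("of 10%")

no_overlap = chunk_text(CLAUSE, size, overlap=0)
with_overlap = chunk_text(CLAUSE, size, overlap=80)

print(len(self_contained(no_overlap)))    # chunks holding both halves, no overlap
print(len(self_contained(with_overlap)))  # chunks holding both halves, with overlap
```

Without overlap, no single chunk holds both the trigger and the remedy. With overlap, the second chunk does.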

That's the mechanic. The question is how much.

Why 200 became the default

LangChain shipped RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200) as its first canonical example around 2023. 20% overlap on a 1000-character chunk. It made sense for a generic blog-post demo. It was never meant as a production recommendation.

The number propagated. Tutorials copied it. Boilerplate generators baked it in. By 2024, a large share of the RAG pipelines in production were running it without anyone questioning the fit.

Here's what 200 tokens of overlap does to your corpus economics. If your average document is 4,000 tokens, 1000-token chunks with 200 overlap means you store 5 chunks per document instead of 4. That's a 25% storage tax. You also embed about 800 extra tokens per document (four overlaps of 200 each). On a 10M document corpus with text-embedding-3-large at $0.13/M tokens, that's roughly 8B extra tokens, or about $1,040 in one-time embedding cost, plus ongoing vector storage you don't need. If the recall payoff isn't there, you're paying twice.
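The arithmetic in that example can be checked in a few lines. This is a back-of-the-envelope sketch: `chunk_lengths` is a made-up helper that mirrors a simple sliding-window chunker, the price is the $0.13 per million tokens quoted above, and every document is assumed to be exactly 4,000 tokens long.

```python
PRICE_PER_M_TOKENS = 0.13  # text-embedding-3-large price quoted in the text
N_DOCS = 10_000_000
DOC_TOKENS = 4_000         # assumption: every document has the average length

def chunk_lengths(doc_tokens: int, size: int, overlap: int) -> list[int]:
    """Token length of each chunk a sliding window produces for one document."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [min(size, doc_tokens - start) for start in range(0, doc_tokens, step)]

base = chunk_lengths(DOC_TOKENS, size=1000, overlap=0)
overlapped = chunk_lengths(DOC_TOKENS, size=1000, overlap=200)

# Extra vectors stored, as a fraction of the no-overlap baseline.
extra_chunks = len(overlapped) / len(base) - 1
# Extra tokens sent to the embedding model across the whole corpus.
extra_tokens = (sum(overlapped) - sum(base)) * N_DOCS
extra_cost = extra_tokens / 1_000_000 * PRICE_PER_M_TOKENS

print(f"chunks per doc: {len(base)} -> {len(overlapped)} ({extra_chunks:.0%} more)")
print(f"extra tokens embedded: {extra_tokens:,}")
print(f"extra embedding cost: ${extra_cost:,.0f}")
```

Swap in your own document count, average length, and price to get the numbers for your corpus.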

And on structured documents (API docs, FAQ pages, code, anything where each chunk is naturally self-contained), that 200-token overlap is pure waste. The next chunk doesn't need 200 tokens of context because it already starts a new logical unit.

The 60-line ablation

The tutorials skip this part: you can prove what your corpus wants in under an hour. Build a tiny labelled eval set (50 query-document pairs is enough to see the curve), then sweep.

from dataclasses import dataclass
from openai import OpenAI
import numpy as np

client = OpenAI()

@dataclass
class EvalCase:
    query: str
    expected_chunk_text: str  # the gold passage that answers it
    document_id: str

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return np.array([d.embedding for d in resp.data])

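def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    # overlap >= size would make the step zero or negative and loop forever.
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks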

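def recall_at_k(eval_cases: list[EvalCase], docs: dict[str, str],
                size: int, overlap: int, k: int = 5) -> float:
    hits = 0
    # Embed each document's chunks once, not once per eval case.
    cache: dict[str, tuple[list[str], np.ndarray]] = {}
    for case in eval_cases:
        if case.document_id not in cache:
            chunks = chunk_text(docs[case.document_id], size, overlap)
            cache[case.document_id] = (chunks, embed(chunks))
        # Retrieval here searches only the gold document's chunks.
        chunks, chunk_vecs = cache[case.document_id]
        query_vec = embed([case.query])[0]
        # OpenAI embeddings are unit length, so the dot product is cosine similarity.
        scores = chunk_vecs @ query_vec
        top_k = np.argsort(scores)[-k:][::-1]
        retrieved = [chunks[i] for i in top_k]
        # Hit if the first 80 characters of the gold passage sit in any top-k chunk.
        if any(case.expected_chunk_text[:80] in c for c in retrieved):
            hits += 1
    return hits / len(eval_cases)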

def sweep(eval_cases, docs, size=1000, overlaps=(0, 50, 100, 200, 400)):
    print(f"chunk_size={size}")
    for overlap in overlaps:
        r = recall_at_k(eval_cases, docs, size, overlap, k=5)
        n_chunks = sum(
            len(chunk_text(d, size, overlap)) for d in docs.values()
        )
        print(f"  overlap={overlap:>3}  recall@5={r:.3f}  chunks={n_chunks}")

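Load your 50 labelled cases, call sweep(cases, docs), and you'll see the curve in roughly 30 seconds of OpenAI API time per overlap value. Sizes and overlaps in this sweep are character counts, because chunk_text slices the string directly; swap in a tokenizer if your pipeline chunks by tokens. The output is the kind of table that ends an argument: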

chunk_size=1000
  overlap=  0  recall@5=0.620  chunks=2412
  overlap= 50  recall@5=0.680  chunks=2538
  overlap=100  recall@5=0.740  chunks=2680
  overlap=200  recall@5=0.760  chunks=3015
  overlap=400  recall@5=0.760  chunks=3814

That's a real shape: recall climbs fast from 0 to 100, gains only two more points by 200, and 400 is pure storage tax with zero recall gain. On that corpus, the defensible answer is 100, not 200. You shave 11% of your vector store and give up two points of recall, which is one query out of 50.
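One way to turn a sweep table into a decision is to pick the smallest overlap whose recall is within a tolerance of the best value. The sketch below is one possible definition of the "elbow", not a standard metric; `pick_overlap` and `tolerance` are hypothetical names, and the rows are the sample output above.

```python
def pick_overlap(rows: list[tuple[int, float, int]], tolerance: float = 0.02) -> int:
    """Smallest overlap whose recall is within `tolerance` of the best recall.

    rows: (overlap, recall, n_chunks) tuples from a sweep.
    """
    best = max(recall for _, recall, _ in rows)
    # Small epsilon so a 0.760 - 0.740 gap is not rejected by float rounding.
    eligible = [o for o, recall, _ in rows if best - recall <= tolerance + 1e-9]
    return min(eligible)

SWEEP = [
    (0, 0.620, 2412),
    (50, 0.680, 2538),
    (100, 0.740, 2680),
    (200, 0.760, 3015),
    (400, 0.760, 3814),
]

print(pick_overlap(SWEEP, tolerance=0.02))  # accepts a two-point recall drop
print(pick_overlap(SWEEP, tolerance=0.0))   # insists on the best recall
```

With 50 eval cases, a tolerance of 0.02 is a single query, so treat differences that small as noise unless a larger eval set confirms them.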

Run the same sweep on three different document types in your corpus and you'll see three different curves.

Per-corpus shapes

Narrative text wants more overlap. Long-form prose (legal, medical, transcripts, support tickets) refers backwards across sentence boundaries constantly. "This obligation", "the aforementioned party", "the issue described above". A 50-token overlap rips that context apart. Narrative corpora typically peak between 200 and 300.

Structured documents want less. API reference docs where each endpoint is its own H2 section. FAQ pages where each Q&A is atomic. Code where each function carries its own signature. These are self-contained by construction, so adding overlap duplicates context that's already complete. The curve flattens around 50 and never goes higher.

Hybrid corpora (a docs site that mixes both) want a chunker that respects structure first and only adds overlap inside long prose sections. That's a bigger change than tuning one number, but the ablation is what tells you whether you need it.
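A structure-first chunker could look something like the sketch below. It assumes sections are separated by blank lines, which is a stand-in for whatever structure your documents really have (headings, HTML tags, function boundaries). `structure_first_chunks` and `sliding_windows` are hypothetical helpers, not library functions.

```python
import re

def sliding_windows(text: str, size: int, overlap: int) -> list[str]:
    """Character windows of `size` that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + size])
        if start + size >= len(text):
            break  # text is fully covered; skip a redundant tail window
    return windows

def structure_first_chunks(doc: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Keep short sections whole; use overlapping windows only inside long ones."""
    chunks = []
    # Assumption: a blank line marks a section boundary.
    for section in re.split(r"\n\s*\n", doc):
        section = section.strip()
        if not section:
            continue
        if len(section) <= size:
            chunks.append(section)  # self-contained unit, no overlap needed
        else:
            chunks.extend(sliding_windows(section, size, overlap))
    return chunks

# A short FAQ entry followed by a long prose section (filler text).
doc = "Q: How do I reset my password?\nA: Use the reset link.\n\n" + "x" * 2500
chunks = structure_first_chunks(doc, size=1000, overlap=200)
print([len(c) for c in chunks])
```

The FAQ entry stays whole as a single chunk, and only the long section pays the overlap cost.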

Run the sweep before you build that. The number you find might be 0. It might be 300. You won't know until you measure.

Overlap interacts with reranker top-K

Overlap doesn't live alone. If you're running a reranker after retrieval (and you should be), the interaction matters. Higher overlap means more chunks compete for the same slot, which means the reranker has to discriminate between near-duplicates. Top-K of 5 on a corpus with 200 overlap often retrieves 3 versions of the same passage. The reranker spends its budget ranking duplicates instead of finding diverse evidence.

The fix is the joint sweep. Run overlap and reranker top-K as a 2D grid:

def joint_sweep(eval_cases, docs, size=1000,
                overlaps=(0, 100, 200), retrieval_ks=(5, 10, 20),
                final_k=3):
    for overlap in overlaps:
        for k in retrieval_ks:
            r = recall_at_k_with_rerank(
                eval_cases, docs, size, overlap,
                retrieval_k=k, final_k=final_k,
            )
            print(f"overlap={overlap:>3}  k={k:>2}  "
                  f"recall@{final_k}={r:.3f}")
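The grid calls a recall_at_k_with_rerank helper that isn't shown. Here is one way it might look, as a sketch: `embed_fn` and `rerank_fn` are hypothetical injection points (bind them with `functools.partial` to get the six-argument call used in the grid), and the toy functions exist only so the example runs without an API key or a real reranker.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass
class EvalCase:
    query: str
    expected_chunk_text: str
    document_id: str

def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    step = size - overlap
    return [text[s:s + size] for s in range(0, len(text), step)]

def recall_at_k_with_rerank(
    eval_cases: list[EvalCase], docs: dict[str, str],
    size: int, overlap: int, retrieval_k: int, final_k: int,
    embed_fn: Callable[[list[str]], np.ndarray],
    rerank_fn: Callable[[str, list[str]], list[float]],
) -> float:
    """Recall@final_k after retrieving retrieval_k chunks and reranking them."""
    hits = 0
    for case in eval_cases:
        chunks = chunk_text(docs[case.document_id], size, overlap)
        # Stage 1: vector retrieval of the top retrieval_k candidates.
        scores = embed_fn(chunks) @ embed_fn([case.query])[0]
        candidates = [chunks[i] for i in np.argsort(scores)[::-1][:retrieval_k]]
        # Stage 2: rerank the candidates and keep the best final_k.
        order = np.argsort(rerank_fn(case.query, candidates))[::-1][:final_k]
        final = [candidates[i] for i in order]
        if any(case.expected_chunk_text[:80] in c for c in final):
            hits += 1
    return hits / len(eval_cases)

def toy_embed(texts: list[str]) -> np.ndarray:
    """Stand-in embedder: unit-length letter-frequency vectors."""
    vecs = np.zeros((len(texts), 26))
    for row, text in enumerate(texts):
        for ch in text.lower():
            if "a" <= ch <= "z":
                vecs[row, ord(ch) - ord("a")] += 1
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-12)

def toy_rerank(query: str, candidates: list[str]) -> list[float]:
    """Stand-in reranker: number of query words found in each candidate."""
    words = set(query.lower().split())
    return [float(sum(w in c.lower() for w in words)) for c in candidates]

docs = {"faq": "Refunds are issued within thirty days of purchase."}
cases = [
    EvalCase("refund within how many days",
             "Refunds are issued within thirty days", "faq"),
    EvalCase("what is the uptime target",
             "uptime of 99.95% measured monthly", "faq"),
]
recall = recall_at_k_with_rerank(
    cases, docs, size=1000, overlap=200, retrieval_k=5, final_k=3,
    embed_fn=toy_embed, rerank_fn=toy_rerank,
)
print(recall)
```

In a real run you would pass your embedding call as `embed_fn` and your cross-encoder or hosted reranker as `rerank_fn`. The second eval case is a deliberate miss, since its gold passage is not in the document.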

A real output from a customer-support corpus:

overlap=  0  k= 5  recall@3=0.640
overlap=  0  k=10  recall@3=0.700
overlap=  0  k=20  recall@3=0.720
overlap=100  k= 5  recall@3=0.700
overlap=100  k=10  recall@3=0.740
overlap=100  k=20  recall@3=0.740
overlap=200  k= 5  recall@3=0.680  <- duplicates crowding out diversity
overlap=200  k=10  recall@3=0.760
overlap=200  k=20  recall@3=0.760

Notice what overlap=200, k=5 does. The reranker top-3 is dominated by three near-identical chunks. Recall actually drops below overlap=100, k=5. That's the kind of pathology you only see in the joint sweep. The fix is either raising retrieval_k or dropping overlap; either works, but you have to see the grid to know which is cheaper on your hardware.

The gotcha: duplicate chunks past a point

There's a quieter failure mode that only shows up in production: at high overlap, you start retrieving the same logical passage three times under different chunk boundaries. The model sees three nearly-identical context blocks and starts treating the duplication as emphasis ("this fact is mentioned three times, must be important") when it's actually just an artifact of your chunker.

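You can verify this by measuring, per query, the share of retrieved chunks that are distinct, then averaging. Count near-duplicates, not exact matches: len(set(retrieved_chunks)) / len(retrieved_chunks) stays at 1.0 for overlapping chunks, because they share text without being identical strings. If the near-duplicate-aware ratio is under 0.7, you're paying for duplication. The fix is either a deduplication step in retrieval (Jaccard on chunk text, drop the lower-scoring near-duplicate) or, better, reducing overlap until the ratio stays above 0.85.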
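The Jaccard-based deduplication mentioned above might look like this. It is a sketch under assumptions: word-level Jaccard as the similarity, a greedy pass over results in rank order, and a `threshold` you would have to tune, since two chunks that share only their overlap region score far lower than two copies of the same text. An exact-match set() would not catch either case.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def dedupe(ranked_chunks: list[str], threshold: float = 0.5) -> list[str]:
    """Keep a chunk only if it is not a near-duplicate of a higher-ranked one.

    ranked_chunks must be sorted best-first, so the lower-scoring copy is dropped.
    """
    kept: list[str] = []
    for chunk in ranked_chunks:
        if all(jaccard(chunk, k) < threshold for k in kept):
            kept.append(chunk)
    return kept

def distinct_ratio(ranked_chunks: list[str], threshold: float = 0.5) -> float:
    """Share of retrieved chunks that survive near-duplicate removal."""
    if not ranked_chunks:
        return 1.0
    return len(dedupe(ranked_chunks, threshold)) / len(ranked_chunks)

# Hypothetical retrieval result: the first two chunks overlap heavily.
retrieved = [
    "failure to meet this threshold for two consecutive months entitles the customer",
    "to meet this threshold for two consecutive months entitles the customer to a credit",
    "invoices are payable within thirty days of receipt",
]

print(len(set(retrieved)) / len(retrieved))  # exact-match ratio
print(distinct_ratio(retrieved))             # near-duplicate-aware ratio
```

Average `distinct_ratio` over your eval queries to get the per-corpus number, and pass the output of `dedupe` to the model if you decide to keep the higher overlap.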

What to do tomorrow

Pick three documents from your corpus that represent its real shape. Build 50 labelled query-passage pairs against them (an afternoon's work; do it once, reuse forever). Run the sweep. Then run the joint sweep against your reranker config. Pick the elbow point on the recall curve where storage stops paying for itself.

Most teams will land somewhere different from 200. Some will go to 50 and save 15% of their vector store. Some will go to 300 and finally answer the questions they've been losing on. Either way, the sweep is what tells you. The default isn't your friend.

What overlap is your pipeline running today, and have you actually measured the recall curve, or are you trusting the LangChain default? Drop the number in the comments.

If this was useful

Chunking and overlap are one chapter of a longer story. The chapter on retrieval signal in the RAG Pocket Guide walks through chunk-strategy ablations, reranker-depth tuning, and the joint sweeps that make the difference between a demo-quality pipeline and one you can defend in front of an SLA. If you're past the prototype and into the part where every recall point matters, that's the chapter to start with.