Alex Spinov

Posted on Jun 28 • Originally published at blog.spinov.online

RAG Chunking: Overlap=0 Drops Facts on the Boundary

#ai #llm #rag #python

Your RAG demo answers every question. Then it ships, and it whiffs on the simplest fact in the corpus. The model is fine. The retriever is fine. The thing that broke is the chunker, and the fix is not the semantic splitter you are about to install.

A fixed-size chunker with overlap=0 cuts a fact in half at the window edge. The chunk you retrieve carries the words from the question but not the value that answers it. On a small synthetic corpus, naive chunking recalls 1 of 5 facts. Adding 24 characters of overlap recovers 3 of them, to 4 of 5. The fifth shows you where overlap stops helping. Here is the whole thing, output included.

corpus: 502 chars, chunk=80
NAIVE  overlap=0   -> 7 chunks, recall@1 = 1/5
   MISS  'refund window'      need '30 days'
   MISS  'express shipping'   need '2 business days'
   MISS  'warranty coverage'  need '12 months'
   HIT   'restocking fee'     need '15 percent'
   MISS  'customs clearance'  need '9 dollars'
FIX    overlap=24  -> 9 chunks, recall@1 = 4/5
   HIT   'refund window'      need '30 days'
   HIT   'express shipping'   need '2 business days'
   HIT   'warranty coverage'  need '12 months'
   HIT   'restocking fee'     need '15 percent'
   MISS  'customs clearance'  need '9 dollars'

CEILING (printed, not hidden):
  - overlap only recovers facts whose anchor->value gap < overlap (24 chars).
  - 'customs clearance' -> '9 dollars' gap > overlap: still MISS with overlap.
  - cost of overlap: index grew 7 -> 9 chunks (duplicated text, more tokens).
  - recall@1 here is on a synthetic corpus; real hit-rate is not claimed.

TL;DR

overlap=0 on a fixed-size chunker silently splits a fact across two chunks. Neither chunk holds the whole answer.
The chunk that scores best for your question can be the half with the question's words and none of the value.
A small overlap (here, 24 chars) re-joins facts whose anchor and value are close. On this corpus, recall went 1/5 to 4/5.
It is a floor, not a cure. A fact whose anchor and value are farther apart than the overlap still gets dropped. Overlap also grows the index (7 to 9 chunks here).
Measure recall before you reach for a semantic chunker. The cheap fix often closes most of the gap.

Where this actually bites

I scrape review and storefront data for a living. Across 2,190 production runs on 32 published actors (the Trustpilot scraper alone has 962 runs), the unit of knowledge is almost never a paragraph. It is a short phrase: a refund window, an order ID, a tracking number, a review date, a price. Two or three of these per record, each one a tiny anchor-plus-value pair.

That shape is exactly what a fixed-size chunker handles worst. When you slice text every N characters with no overlap, you are putting blind cuts through the document at fixed offsets. A long paragraph survives a cut in the middle, because it has redundancy. A four-word fact does not. If the cut lands between "the refund window is" and "30 days", the fact is gone, split across a boundary that neither half can reconstruct.

To be precise about provenance: the corpus and the five facts in the demo below are synthetic literals I wrote to isolate this one failure mode. I am not replaying a captured customer document, and I am not reporting a specific outage. What is real is the shape of the data (short fact-phrases) and why that shape makes the boundary cut so likely. The numbers in the run are exact and reproducible. The hit-rate on your real traffic is yours to measure, not mine to claim.

Why the best-scoring chunk can be the wrong half

Here is the part that makes this a silent bug instead of a loud one.

Retrieval ranks chunks by how well they match the query. Take the query "express shipping". The chunk that contains the words express and shipping scores highest and gets retrieved. That chunk is real, relevant, and top-ranked. It just happens to end right after "Express shipping", because the 80-character cut landed there. The value, "2 business days", is sitting at the start of the next chunk, which scored lower because it does not contain the query words.

So the retriever does its job perfectly and hands the model a chunk that has the question and not the answer. The model then either hallucinates a number or says it does not know. You blame the model. The model was given a chunk that genuinely did not contain the fact.

This is the trap: the failure looks like a generation problem and is actually a geometry problem at ingest time.

The demo

No imports. No network, no clock, no randomness, no file I/O. It is a toy retriever (substring token matching, not embeddings), but the boundary effect it shows is real and does not depend on the retriever being smart. Save it and run python3 -I chunk_overlap_recall.py. The output is byte-identical every time.

"""
chunk_overlap_recall.py -- why a fixed-size RAG chunker with overlap=0
silently drops facts that straddle a chunk boundary, and what a small
overlap does (and does not) fix.

No imports. No network, no clock, no randomness, no file I/O. The corpus
and the facts below are SYNTHETIC literals, written to expose one
structural failure mode. They are not a captured production document.
Run it twice and the output is byte-identical:

    python3 -I chunk_overlap_recall.py
"""

CHUNK = 80  # fixed window size, in characters

# A tiny support knowledge base. Each fact is one short phrase: an anchor
# (the words a question carries) followed by the value that answers it.
LEAD = "Support knowledge base for the storefront. Read every policy below. "
HERO = "The refund window is 30 days from the delivery date confirmed by carrier. "
BODY = (
    "Express shipping takes 2 business days nationwide for in-stock goods now. "
    "Warranty coverage runs 12 months from the original purchase date on file. "
    "A restocking fee of 15 percent applies to opened or clearly used returns. "
)
# Ceiling fact: the anchor 'customs clearance' and the value '9 dollars'
# sit far more than CHUNK characters apart, so no single 80-char window
# can hold both. Overlap cannot save this one.
CEIL = (
    "Customs clearance for international parcels routed through the distant "
    "regional sorting hub is billed at 9 dollars per kilogram of weight."
)
DOC = LEAD + HERO + BODY + CEIL

# Each fact: the words a user's question carries, and the answer span that
# MUST be inside the retrieved chunk for the answer to actually be there.
FACTS = [
    ("refund window",     "30 days"),
    ("express shipping",  "2 business days"),
    ("warranty coverage", "12 months"),
    ("restocking fee",    "15 percent"),
    ("customs clearance", "9 dollars"),
]


def split(text, size, step):
    """Fixed-size character chunker. step = size - overlap."""
    out, i = [], 0
    while i < len(text):
        out.append(text[i:i + size])
        i += step
    return out


def top1(chunks, query):
    """Return the chunk holding the most query tokens as substrings.
    Strict > keeps the tie-break deterministic: the lowest index wins."""
    tokens = query.lower().split()
    best_i, best_score = 0, -1
    for i, ch in enumerate(chunks):
        score = sum(1 for t in tokens if t in ch.lower())
        if score > best_score:
            best_score, best_i = score, i
    return chunks[best_i]


def evaluate(overlap):
    chunks = split(DOC, CHUNK, CHUNK - overlap)
    recall, rows = 0, []
    for query, answer in FACTS:
        chunk = top1(chunks, query)
        hit = answer.lower() in chunk.lower()
        recall += hit
        rows.append((query, answer, hit))
    return len(chunks), recall, rows


n0, rec0, rows0 = evaluate(0)
n1, rec1, rows1 = evaluate(24)

print(f"corpus: {len(DOC)} chars, chunk={CHUNK}")
print(f"NAIVE  overlap=0   -> {n0} chunks, recall@1 = {rec0}/{len(FACTS)}")
for query, answer, hit in rows0:
    print(f"   {'HIT ' if hit else 'MISS'}  {query!r:20} need {answer!r}")
print(f"FIX    overlap=24  -> {n1} chunks, recall@1 = {rec1}/{len(FACTS)}")
for query, answer, hit in rows1:
    print(f"   {'HIT ' if hit else 'MISS'}  {query!r:20} need {answer!r}")
print()
print("CEILING (printed, not hidden):")
print("  - overlap only recovers facts whose anchor->value gap < overlap (24 chars).")
print("  - 'customs clearance' -> '9 dollars' gap > overlap: still MISS with overlap.")
print(f"  - cost of overlap: index grew {n0} -> {n1} chunks (duplicated text, more tokens).")
print("  - recall@1 here is on a synthetic corpus; real hit-rate is not claimed.")

# The headline claim, locked as an assertion: overlap=0 recalls 1 of 5,
# overlap=24 recovers the straddled facts to 4 of 5, customs stays MISS.
assert rec0 == 1, rec0
assert rec1 == 4, rec1

Reading the output

Five facts. With overlap=0, four of them straddle an 80-character boundary and only "restocking fee" lands whole inside one window. Recall is 1 of 5. That single hit matters: it is the honest control. Not everything breaks. Some facts happen to fall cleanly inside a window, which is why this bug hides so well in a demo where you only test the lucky ones.

Switch to overlap=24 and the step shrinks from 80 to 56 characters. Now consecutive windows share 24 characters of tail, so a fact split at a boundary appears whole in at least one of the overlapping windows. Three of the four straddled facts come back. Recall climbs to 4 of 5. No new library, no embeddings, no semantic splitter. One integer.

That is the contrarian part, and I will admit I learned it the slow way. The first time retrieval missed a fact like this, I reached for a semantic chunker, the kind that splits on meaning boundaries. It is a fine tool. It was also the wrong first move. The cheap fix was 24 characters of overlap plus the discipline to measure recall before changing anything. Most of my gap closed for the price of one parameter.

Why overlap is a floor, not a cure

The fifth fact is the whole point of being honest here. Look at "customs clearance" needing "9 dollars". It is MISS with overlap=0 and still MISS with overlap=24. The script prints why, instead of hiding it:

The anchor ("customs clearance") and the value ("9 dollars") in that sentence are separated by far more than 24 characters of filler. No 80-character window holds both, and a 24-character overlap cannot bridge a gap that wide. Overlap recovers facts whose anchor-to-value distance is smaller than the overlap. Beyond that, you genuinely need something else: a larger window, a larger overlap, a structural split that keeps a fact's anchor and value in the same chunk, or yes, a smarter chunker. Overlap raises the floor. It does not remove the ceiling.

So this post is not "overlap fixes RAG". It is narrower and more defensible: overlap=0 drops boundary facts, and a small overlap returns the ones whose pieces are close. Measure first, then decide whether the remaining misses justify a heavier tool.

What overlap costs

Overlap is not free, and the run says so. The index grew from 7 chunks to 9 to cover the same 502 characters. That is duplicated text, which means more vectors to store, more candidates to score, and more tokens stuffed into the model's context at retrieval time. Push the overlap higher and that cost grows with it. There is a real trade-off between recall and index bloat, and "just crank the overlap to 50%" is a bill, not a strategy. Pick the smallest overlap that recovers the facts you care about, and confirm it with a recall number you trust.

This is not the grounding problem

If you have read my earlier post on a RAG system that answered confidently with facts not in the source, this looks adjacent but it is the opposite end of the pipe. That post was about post-generation grounding: the model already answered, and you check whether the answer is actually supported by the retrieved text. This is pre-generation recall: the fact never reached the retriever in the first place, because chunking split it before anything was ranked. One catches a wrong answer after the fact. The other prevents the right fact from ever being available to answer with. You want both checks, and they live at different stages.

What the research says

I am not the first to argue chunking is underrated as a failure source. A May 2026 paper, "Chunking Methods on Retrieval-Augmented Generation - Effectiveness Evaluation Against Computational Cost and Limitations" (arXiv 2606.00881), puts it plainly in the abstract: "While chunking is commonly treated as a simple preprocessing step, we show that it introduces a range of impactful and often overlooked issues." The same authors note that the growing list of proposed chunking methods "often claim improved performance ... with limited evidence of their effectiveness across diverse scenarios." I will not quote numbers from that paper that I have not verified line by line, so take it for the framing, not a benchmark: chunking is not a harmless preprocessing step, and a fancier method is not automatically a better one. That matches what the demo shows from the cheap end.

What I would actually ship

Order of operations, in the order that has saved me the most time:

Build a recall test before touching the chunker. A handful of question-and-answer-span pairs that you know are in the corpus. Retrieve top-k, check whether the answer span is present. That is the whole measurement, and it is the thing most RAG setups never have.
Add overlap and re-measure. Start small. On this corpus 24 characters on an 80-character window did the job. The right value depends on how far your anchors sit from their values, which your recall test will tell you.
Watch the index size. Overlap duplicates text. If your chunk count balloons, you are paying in storage and per-query tokens. Trade deliberately.
Only then consider a semantic or structural splitter, for the facts overlap cannot reach (the customs case). Buy the heavier tool against a measured gap, not a hunch.

The reason to do it in that order is boring and correct: the cheap fix closes most of the gap, and you cannot tell how much is left until you measure. A semantic chunker bought before the recall test is a guess with a bigger bill.

One honest caveat on the demo's retriever: I used substring token matching so the script needs zero dependencies and stays byte-reproducible. Real embedding retrieval scores differently, and the exact recall numbers will move. The boundary effect itself does not. A fact split across two chunks is split regardless of how you rank them, because no single chunk holds it. That is the part that survives the toy.

Written with AI assistance and human review. The corpus, the five facts, and the customs ceiling case are synthetic literals chosen to isolate the boundary effect; they are not a captured production document or a reported incident. The code is real, runs with the standard library only, and produces the output shown above deterministically. The production context (2,190 runs across 32 actors, 962 on the Trustpilot scraper) is real and explains why short fact-phrases are the common shape; it is not a measured hit-rate.

Follow for more numbers from production runs. If you run RAG over short fact-phrases, what overlap value did your recall test settle on, and which facts did overlap fail to save?

DEV Community