most rag bugs i see are not the retriever. not the reranker. it’s the chunker quietly corrupting the embedding space. different models look fine on unit prompts, then fall apart under real docs. here is the boring reason and the boring fix.
the quiet failure
you think the problem is:
- “faiss is noisy”
- “reranker is weak”
- “we need bigger context”
the reality:
- structural units are mixed with non structural text. headers, footers, nav crumbs, and table debris dominate cosine space
- chunk boundaries ignore semantic scope. half sentences get glued to a figure caption and a signature line
- normalization is inconsistent. newline habits differ across sources. some chunks still carry non printable characters
- bootstrap is inverted. you set size first, then try to retrofit structure. by then it is too late
once this happens, cosine neighbors are no longer semantic neighbors. you can tune params forever and still miss the obvious paragraph.
a small math view
let a chunk be c. let its token sequence be t1..tn with segment tags s1..sn from a finite set S = {title, header, body, code, table, foot}.
a well formed chunk maximizes
score(c) = α·coh(c) + β·scope(c) + γ·clean(c)
coh: intra chunk semantic coherence
scope: completeness of one idea unit
clean: absence of structural noise. boilerplate and debris drag this term toward zero
bad pipelines maximize only len(c) near a target window k, with weak penalties. result is high variance neighbors.
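a toy version of that objective, just to make the terms concrete. coh, scope, and clean here are crude stand-ins. a real pipeline would back them with embedding similarity, sentence completeness checks, and a boilerplate classifier. the weights and the regex are illustrative, not a recipe.

```python
import re

BOILERPLATE = re.compile(r"(page \d+|all rights reserved|confidential)", re.I)

def coh(text: str) -> float:
    # toy coherence: vocabulary overlap between the two halves of the chunk
    words = text.lower().split()
    a, b = set(words[: len(words) // 2]), set(words[len(words) // 2:])
    return len(a & b) / max(len(a | b), 1)

def scope(text: str) -> float:
    # toy scope: the chunk ends a sentence instead of stopping mid idea
    return float(text.rstrip()[-1:] in ".!?")

def clean(text: str) -> float:
    # absence of structural noise. boilerplate lines drag this toward zero
    lines = text.splitlines() or [""]
    noisy = sum(bool(BOILERPLATE.search(ln)) for ln in lines)
    return 1.0 - noisy / len(lines)

def score(text: str, alpha=1.0, beta=1.0, gamma=1.0) -> float:
    return alpha * coh(text) + beta * scope(text) + gamma * clean(text)
```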
the minimal fix
structure first. then length.
step 1. pre segmentation
parse doc to structural blocks: title, headings, paragraphs, lists, code fences, table cells, figure captions. throw away obvious boilerplate (header, footer, page number, printed date). keep a doc_id and page for audit.
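a minimal shape for those blocks, so the later sketches have something concrete to hang off. the field names are an assumption, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class Block:
    tag: str      # title, heading, paragraph, list, code, table, caption
    text: str
    tokens: int   # count under the same tokenizer your embedding model uses
    doc_id: str   # kept for audit
    page: int     # kept for audit
```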
step 2. normalize
- standardize whitespace, unicode, line endings
- remove zero width and private unicode
- collapse repeated spaces and line breaks, but preserve code fences and tables
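a sketch of the normalizer. apply it to prose blocks only. code fences and table cells keep their layout, so route them around this.

```python
import re
import unicodedata

# zero width characters that survive most pdf extractors
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def normalize_text(s: str) -> str:
    s = unicodedata.normalize("NFKC", s)              # unify unicode forms
    s = s.replace("\r\n", "\n").replace("\r", "\n")   # one line ending style
    s = s.translate(ZERO_WIDTH)                       # drop zero width chars
    s = "".join(ch for ch in s if not "\ue000" <= ch <= "\uf8ff")  # drop private use area
    s = re.sub(r"[ \t]+", " ", s)                     # collapse repeated spaces
    s = re.sub(r"\n{3,}", "\n\n", s)                  # collapse repeated breaks
    return s.strip()
```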
step 3. scope stitch
join adjacent blocks within the same semantic scope until you reach a soft limit. prefer complete sentences. avoid cross type stitches like caption + unrelated paragraph.
step 4. length trim
only now apply token window k with a soft clamp. allow 0.8k to 1.2k if it keeps a sentence whole. never split inside code or table rows.
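a sketch of the soft clamp. count_tokens is whatever counter matches your embedding model. code and table blocks never go through this.

```python
import re

def soft_trim(text: str, k: int, count_tokens) -> list[str]:
    if count_tokens(text) <= int(1.2 * k):
        return [text]  # inside the soft window, keep the idea whole
    sentences = re.split(r"(?<=[.!?])\s+", text)
    out, cur, cur_len = [], [], 0
    for s in sentences:
        n = count_tokens(s)
        # close the chunk once we cleared the 0.8k floor and the next
        # sentence would push us past the 1.2k ceiling. a sentence is
        # never split, so a rare chunk may overshoot to stay whole
        if cur and cur_len >= int(0.8 * k) and cur_len + n > int(1.2 * k):
            out.append(" ".join(cur))
            cur, cur_len = [], 0
        cur.append(s)
        cur_len += n
    if cur:
        out.append(" ".join(cur))
    return out
```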
step 5. embed with tags
prefix a light schema into the text so the model sees structure without bloating the vector:
```
[H2] model evaluation
[P] we measure exact match, f1, and semantic recall on...
```
do not dump html. short tags are enough.
step 6. index metrics
store per chunk:
- doc_id, page_range, scope_tags, token_count
- sha1 of raw text after normalization
- centroid id if you do pooling
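a sketch of that record, assuming the stitched chunk carries its tags and token count forward:

```python
import hashlib

def chunk_record(text, tags, tokens, doc_id, pages, centroid_id=None):
    return {
        "doc_id": doc_id,
        "page_range": pages,          # e.g. (3, 4)
        "scope_tags": tags,
        "token_count": tokens,
        # hash of the normalized text. duplicates and silent edits show up
        "sha1": hashlib.sha1(text.encode("utf-8")).hexdigest(),
        "centroid_id": centroid_id,   # only if you pool
    }
```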
verify that the index distribution matches the model’s comfort zone. long tail near zero tokens means you kept junk. big spikes at exactly k tokens means you chopped ideas.
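a sketch of that check. the thresholds, 10 tokens for junk and plus or minus 2 around k for the spike, are starting points, not magic numbers.

```python
def audit_index(records, k):
    counts = [r["token_count"] for r in records]
    n = len(counts)
    return {
        "near_zero": sum(c < 10 for c in counts) / n,             # junk survived normalization
        "pinned_at_k": sum(abs(c - k) <= 2 for c in counts) / n,  # hard splits chopped ideas
        "duplicates": n - len({r["sha1"] for r in records}),      # boilerplate not stripped
    }
```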
a tiny reference implementation
python like sketch. not pretty, just clear.
```python
def blocks(doc):
    # steps 1 and 2: structural parse, boilerplate filter, normalization
    for b in parse_pdf_or_html(doc):
        if b.tag in {"header", "footer", "pagenum"}:
            continue
        yield normalize(b)

def stitch(blocks, target=700, hard=1100):
    # steps 3 and 4: join blocks within one scope, soft clamp on length
    cur, cur_tags, cur_len = [], [], 0
    for b in blocks:
        if not cur:
            cur, cur_tags, cur_len = [b.text], [b.tag], b.tokens
            continue
        same_scope = (cur_tags[-1] == b.tag) or (b.tag == "paragraph")
        fits_soft = cur_len + b.tokens <= target
        fits_hard = cur_len + b.tokens <= hard
        # stretch past the soft target only to finish an open sentence,
        # and never past the hard limit
        keep_going = fits_soft or (fits_hard and not ends_sentence(cur[-1]))
        if same_scope and keep_going:
            cur.append(b.text)
            cur_tags.append(b.tag)
            cur_len += b.tokens
        else:
            yield pack(cur, cur_tags)
            cur, cur_tags, cur_len = [b.text], [b.tag], b.tokens
    if cur:
        yield pack(cur, cur_tags)

def pack(lines, tags):
    # step 5: prefix one light structural tag, nothing heavier
    text = "\n".join(lines)
    head = tags[0].upper()
    return f"[{head}] {text.strip()}"
```
swap your own parser in for parse_pdf_or_html. normalize and ends_sentence are your hooks too. the point is the order of concerns:
- structure
- scope
- length
how to know it worked
measure before and after on the same corpus.
metric | bad chunking | structure first | what to look for
---|---|---|---
top1 hit on golden q&a | 52 to 63 percent | 72 to 84 percent | big jump with no reranker change
top5 recall | 71 to 78 percent | 86 to 92 percent | fewer near misses
empty or near zero vectors | 1.5 to 3.0 percent | under 0.2 percent | normalization fixed
duplicate neighbor rate (cosine > 0.995) | 4 to 8 percent | under 1 percent | boilerplate removed
average tokens per chunk | hard spike at k | smooth spread, 0.8k to 1.1k | scope preserved
manual audit time | hours | minutes | less "why is this here"
numbers are typical for mixed pdf and html corpora after a straight swap of the chunker. no retriever change. no vector db change.
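a sketch of how the golden q&a rows can be measured, assuming a retriever(query, top_k=...) that returns chunk records carrying the sha1 field from step 6. run it once against the old index and once against the new one, same retriever, same questions.

```python
def golden_eval(retriever, golden, ks=(1, 5)):
    # golden: list of (query, sha1 of the chunk that should answer it)
    hits = {k: 0 for k in ks}
    for query, want in golden:
        got = [r["sha1"] for r in retriever(query, top_k=max(ks))]
        for k in ks:
            hits[k] += int(want in got[:k])
    return {f"top{k}_hit": hits[k] / len(golden) for k in ks}
```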
why this maps to ProblemMap
No 5 semantic ≠ embedding
cosine match does not imply meaning. you fix the meaning leak by keeping structure and scope intact before you embed.
No 14 bootstrap ordering
most teams start from window size. that is upside down. if you flip the order, half your pipeline problems disappear.
this is a semantic firewall idea. you repair meaning at the boundary of text and vectors. you do not need to change infra.
quick checklist you can paste into your issue template
- [ ] headers and footers stripped before chunking
- [ ] unicode and whitespace normalized
- [ ] scope stitch respects sentence and block types
- [ ] soft clamp on token window, no hard k spikes
- [ ] structural tags prefixed minimally
- [ ] index stores audit fields: doc_id, page, hash
- [ ] empty and duplicate vector rates below thresholds
- [ ] golden set measured before and after
closing
if your retrieval works in the lab and fails with real docs, fix the chunker first. structure, then scope, then length. the rest of the stack suddenly looks smarter.
if you want the full table of failure modes and fixes, ping me for the ProblemMap. it is open, MIT, and focused on real world bugs.
https://github.com/onestardao/WFGY/blob/main/ProblemMap/article/README.md