# day 4 — bad chunking ruins retrieval (ProblemMap: No 5 semantic embedding, No 14 bootstrap ordering)

most rag bugs i see are not the retriever. not the reranker. it’s the chunker quietly corrupting the embedding space. different models look fine on unit prompts, then fall apart under real docs. here is the boring reason and the boring fix.

## the quiet failure

you think the problem is:

  • “faiss is noisy”
  • “reranker is weak”
  • “we need bigger context”

the reality:

  • structural units are mixed with non structural text. headers, footers, nav crumbs, and table debris dominate cosine space
  • chunk boundaries ignore semantic scope. half sentences get glued to a figure caption and a signature line
  • normalization is inconsistent. newline habits differ across sources. some chunks still carry non printable characters
  • bootstrap is inverted. you set size first, then try to retrofit structure. by then it is too late

once this happens, cosine neighbors are no longer semantic neighbors. you can tune params forever and still miss the obvious paragraph.

## a small math view

let a chunk be c. let its token sequence be t1..tn with segment tags s1..sn from a finite set S = {title, header, body, code, table, foot}.
a well formed chunk maximizes

score(c) = α·coh(c) + β·scope(c) + γ·clean(c)

coh: intra chunk semantic coherence
scope: completeness of one idea unit
clean: structural cleanliness. boilerplate and debris drag it down, so the term acts as a noise penalty

bad pipelines maximize only len(c) near a target window k, with weak penalties. result is high variance neighbors.
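
to make that concrete, here is a toy scorer. everything in it is invented for this post: the weights are arbitrary, clean and scope are crude heuristics, and coh is left to your embedding model (for example, mean pairwise cosine similarity between sentence vectors inside the chunk).

```python
import re

def clean(text):
    # fraction of lines that are not obvious boilerplate. the patterns are
    # placeholders; build the real list from your own corpus
    lines = [l for l in text.splitlines() if l.strip()]
    junk = sum(1 for l in lines
               if re.search(r"page \d+|all rights reserved|cookie", l, re.I))
    return 1 - junk / max(len(lines), 1)

def scope(text):
    # crude proxy for one complete idea unit: starts cleanly, ends a sentence
    t = text.strip()
    return 1.0 if t and t[0].isalnum() and t.endswith((".", "!", "?")) else 0.0

def score(text, coh, alpha=1.0, beta=0.5, gamma=0.5):
    # coh comes from your embedding model; weights are arbitrary defaults
    return alpha * coh + beta * scope(text) + gamma * clean(text)
```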

## the minimal fix

structure first. then length.

step 1. pre segmentation
parse the doc into structural blocks: title, headings, paragraphs, lists, code fences, table cells, figure captions. throw away obvious boilerplate (header, footer, page number, printed date). keep a doc_id and page for audit.
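
a minimal shape for those blocks. the field names are mine, chosen to match what the reference implementation further down expects (tag, text, tokens):

```python
from dataclasses import dataclass

BOILERPLATE = {"header", "footer", "pagenum"}

@dataclass
class Block:
    tag: str      # title, heading, paragraph, list, code, table, caption...
    text: str     # raw block text, normalized in step 2
    tokens: int   # counted with the same tokenizer your embedding model uses
    doc_id: str   # audit field
    page: int     # audit field

def keep(b: Block) -> bool:
    # drop boilerplate here, before any stitching or embedding sees it
    return b.tag not in BOILERPLATE
```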

step 2. normalize

  • standardize whitespace, unicode, line endings
  • remove zero width and private unicode
  • collapse repeated spaces and line breaks, but preserve code fences and tables
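
a minimal normalizer in that spirit. the zero width and private use ranges below cover the common offenders, not every possibility:

```python
import re
import unicodedata

ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
PRIVATE_USE = re.compile(r"[\ue000-\uf8ff]")

def normalize_text(text, preserve_layout=False):
    text = unicodedata.normalize("NFKC", text)             # one unicode form
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # one newline habit
    text = ZERO_WIDTH.sub("", text)
    text = PRIVATE_USE.sub("", text)
    if not preserve_layout:  # pass True for code fences and tables
        text = re.sub(r"[ \t]+", " ", text)                # collapse spaces
        text = re.sub(r"\n{3,}", "\n\n", text)             # collapse breaks
    return text.strip("\n")
```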

step 3. scope stitch
join adjacent blocks within the same semantic scope until you reach a soft limit. prefer complete sentences. avoid cross type stitches like caption + unrelated paragraph.

step 4. length trim
only now apply token window k with a soft clamp. allow 0.8k to 1.2k if it keeps a sentence whole. never split inside code or table rows.
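
a sketch of the soft clamp. the sentence splitter and token count are crude stand-ins for your real tokenizer, and code or table blocks should bypass this function entirely:

```python
import re

def soft_clamp(text, k=700):
    # aim for ~k tokens per piece, stretch to 1.2k to finish a sentence.
    # the last piece can run short; merge it back if it falls under 0.8k
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pieces, cur, cur_len = [], [], 0
    for s in sentences:
        n = len(s.split())  # stand-in token count
        if cur and cur_len + n > 1.2 * k:
            pieces.append(" ".join(cur))
            cur, cur_len = [], 0
        cur.append(s)
        cur_len += n
    if cur:
        pieces.append(" ".join(cur))
    return pieces
```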

step 5. embed with tags
prefix a light schema into the text so the model sees structure without bloating the vector:

```
[H2] model evaluation
[P] we measure exact match, f1, and semantic recall on...
```

do not dump html. short tags are enough.

step 6. index metrics
store per chunk:

  • doc_id, page_range, scope_tags, token_count
  • sha1 of raw text after normalization
  • centroid id if you do pooling

verify that the index distribution matches the model’s comfort zone. long tail near zero tokens means you kept junk. big spikes at exactly k tokens means you chopped ideas.
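
a quick way to check both failure shapes at once. the thresholds are arbitrary starting points, not gospel:

```python
def audit_lengths(token_counts, k=700):
    total = len(token_counts)
    near_zero = sum(1 for n in token_counts if n < 20)
    pinned = sum(1 for n in token_counts if abs(n - k) <= 2)
    print(f"chunks: {total}")
    print(f"near zero: {near_zero / total:.1%}   # junk survived normalization?")
    print(f"pinned at k: {pinned / total:.1%}   # hard chopping mid idea?")
```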

## a tiny reference implementation

python-like sketch. not pretty, just clear.

```python
def ends_sentence(s):
    # crude boundary check; swap in something smarter if you have it
    return s.rstrip().endswith((".", "!", "?"))

def blocks(doc):
    # parse_pdf_or_html and normalize are yours, see steps 1 and 2
    for b in parse_pdf_or_html(doc):
        if b.tag in {"header", "footer", "pagenum"}:
            continue  # drop boilerplate before it can pollute a chunk
        yield normalize(b)

def stitch(blocks, target=700, hard=1100):
    cur, cur_tags = [], []
    cur_len = 0
    for b in blocks:
        if not cur:
            cur.append(b.text); cur_tags.append(b.tag); cur_len = b.tokens
            continue
        same_scope = (cur_tags[-1] == b.tag) or (b.tag == "paragraph")
        fits_soft = cur_len + b.tokens <= target
        fits_hard = cur_len + b.tokens <= hard
        # stitch while under the soft target, or past it but mid sentence,
        # as long as the hard limit holds. a finished sentence at the soft
        # limit is where we cut
        if same_scope and fits_hard and (fits_soft or not ends_sentence(cur[-1])):
            cur.append(b.text); cur_tags.append(b.tag); cur_len += b.tokens
        else:
            yield pack(cur, cur_tags)
            cur, cur_tags, cur_len = [b.text], [b.tag], b.tokens
    if cur:
        yield pack(cur, cur_tags)

def pack(lines, tags):
    # prefix the leading structural tag so the embedder sees it (step 5)
    text = "\n".join(lines)
    head = tags[0].upper()
    return f"[{head}] {text.strip()}"
```

swap your own parser in parse_pdf_or_html. the point is the order of concerns:

  1. structure
  2. scope
  3. length
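
wired together it looks like this. embed_batch, index, and audit_fields are placeholders for whatever your stack provides; the point is only that chunking finishes before embedding starts:

```python
chunks = list(stitch(blocks(doc), target=700, hard=1100))
vectors = embed_batch(chunks)  # your embedding call
for chunk, vec in zip(chunks, vectors):
    index.add(vec, metadata=audit_fields(chunk))  # doc_id, page, sha1 from step 6
```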

## how to know it worked

measure before and after on the same corpus.

| metric | bad chunking | structure first | what to look for |
| --- | --- | --- | --- |
| top1 hit on golden q&a | 52 to 63 percent | 72 to 84 percent | big jump without reranker change |
| top5 recall | 71 to 78 percent | 86 to 92 percent | fewer near misses |
| empty or near zero vectors | 1.5 to 3.0 percent | under 0.2 percent | normalization fixed |
| duplicate neighbor rate (cosine > 0.995) | 4 to 8 percent | under 1 percent | boilerplate removed |
| average tokens per chunk | hard spike at k | smooth 0.8k to 1.1k | scope preserved |
| manual audit time | hours | minutes | less “why is this here” |

numbers are typical for mixed pdf and html corpora after a straight swap of the chunker. no retriever change. no vector db change.
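
the before and after numbers come from a harness like this, run once per chunker with everything else frozen. retrieve is whatever your stack exposes:

```python
def golden_eval(golden, retrieve, k=5):
    # golden: list of (question, expected_doc_id) pairs
    top1 = topk = 0
    for question, expected in golden:
        ranked = retrieve(question, k=k)  # ranked list of doc_ids
        top1 += ranked[:1] == [expected]
        topk += expected in ranked
    n = len(golden)
    return top1 / n, topk / n
```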

## why this maps to ProblemMap

  • No 5 semantic ≠ embedding
    cosine match does not imply meaning. you fix the meaning leak by keeping structure and scope intact before you embed.

  • No 14 bootstrap ordering
    most teams start from window size. that is upside down. if you flip the order, half your pipeline problems disappear.

this is a semantic firewall idea. you repair meaning at the boundary of text and vectors. you do not need to change infra.

## quick checklist you can paste into your issue template

```
- [ ] headers and footers stripped before chunking
- [ ] unicode and whitespace normalized
- [ ] scope stitch respects sentence and block types
- [ ] soft clamp on token window, no hard k spikes
- [ ] structural tags prefixed minimally
- [ ] index stores audit fields: doc_id, page, hash
- [ ] empty and duplicate vector rates below thresholds
- [ ] golden set measured before and after
```

## closing

if your retrieval works in the lab and fails with real docs, fix the chunker first. structure, then scope, then length. the rest of the stack suddenly looks smarter.

if you want the full table of failure modes and fixes, ping me for the ProblemMap. it is open, MIT, and focused on real world bugs.

https://github.com/onestardao/WFGY/blob/main/ProblemMap/article/README.md
