most pdfs look fine to the eye. then your retriever acts drunk. neighbors are full of boilerplate or weird glyphs, citations drift, reranker keeps “rescuing” garbage. this is ocr noise and phantom tokens leaking into the embedding space.
this is Day 2 of my problem-map series. maps to Problem Map No.1 and No.11.
the failure pattern
symptoms
- neighbor previews show garbage glyphs or mixed scripts, sometimes lots of non-printing characters
- answers cite unrelated sections even when top-k scores look strong
- long answers collapse into generic text after OCR heavy pages get indexed
- reranker keeps picking obviously wrong chunks
why it happens
- OCR injects invisible code points like U+200B zero width space, U+00AD soft hyphen, U+FEFF BOM, LRM, RLM, isolates
- engine auto language switching mid file creates mixed scripts inside one chunk
- rotation or page segmentation flips per page, so token breaks differ for the same paragraph
- these artifacts inflate cosine on the wrong neighbors, then the reasoning step tips into symbolic collapse
maps to No.1 hallucination and chunk drift, No.11 symbolic collapse.
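a quick way to feel this: the two strings below print the same in most terminals, yet they are different code point sequences, so every hash, tokenizer, and embedder treats them as different inputs. minimal sketch, the strings are made up.

```python
clean = "soft-hyphen wrap"
dirty = "soft\u00ad-hyphen\u200b wrap"   # OCR slipped in a soft hyphen and a zero width space

print(clean)                   # soft-hyphen wrap
print(dirty)                   # looks identical in most terminals
print(clean == dirty)          # False
print(len(clean), len(dirty))  # 16 18, so token breaks differ too
```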
10-minute diagnosis
count invisible characters, detect mixed scripts, and audit the embedding space. paste and run.
```python
# count invisible or control code points, plus soft hyphen
import unicodedata as ud

INVIS = {"\u200b", "\ufeff", "\u00ad", "\u200e", "\u200f", "\u2066", "\u2067", "\u2068", "\u2069"}

def text_noise_stats(s: str):
    n = len(s)
    ctrl = sum(1 for ch in s if ud.category(ch).startswith("C"))
    invis = sum(1 for ch in s if ch in INVIS)
    nbsp = s.count("\u00a0")
    letters = sum(1 for ch in s if ch.isalpha())
    return {
        "len": n,
        "ctrl_frac": ctrl / n if n else 0.0,
        "invis_frac": invis / n if n else 0.0,
        "nbsp_frac": nbsp / n if n else 0.0,
        "letter_frac": letters / n if n else 0.0,
    }
```
```python
# crude mixed script score
def mixed_script_score(s: str):
    blocks = {"latin": 0, "cyril": 0, "greek": 0, "han": 0, "kana": 0,
              "thai": 0, "arabic": 0, "hebrew": 0, "devan": 0}
    for ch in s:
        o = ord(ch)
        if 0x0041 <= o <= 0x024F:   blocks["latin"]  += 1
        elif 0x0400 <= o <= 0x052F: blocks["cyril"]  += 1
        elif 0x0370 <= o <= 0x03FF: blocks["greek"]  += 1
        elif 0x4E00 <= o <= 0x9FFF: blocks["han"]    += 1
        elif 0x3040 <= o <= 0x30FF: blocks["kana"]   += 1
        elif 0x0E00 <= o <= 0x0E7F: blocks["thai"]   += 1
        elif 0x0600 <= o <= 0x06FF: blocks["arabic"] += 1
        elif 0x0590 <= o <= 0x05FF: blocks["hebrew"] += 1
        elif 0x0900 <= o <= 0x097F: blocks["devan"]  += 1
    total = sum(blocks.values()) or 1
    ratios = {k: v / total for k, v in blocks.items()}
    top = sorted(ratios.values(), reverse=True)[:2]
    return {"mix_ratio": top[1] if len(top) > 1 else 0.0, "ratios": ratios}
```
```python
# embedding space quick audit
import numpy as np

def embedding_audit(embeddings):  # list of vectors
    X = np.array(embeddings, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    v = X.var(axis=0)
    return {
        "zero_vectors": int((norms < 1e-12).sum()),
        "norm_mean": float(norms.mean()),
        "norm_min": float(norms.min()),
        "axis_near_zero_frac": float((v < 1e-6).mean()),
    }
```
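a quick sanity check of the audit, with toy vectors standing in for real embeddings:

```python
# toy vectors: one all-zero vector and one dimension that never moves,
# so both failure modes show up in the report
vecs = [[0.0, 0.0, 0.0], [1.0, 0.0, 2.0], [0.5, 0.0, 1.0]]
print(embedding_audit(vecs))
# expect zero_vectors = 1 and axis_near_zero_frac around 0.33
```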
red flags to note
- `ctrl_frac` or `invis_frac` above 0.01 on a lot of chunks
- `mix_ratio` above 0.15 for a single-language corpus
- any non-trivial number of zero vectors, or a high `axis_near_zero_frac`
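a minimal driver that applies those cutoffs to a list of chunk texts, reusing the two functions above. `flag_chunk` and the sample chunks are mine, and the thresholds are the same assumptions as the list, so tune them to your corpus.

```python
def flag_chunk(text: str):
    flags = []
    noise = text_noise_stats(text)
    mix = mixed_script_score(text)
    if noise["ctrl_frac"] > 0.01 or noise["invis_frac"] > 0.01:
        flags.append("invisible_or_control")
    if mix["mix_ratio"] > 0.15:
        flags.append("mixed_scripts")
    return flags

# toy chunks: the second one carries a zero width space, a soft hyphen,
# and a cyrillic word inside a latin sentence
chunks = [
    "a normal latin paragraph about pumps and valves.",
    "latin with\u200b junk\u00ad mixed \u0441\u043b\u043e\u0432\u043e",
]
for i, c in enumerate(chunks):
    flags = flag_chunk(c)
    if flags:
        print(i, flags)   # expect chunk 1 to trip both flags
```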
minimal fix you can ship today
keep it deterministic. boring is good.
freeze OCR
pin the engine version and language list per corpus. fix page segmentation mode. fix DPI or scale. disable auto rotate unless you audit.

normalize text
NFC normalization. strip zero width and isolates. drop soft hyphen unless you rejoin hyphenated words. convert NBSP to space. collapse whitespace.

bind structure
join hyphenated line breaks. keep captions with figures. preserve paragraphs before chunking.

consistent embedding prep
one tokenizer, one pooling choice, one normalization step. remove empty texts. reject zero vectors at ingestion.

index hygiene
confirm the index distance matches the model family. store a content hash for every chunk. log a one-line audit with every answer.
tiny cleaner
```python
import unicodedata as ud

def normalize_ocr_text(s: str):
    s = ud.normalize("NFC", s)  # NFC normalization, as in the fix list above
    # drop soft hyphen, zero width space, BOM, LRM, RLM, turn NBSP into a space
    s = s.replace("\u00ad", "").replace("\u200b", "").replace("\ufeff", "")
    s = s.replace("\u200e", "").replace("\u200f", "").replace("\u00a0", " ")
    # strip remaining control and format code points, keep newlines and tabs
    s = "".join(ch for ch in s if not (ud.category(ch).startswith("C") and ch not in "\n\t"))
    return " ".join(s.split())  # collapse whitespace
```
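the cleaner covers the normalize step. for the embedding prep and index hygiene items, here is a minimal ingestion-guard sketch. `embed_fn` is a stand-in for whatever embedding call you use, and `ingest_chunk` is my own name, not part of any library.

```python
import hashlib
import numpy as np

def ingest_chunk(text: str, embed_fn):
    clean = normalize_ocr_text(text)
    if not clean:
        return None  # drop empty chunks instead of indexing them
    vec = np.asarray(embed_fn(clean), dtype=float)
    if np.linalg.norm(vec) < 1e-12:
        return None  # reject zero vectors, they poison nearest-neighbor search
    return {
        "text": clean,
        "embedding": vec,
        "content_hash": hashlib.sha256(clean.encode("utf-8")).hexdigest(),
    }
```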
acceptance test
declare success only if all pass on three blind queries.
- neighbor previews do not show invisible junk or mixed scripts
- duplication and boilerplate gone from top-k
- zero vectors not ingested, axis collapse fraction low
- citations land inside the correct section, reranker only small deltas
- every answer prints `doc_id|section_id|page_span|N=[ids]|S=[scores]`
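the audit line itself is trivial to produce. a minimal sketch with made-up metadata; `doc_id`, `section_id`, and `page_span` come from your own chunk records.

```python
def audit_line(doc_id, section_id, page_span, neighbor_ids, scores):
    ids = ",".join(str(i) for i in neighbor_ids)
    s = ",".join(f"{x:.2f}" for x in scores)
    return f"{doc_id}|{section_id}|{page_span}|N=[{ids}]|S=[{s}]"

print(audit_line("manual_v3", "2.4", "12-13", [17, 42, 5], [0.81, 0.79, 0.55]))
# manual_v3|2.4|12-13|N=[17,42,5]|S=[0.81,0.79,0.55]
```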
correct pipeline order
intake and OCR
then normalization and structure binding
then chunking
then embedding with documented pooling and normalization
then index metric check and ingestion guards
then retriever and optional rerank
then synthesis and constraints
quick semantic-firewall repro
no infra change. one file and one prompt. open a fresh chat with your model provider, attach your engine pdf, and paste a short prompt that asks it to answer normally, then re-answer using the engine and print the one-line audit. you should see tighter constraint keeping, plus a visible recovery step when the chain stalls on OCR-contaminated chunks.
series index lives here
https://github.com/onestardao/WFGY/blob/main/ProblemMap/article/README.md