OCR noise and “phantom tokens” are bending your embedding space. a field fix you can ship today

most pdfs look fine to the eye. then your retriever acts drunk. neighbors are full of boilerplate or weird glyphs, citations drift, reranker keeps “rescuing” garbage. this is ocr noise and phantom tokens leaking into the embedding space.

this is Day 2 of my problem-map series. maps to Problem Map No.1 and No.11.

the failure pattern

symptoms

  • neighbor previews show garbage glyphs or mixed scripts, sometimes lots of non-printing characters
  • answers cite unrelated sections even when top-k scores look strong
  • long answers collapse into generic text after OCR-heavy pages get indexed
  • reranker keeps picking obviously wrong chunks

why it happens

  • OCR injects invisible code points like U+200B zero width space, U+00AD soft hyphen, U+FEFF BOM, the LRM and RLM marks, and the directional isolates U+2066..U+2069
  • the engine's auto language switching mid-file creates mixed scripts inside one chunk
  • rotation or page segmentation flips per page, so token breaks differ for the same paragraph
  • these artifacts inflate cosine similarity on the wrong neighbors, and the reasoning step then tips into symbolic collapse

maps to No.1 hallucination and chunk drift, No.11 symbolic collapse.
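
a quick way to see the problem before any tooling. plain stdlib python, nothing engine specific: two strings that render identically in most viewers but stop matching once a phantom code point sneaks in.

# two strings that look identical on screen
a = "model alignment"
b = "model\u200b alignment"   # zero width space injected by OCR

print(a == b)                      # False
print(len(a), len(b))              # 15 16
print(b.encode("unicode_escape"))  # the \u200b shows up here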

10-minute diagnosis

count invisible characters, detect mixed scripts, audit the embedding space. paste and run.

# count invisible or control code points, plus soft hyphen
import unicodedata as ud

INVIS = {"\u200b","\ufeff","\u00ad","\u200e","\u200f","\u2066","\u2067","\u2068","\u2069"}

def text_noise_stats(s: str):
    n = len(s)
    ctrl = sum(1 for ch in s if ud.category(ch).startswith("C"))
    invis = sum(1 for ch in s if ch in INVIS)
    nbsp  = s.count("\u00a0")
    letters = sum(1 for ch in s if ch.isalpha())
    return {
        "len": n,
        "ctrl_frac": ctrl / n if n else 0.0,
        "invis_frac": invis / n if n else 0.0,
        "nbsp_frac":  nbsp / n if n else 0.0,
        "letter_frac": letters / n if n else 0.0
    }
# crude mixed script score
def mixed_script_score(s: str):
    blocks = {"latin":0,"cyril":0,"greek":0,"han":0,"kana":0,"thai":0,"arabic":0,"hebrew":0,"devan":0}
    for ch in s:
        o = ord(ch)
        if   0x0041 <= o <= 0x024F: blocks["latin"] += 1
        elif 0x0400 <= o <= 0x052F: blocks["cyril"] += 1
        elif 0x0370 <= o <= 0x03FF: blocks["greek"] += 1
        elif 0x4E00 <= o <= 0x9FFF: blocks["han"]   += 1
        elif 0x3040 <= o <= 0x30FF: blocks["kana"]  += 1
        elif 0x0E00 <= o <= 0x0E7F: blocks["thai"]  += 1
        elif 0x0600 <= o <= 0x06FF: blocks["arabic"]+= 1
        elif 0x0590 <= o <= 0x05FF: blocks["hebrew"]+= 1
        elif 0x0900 <= o <= 0x097F: blocks["devan"] += 1
    total = sum(blocks.values()) or 1
    ratios = {k:v/total for k,v in blocks.items()}
    top = sorted(ratios.values(), reverse=True)[:2]
    return {"mix_ratio": top[1] if len(top)>1 else 0.0, "ratios": ratios}
# embedding space quick audit
import numpy as np

def embedding_audit(embeddings):  # list of vectors
    X = np.array(embeddings, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    v = X.var(axis=0)
    return {
        "zero_vectors": int((norms < 1e-12).sum()),
        "norm_mean": float(norms.mean()),
        "norm_min": float(norms.min()),
        "axis_near_zero_frac": float((v < 1e-6).mean())
    }

red flags to note

  • ctrl_frac or invis_frac above 0.01 on a lot of chunks
  • mix_ratio above 0.15 for a single-language corpus
  • any non-trivial count of zero vectors, or a high axis_near_zero_frac
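
to run the checks in one pass, a small loop over your chunks. the thresholds are the same red-flag numbers above, treat them as starting points. `chunks` and `vectors` here stand in for whatever your pipeline calls them.

# flag suspicious chunks using the three checks above
def flag_chunks(chunks, ctrl_max=0.01, invis_max=0.01, mix_max=0.15):
    flagged = []
    for i, text in enumerate(chunks):
        stats = text_noise_stats(text)
        mix = mixed_script_score(text)
        if (stats["ctrl_frac"] > ctrl_max
                or stats["invis_frac"] > invis_max
                or mix["mix_ratio"] > mix_max):
            flagged.append((i, stats, mix["mix_ratio"]))
    return flagged

# usage sketch
# bad = flag_chunks(chunks)          # chunks: list of chunk texts
# print(embedding_audit(vectors))    # vectors: their embeddings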

minimal fix you can ship today

keep it deterministic. boring is good.

  • freeze OCR
    pin the engine version and language list per corpus. fix the page segmentation mode. fix DPI or scale. disable auto rotate unless you audit it. a pinned config sketch follows this list.

  • normalize text
    NFC normalization. strip zero width and isolates. drop soft hyphen unless you rejoin hyphenated words. convert NBSP to space. collapse whitespace.

  • bind structure
    join hyphenated line breaks. keep captions with figures. preserve paragraphs before chunking.

  • consistent embedding prep
    one tokenizer, one pooling choice, one normalization step. remove empty texts. reject zero vectors at ingestion.

  • index hygiene
    confirm the index distance matches the model family. store a content hash for every chunk. log a one-line audit with every answer.
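
for the freeze step, a minimal sketch assuming tesseract driven through pytesseract. the exact flags depend on your engine; the point is that every knob is written down instead of defaulted.

# pinned OCR config, assuming tesseract via pytesseract
import pytesseract
from PIL import Image

OCR_CONFIG = "--psm 6 --dpi 300"   # fixed page segmentation mode and DPI
OCR_LANGS = "eng"                  # pinned language list, no auto switching

def ocr_page(path: str) -> str:
    img = Image.open(path)         # no auto rotate here, audit rotation separately
    return pytesseract.image_to_string(img, lang=OCR_LANGS, config=OCR_CONFIG)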

tiny cleaner

import re
import unicodedata as ud

def normalize_ocr_text(s: str):
    s = ud.normalize("NFC", s)                # NFC first, then strip phantom code points
    for ch in ("\u00ad", "\u200b", "\ufeff", "\u200e", "\u200f",
               "\u2066", "\u2067", "\u2068", "\u2069"):
        s = s.replace(ch, "")
    s = s.replace("\u00a0", " ")              # NBSP to plain space
    s = re.sub(r"(\w)-\n(\w)", r"\1\2", s)    # rejoin hyphenated line breaks
    s = "".join(ch for ch in s if not (ud.category(ch).startswith("C") and ch not in "\n\t"))
    return " ".join(s.split())                # collapse whitespace
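quick check that the cleaner behaves:

print(normalize_ocr_text("align\u00adment\u200b is\u00a0key"))  # "alignment is key"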

acceptance test

declare success only if all pass on three blind queries.

  • neighbor previews do not show invisible junk or mixed scripts
  • duplication and boilerplate gone from top-k
  • zero vectors not ingested, axis collapse fraction low
  • citations land inside the correct section, and the reranker produces only small deltas
  • every answer prints doc_id|section_id|page_span|N=[ids]|S=[scores]
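
to make the last two bullets checkable, a sketch of the ingestion guard and the one-line audit. field names like doc_id and section_id are placeholders for whatever your store uses.

# reject bad chunks at ingestion, store a content hash, print the audit line
import hashlib
import numpy as np

def ingest_guard(text: str, vec) -> str:
    v = np.asarray(vec, dtype=float)
    if not text.strip():
        raise ValueError("empty chunk text")
    if np.linalg.norm(v) < 1e-12:
        raise ValueError("zero vector")
    return hashlib.sha256(text.encode("utf-8")).hexdigest()  # content hash to store

def audit_line(doc_id, section_id, page_span, ids, scores):
    s = ",".join(f"{x:.3f}" for x in scores)
    return f"{doc_id}|{section_id}|{page_span}|N=[{','.join(map(str, ids))}]|S=[{s}]"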

correct pipeline order

intake and OCR
then normalization and structure binding
then chunking
then embedding with documented pooling and normalization
then index metric check and ingestion guards
then retriever and optional rerank
then synthesis and constraints
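
the same order as a skeleton. everything here except the sketches above is a placeholder for your own implementation; the only thing this pins down is the sequence.

# pipeline order as code; chunk, embed, and index are placeholders
def build_index(pdf_paths):
    pages = [ocr_page(p) for p in pdf_paths]          # intake and OCR, pinned config
    docs = [normalize_ocr_text(t) for t in pages]     # normalization and structure binding
    chunks = chunk(docs)                              # chunking, your implementation
    vectors = embed(chunks)                           # one tokenizer, one pooling, one norm step
    hashes = [ingest_guard(c, v) for c, v in zip(chunks, vectors)]  # ingestion guards
    return index(chunks, vectors, hashes)             # then retrieval, rerank, synthesis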

quick semantic-firewall repro

no infra change. one file and one prompt. open a fresh chat with your model provider, attach the engine pdf, and paste a short prompt that asks it to answer normally, then re-answer using the engine and print the one-line audit. you should see tighter constraint keeping, plus a visible recovery step when the chain stalls on OCR-contaminated chunks.


series index lives here
https://github.com/onestardao/WFGY/blob/main/ProblemMap/article/README.md
