OCR noise and “phantom tokens” are bending your embedding space. a field fix you can ship today

most pdfs look fine to the eye. then your retriever acts drunk. neighbors are full of boilerplate or weird glyphs, citations drift, reranker keeps “rescuing” garbage. this is ocr noise and phantom tokens leaking into the embedding space.

this is Day 2 of my problem-map series. maps to Problem Map No.1 and No.11.

the failure pattern

symptoms

  • neighbor previews show garbage glyphs or mixed scripts, sometimes lots of non-printing characters
  • answers cite unrelated sections even when top-k scores look strong
  • long answers collapse into generic text after OCR-heavy pages get indexed
  • reranker keeps picking obviously wrong chunks

why it happens

  • OCR injects invisible code points like U+200B zero width space, U+00AD soft hyphen, U+FEFF BOM, the LRM and RLM marks, and the directional isolates U+2066..U+2069
  • the engine's auto language switching mid-file creates mixed scripts inside one chunk
  • rotation or page segmentation flips per page, so token breaks differ for the same paragraph
  • these artifacts inflate cosine similarity on the wrong neighbors, and the reasoning step then tips into symbolic collapse

maps to No.1 hallucination and chunk drift, No.11 symbolic collapse.
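
a quick way to see the problem before any tooling. plain stdlib python, nothing engine specific: two strings that render identically in most viewers but stop matching once a phantom code point sneaks in.

# two strings that look identical on screen
a = "model alignment"
b = "model\u200b alignment"   # zero width space injected by OCR

print(a == b)                      # False
print(len(a), len(b))              # 15 16
print(b.encode("unicode_escape"))  # the \u200b shows up here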

10-minute diagnosis

count invisible characters, detect mixed scripts, audit the embedding space. paste and run.

# count invisible or control code points, plus soft hyphen
import unicodedata as ud

INVIS = {"\u200b","\ufeff","\u00ad","\u200e","\u200f","\u2066","\u2067","\u2068","\u2069"}

def text_noise_stats(s: str):
    n = len(s)
    ctrl = sum(1 for ch in s if ud.category(ch).startswith("C"))
    invis = sum(1 for ch in s if ch in INVIS)
    nbsp  = s.count("\u00a0")
    letters = sum(1 for ch in s if ch.isalpha())
    return {
        "len": n,
        "ctrl_frac": ctrl / n if n else 0.0,
        "invis_frac": invis / n if n else 0.0,
        "nbsp_frac":  nbsp / n if n else 0.0,
        "letter_frac": letters / n if n else 0.0
    }
# crude mixed script score
def mixed_script_score(s: str):
    blocks = {"latin":0,"cyril":0,"greek":0,"han":0,"kana":0,"thai":0,"arabic":0,"hebrew":0,"devan":0}
    for ch in s:
        o = ord(ch)
        if   0x0041 <= o <= 0x024F: blocks["latin"] += 1
        elif 0x0400 <= o <= 0x052F: blocks["cyril"] += 1
        elif 0x0370 <= o <= 0x03FF: blocks["greek"] += 1
        elif 0x4E00 <= o <= 0x9FFF: blocks["han"]   += 1
        elif 0x3040 <= o <= 0x30FF: blocks["kana"]  += 1
        elif 0x0E00 <= o <= 0x0E7F: blocks["thai"]  += 1
        elif 0x0600 <= o <= 0x06FF: blocks["arabic"]+= 1
        elif 0x0590 <= o <= 0x05FF: blocks["hebrew"]+= 1
        elif 0x0900 <= o <= 0x097F: blocks["devan"] += 1
    total = sum(blocks.values()) or 1
    ratios = {k:v/total for k,v in blocks.items()}
    top = sorted(ratios.values(), reverse=True)[:2]
    return {"mix_ratio": top[1] if len(top)>1 else 0.0, "ratios": ratios}
# embedding space quick audit
import numpy as np

def embedding_audit(embeddings):  # list of vectors
    X = np.array(embeddings, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    v = X.var(axis=0)
    return {
        "zero_vectors": int((norms < 1e-12).sum()),
        "norm_mean": float(norms.mean()),
        "norm_min": float(norms.min()),
        "axis_near_zero_frac": float((v < 1e-6).mean())
    }

red flags to note

  • ctrl_frac or invis_frac above 0.01 on a lot of chunks
  • mix_ratio above 0.15 for a single-language corpus
  • any non-trivial count of zero vectors, or a high axis_near_zero_frac
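
to run the checks in one pass, a small loop over your chunks. the thresholds are the same red-flag numbers above, treat them as starting points. `chunks` and `vectors` here stand in for whatever your pipeline calls them.

# flag suspicious chunks using the three checks above
def flag_chunks(chunks, ctrl_max=0.01, invis_max=0.01, mix_max=0.15):
    flagged = []
    for i, text in enumerate(chunks):
        stats = text_noise_stats(text)
        mix = mixed_script_score(text)
        if (stats["ctrl_frac"] > ctrl_max
                or stats["invis_frac"] > invis_max
                or mix["mix_ratio"] > mix_max):
            flagged.append((i, stats, mix["mix_ratio"]))
    return flagged

# usage sketch
# bad = flag_chunks(chunks)          # chunks: list of chunk texts
# print(embedding_audit(vectors))    # vectors: their embeddings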

minimal fix you can ship today

keep it deterministic. boring is good.

  • freeze OCR
    pin the engine version and language list per corpus. fix the page segmentation mode. fix DPI or scale. disable auto rotate unless you audit it. a pinned config sketch follows this list.

  • normalize text
    NFC normalization. strip zero width and isolates. drop soft hyphen unless you rejoin hyphenated words. convert NBSP to space. collapse whitespace.

  • bind structure
    join hyphenated line breaks. keep captions with figures. preserve paragraphs before chunking.

  • consistent embedding prep
    one tokenizer, one pooling choice, one normalization step. remove empty texts. reject zero vectors at ingestion.

  • index hygiene
    confirm the index distance matches the model family. store a content hash for every chunk. log a one-line audit with every answer.
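
for the freeze step, a minimal sketch assuming tesseract driven through pytesseract. the exact flags depend on your engine; the point is that every knob is written down instead of defaulted.

# pinned OCR config, assuming tesseract via pytesseract
import pytesseract
from PIL import Image

OCR_CONFIG = "--psm 6 --dpi 300"   # fixed page segmentation mode and DPI
OCR_LANGS = "eng"                  # pinned language list, no auto switching

def ocr_page(path: str) -> str:
    img = Image.open(path)         # no auto rotate here, audit rotation separately
    return pytesseract.image_to_string(img, lang=OCR_LANGS, config=OCR_CONFIG)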

tiny cleaner

import re
import unicodedata as ud

def normalize_ocr_text(s: str):
    s = ud.normalize("NFC", s)                # NFC first, then strip phantom code points
    for ch in ("\u00ad", "\u200b", "\ufeff", "\u200e", "\u200f",
               "\u2066", "\u2067", "\u2068", "\u2069"):
        s = s.replace(ch, "")
    s = s.replace("\u00a0", " ")              # NBSP to plain space
    s = re.sub(r"(\w)-\n(\w)", r"\1\2", s)    # rejoin hyphenated line breaks
    s = "".join(ch for ch in s if not (ud.category(ch).startswith("C") and ch not in "\n\t"))
    return " ".join(s.split())                # collapse whitespace
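quick check that the cleaner behaves:

print(normalize_ocr_text("align\u00adment\u200b is\u00a0key"))  # "alignment is key"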

acceptance test

declare success only if all pass on three blind queries.

  • neighbor previews do not show invisible junk or mixed scripts
  • duplication and boilerplate gone from top-k
  • zero vectors not ingested, axis collapse fraction low
  • citations land inside the correct section, and the reranker produces only small deltas
  • every answer prints doc_id|section_id|page_span|N=[ids]|S=[scores]
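
to make the last two bullets checkable, a sketch of the ingestion guard and the one-line audit. field names like doc_id and section_id are placeholders for whatever your store uses.

# reject bad chunks at ingestion, store a content hash, print the audit line
import hashlib
import numpy as np

def ingest_guard(text: str, vec) -> str:
    v = np.asarray(vec, dtype=float)
    if not text.strip():
        raise ValueError("empty chunk text")
    if np.linalg.norm(v) < 1e-12:
        raise ValueError("zero vector")
    return hashlib.sha256(text.encode("utf-8")).hexdigest()  # content hash to store

def audit_line(doc_id, section_id, page_span, ids, scores):
    s = ",".join(f"{x:.3f}" for x in scores)
    return f"{doc_id}|{section_id}|{page_span}|N=[{','.join(map(str, ids))}]|S=[{s}]"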

correct pipeline order

intake and OCR
then normalization and structure binding
then chunking
then embedding with documented pooling and normalization
then index metric check and ingestion guards
then retriever and optional rerank
then synthesis and constraints
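
the same order as a skeleton. everything here except the sketches above is a placeholder for your own implementation; the only thing this pins down is the sequence.

# pipeline order as code; chunk, embed, and index are placeholders
def build_index(pdf_paths):
    pages = [ocr_page(p) for p in pdf_paths]          # intake and OCR, pinned config
    docs = [normalize_ocr_text(t) for t in pages]     # normalization and structure binding
    chunks = chunk(docs)                              # chunking, your implementation
    vectors = embed(chunks)                           # one tokenizer, one pooling, one norm step
    hashes = [ingest_guard(c, v) for c, v in zip(chunks, vectors)]  # ingestion guards
    return index(chunks, vectors, hashes)             # then retrieval, rerank, synthesis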

quick semantic-firewall repro

no infra change. one file and one prompt. open a fresh chat with your model provider, attach the engine pdf, and paste a short prompt that asks it to answer normally, then re-answer using the engine and print the one-line audit. you should see tighter constraint keeping, plus a visible recovery step when the chain stalls on OCR-contaminated chunks.


series index lives here
https://github.com/onestardao/WFGY/blob/main/ProblemMap/article/README.md
