PSBigBig

pdfs are quietly poisoning your embedding space. here is a field fix you can ship today

day 1 of my rag problem map series. pdf parsing, header footer boilerplate, ocr noise, vectorstore hygiene, and a 60 second semantic firewall repro


most rag failures i get called into are not about the retriever or the model. they are about the embedding space getting warped by pdf artifacts. if your top-k neighbors look like cover pages or legal notices, this is for you.

who is this for

  • anyone shipping retrieval augmented generation with pdf corpora
  • folks using faiss, qdrant, elastic knn, pgvector, chroma, llamaindex, langchain
  • teams seeing good neighbor scores and still getting off-by-one-page citations or confident hallucinations

the symptoms you probably saw

  • nearest neighbors are dominated by repeated header and footer lines
  • answers cite a glossary or toc even when the query is specific
  • reranker seems to save the day but only after a noisy top-k
  • long context inputs collapse into generic answers when the pdf has heavy layout
  • different ocr settings across files change the ranking with no code change

keywords that map here: rag pdf parsing, pdf header footer removal, ocr noise, cosine similarity drift, vector anisotropy, semantic firewall, embedding normalization, reranker overfitting, zero vector ingestion, faiss metric mismatch.

root cause in one screen

  • boilerplate dominance. repeated strings appear in hundreds of chunks. cosine or dot products love them.
  • layout flattening. naive pdf extractors break tables and detach captions, so the meaning of a single passage gets split across chunks.
  • ocr instability. engine swaps and language auto detect inject invisible tokens and mixed scripts.
  • pooling inconsistency. cls vs mean pooling, normalization order, and truncation create a semantic not equal embedding gap.

this maps to Problem Map No.1 hallucination and chunk drift, and No.5 semantic not equal embedding.
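
the pooling point is easy to see in a toy numpy sketch. this is not any particular model's pooling code, just an illustration that the same token vectors land in three different directions depending on pooling choice and normalization order. if queries and documents are embedded under different conventions, neighbors shift.

import numpy as np

tok = np.array([[1.0, 0.0], [0.0, 3.0], [0.0, 3.0]])   # toy token embeddings

def l2(v):
    return v / np.linalg.norm(v)

cls_emb        = l2(tok[0])                                        # cls pooling: first token only
mean_then_norm = l2(tok.mean(axis=0))                              # pool first, normalize once
norm_then_mean = l2(np.stack([l2(t) for t in tok]).mean(axis=0))   # normalize per token, then pool

print(cls_emb, mean_then_norm, norm_then_mean)  # three different directions in the same space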

10 minute diagnosis

copy these into a notebook and keep the output in your incident doc. we are checking duplication, metric hygiene, and space health.

# duplication across pages. high value implies header/footer pollution.
from collections import Counter

def duplication_rate(pages):
    all_lines = []
    for p in pages:
        all_lines += [ln.strip() for ln in p.splitlines() if ln.strip()]
    c = Counter(all_lines)
    dup = sum(v for _, v in c.items() if v > 1)
    return dup / max(1, len(all_lines))
# variance scan. near zero axes mean anisotropy or collapse.
import numpy as np

def variance_scan(emb):
    emb = np.array(emb)
    v = emb.var(axis=0)
    return float((v < 1e-6).mean()), float(v.mean())
# neighbor audit. print the raw strings, not just ids, for three blind queries.
# assumes a retriever whose search() returns hits carrying score, doc_id, and text.
def audit_neighbors(query, retriever, k=5):
    hits = retriever.search(query, k=k)
    for i, h in enumerate(hits):
        print(i, round(h.score, 3), h.doc_id, h.text[:160].replace("\n", " "))
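
to wire these together, here is one way to run the first two checks, assuming pypdf for extraction. swap in whatever extractor you already use, the checks do not care. the filename and the embedding matrix are placeholders.

from pypdf import PdfReader

pages = [page.extract_text() or "" for page in PdfReader("your_corpus.pdf").pages]
print("duplication rate:", duplication_rate(pages))

# emb is whatever matrix your embedding model produced for the chunks, shape (n_chunks, dim)
# zero_frac, mean_var = variance_scan(emb)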

quick sanity list

  • duplication rate below 0.12 is usually safe for policy and legal pdfs
  • zero axis fraction should be near 0 on modern embeddings
  • verify that the index metric matches the model's training assumption. do not mix cosine and l2 (a quick check is sketched below)
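
a minimal version of that metric check, assuming faiss. for cosine similarity, normalize once and use an inner-product index; feeding raw vectors to IndexFlatIP, or thinking in cosine while the index is IndexFlatL2, quietly skews rankings.

import numpy as np
import faiss

def build_cosine_index(emb):
    emb = np.array(emb, dtype="float32")     # copy so the in-place normalize stays local
    faiss.normalize_L2(emb)                  # normalize once, here and only here
    index = faiss.IndexFlatIP(emb.shape[1])  # inner product on unit vectors == cosine
    index.add(emb)
    return index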

minimal fix you can ship today

keep it boring. boring is reliable.

  1. strip boilerplate before chunking
  • remove headers and footers
  • bind figure captions to their figures
  • keep paragraphs contiguous. do not chunk by fixed length first
  2. freeze ocr (a pinning sketch follows this list)
  • fix engine and language per corpus
  • turn off auto rotation and auto script switching unless you audit outputs
  3. pooling and normalization
  • choose cls or mean pooling and document it
  • normalize once. do not normalize twice in different layers
  4. index hygiene (see the sketch after the pre-filter example)
  • drop empty texts and zero vectors
  • confirm the distance metric
  • store a stable content hash per chunk and log it in answers
  5. light rerank after cleaning
  • reranker should adjust at the margin
  • if rerank rescues garbage, your base space is wrong
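
for step 2, one way to pin ocr, assuming tesseract via pytesseract. the exact flags are an assumption about your setup; the point is that every file in the corpus gets the same engine mode, language, and page segmentation.

import pytesseract
from PIL import Image

OCR_LANG = "eng"                # pinned language for this corpus
OCR_CONFIG = "--oem 1 --psm 6"  # fixed engine mode, single text block, no auto orientation

def ocr_page(image_path):
    return pytesseract.image_to_string(Image.open(image_path), lang=OCR_LANG, config=OCR_CONFIG)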

tiny example pre-filter

def strip_boiler(text):
    # crude prefix list. "page" catches "page 3 of 12" style footers; extend per corpus.
    bad = ("table of contents", "glossary", "all rights reserved",
           "page", "confidential", "copyright")
    keep = []
    for ln in text.splitlines():
        s = ln.strip().lower()
        if len(s) < 3:
            continue
        if any(s.startswith(b) for b in bad):
            continue
        keep.append(ln)
    return "\n".join(keep)
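
and for step 4, a small hygiene sketch: drop empty texts and zero vectors before indexing, and keep a content hash per chunk so answers can be traced back. the row layout is just an illustration, not any particular store's schema.

import hashlib
import numpy as np

def prepare_rows(chunks, vectors):
    rows = []
    for text, vec in zip(chunks, vectors):
        vec = np.asarray(vec, dtype="float32")
        if not text.strip() or not np.any(vec):   # skip empty text and zero vectors
            continue
        content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        rows.append({"hash": content_hash, "text": text, "vector": vec})
    return rows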

acceptance test

declare success only if all pass on three blind queries.

  • top-k neighbors contain no boilerplate lines
  • citation page spans land inside the correct section
  • duplication rate drops at least thirty percent from baseline
  • variance scan shows no near zero axes
  • rerank improves by a small delta, not a rescue
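
a small gate you can run against those criteria. it reuses duplication_rate from the diagnosis section and the boilerplate prefixes from strip_boiler; baseline_pages, cleaned_pages, and top_k_texts are placeholders for your own corpus and retrieval output.

BOILER = ("table of contents", "glossary", "all rights reserved", "confidential", "copyright")

def acceptance_gate(baseline_pages, cleaned_pages, top_k_texts):
    base = duplication_rate(baseline_pages)
    clean = duplication_rate(cleaned_pages)
    checks = {
        "no boilerplate in top-k": not any(b in t.lower() for t in top_k_texts for b in BOILER),
        "duplication down >= 30%": clean <= 0.7 * base,
    }
    for name, ok in checks.items():
        print("pass" if ok else "FAIL", name)
    return all(checks.values())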

the mistake i keep seeing

teams tweak retriever parameters and reranker weights first. do not do that. fix intake and chunk semantics before touching retrieval. correct pipeline order looks like this:

intake and structure
then cleaning and boilerplate removal
then chunking
then embedding with documented pooling and normalization
then index metric check
then retriever and rerank
then synthesis and guardrails

reproduce a semantic firewall overlay in about a minute

no infra changes. one file plus one prompt.

  1. open a fresh chat
  2. upload the neutral pdf engine file
  3. paste
Use WFGY to answer: <your question>.
First answer normally; then re-answer using WFGY.
Compare depth, accuracy, and understanding. 
Print a one-line trace with doc_id, section_id, page_span, neighbor_ids, scores.

you should see tighter adherence to constraints and a visible recovery step if the chain stalls. this works in gpt or claude style chats that let you attach a pdf as a knowledge file.

field checklist you can copy into your runbook

  • pdf parsing and header footer removal is applied before chunking
  • ocr engine and language are pinned per corpus
  • embeddings use a single pooling method and single normalization
  • index metric matches the embedding family
  • zero vectors removed, empty texts removed, content hash stored
  • every answer logs doc_id, section_id, page_span, neighbor_ids, scores
  • reranker is optional and used only after the space is clean

mapping to my rag problem map

  • No.1 hallucination and chunk drift
  • No.5 semantic not equal embedding
  • sometimes No.8 debugging is a black box when trace lines are missing

Reference : https://github.com/onestardao/WFGY/blob/main/ProblemMap/article/README.md
