PSBigBig

pdfs are quietly poisoning your embedding space. here is a field fix you can ship today

day 1 of my rag problem map series. pdf parsing, header footer boilerplate, ocr noise, vectorstore hygiene, and a 60 second semantic firewall repro


most rag failures i get called into are not about the retriever or the model. they are about the embedding space getting warped by pdf artifacts. if your top-k neighbors look like cover pages or legal notices, this is for you.

who is this for

  • anyone shipping retrieval augmented generation with pdf corpora
  • folks using faiss, qdrant, elastic knn, pgvector, chroma, llamaindex, langchain
  • teams seeing good neighbor scores and still getting off-by-one-page citations or confident hallucinations

the symptoms you probably saw

  • nearest neighbors are dominated by repeated header and footer lines
  • answers cite a glossary or toc even when the query is specific
  • reranker seems to save the day but only after a noisy top-k
  • long context inputs collapse into generic answers when the pdf has heavy layout
  • different ocr settings across files change the ranking with no code change

keywords that map here: rag pdf parsing, pdf header footer removal, ocr noise, cosine similarity drift, vector anisotropy, semantic firewall, embedding normalization, reranker overfitting, zero vector ingestion, faiss metric mismatch.

root cause in one screen

  • boilerplate dominance. repeated strings appear in hundreds of chunks. cosine or dot products love them.
  • layout flattening. naive pdf extractors break tables and detach captions, so the meaning of a single passage gets split across chunks.
  • ocr instability. engine swaps and language auto detect inject invisible tokens and mixed scripts.
  • pooling inconsistency. cls vs mean pooling, normalization order, and truncation create a semantic not equal embedding gap.

this maps to Problem Map No.1 hallucination and chunk drift, and No.5 semantic not equal embedding.
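
the pooling point is easy to see in a toy numpy sketch. this is not any particular model's pooling code, just an illustration that the same token vectors land in three different directions depending on pooling choice and normalization order. if queries and documents are embedded under different conventions, neighbors shift.

import numpy as np

tok = np.array([[1.0, 0.0], [0.0, 3.0], [0.0, 3.0]])   # toy token embeddings

def l2(v):
    return v / np.linalg.norm(v)

cls_emb        = l2(tok[0])                                        # cls pooling: first token only
mean_then_norm = l2(tok.mean(axis=0))                              # pool first, normalize once
norm_then_mean = l2(np.stack([l2(t) for t in tok]).mean(axis=0))   # normalize per token, then pool

print(cls_emb, mean_then_norm, norm_then_mean)  # three different directions in the same space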

10 minute diagnosis

copy these into a notebook and keep the output in your incident doc. we are checking duplication, metric hygiene, and space health.

# duplication across pages. high value implies header/footer pollution.
from collections import Counter

def duplication_rate(pages):
    all_lines = []
    for p in pages:
        all_lines += [ln.strip() for ln in p.splitlines() if ln.strip()]
    c = Counter(all_lines)
    dup = sum(v for _, v in c.items() if v > 1)
    return dup / max(1, len(all_lines))
# variance scan. near zero axes mean anisotropy or collapse.
import numpy as np

def variance_scan(emb):
    emb = np.array(emb)
    v = emb.var(axis=0)
    return float((v < 1e-6).mean()), float(v.mean())
# neighbor audit. print the raw strings, not just ids, for three blind queries.
# assumes a retriever whose search() returns hits carrying score, doc_id, and text.
def audit_neighbors(query, retriever, k=5):
    hits = retriever.search(query, k=k)
    for i, h in enumerate(hits):
        print(i, round(h.score, 3), h.doc_id, h.text[:160].replace("\n", " "))
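
to wire these together, here is one way to run the first two checks, assuming pypdf for extraction. swap in whatever extractor you already use, the checks do not care. the filename and the embedding matrix are placeholders.

from pypdf import PdfReader

pages = [page.extract_text() or "" for page in PdfReader("your_corpus.pdf").pages]
print("duplication rate:", duplication_rate(pages))

# emb is whatever matrix your embedding model produced for the chunks, shape (n_chunks, dim)
# zero_frac, mean_var = variance_scan(emb)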

quick sanity list

  • duplication rate below 0.12 is usually safe for policy and legal pdfs
  • zero axis fraction should be near 0 on modern embeddings
  • verify that the index metric matches the model's training assumption. do not mix cosine and l2 (a quick check is sketched below)
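
a minimal version of that metric check, assuming faiss. for cosine similarity, normalize once and use an inner-product index; feeding raw vectors to IndexFlatIP, or thinking in cosine while the index is IndexFlatL2, quietly skews rankings.

import numpy as np
import faiss

def build_cosine_index(emb):
    emb = np.array(emb, dtype="float32")     # copy so the in-place normalize stays local
    faiss.normalize_L2(emb)                  # normalize once, here and only here
    index = faiss.IndexFlatIP(emb.shape[1])  # inner product on unit vectors == cosine
    index.add(emb)
    return index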

minimal fix you can ship today

keep it boring. boring is reliable.

  1. strip boilerplate before chunking
  • remove headers and footers
  • bind figure captions to their figures
  • keep paragraphs contiguous. do not chunk by fixed length first
  2. freeze ocr (a pinning sketch follows this list)
  • fix engine and language per corpus
  • turn off auto rotation and auto script switching unless you audit outputs
  3. pooling and normalization
  • choose cls or mean pooling and document it
  • normalize once. do not normalize twice in different layers
  4. index hygiene (see the sketch after the pre-filter example)
  • drop empty texts and zero vectors
  • confirm the distance metric
  • store a stable content hash per chunk and log it in answers
  5. light rerank after cleaning
  • reranker should adjust at the margin
  • if rerank rescues garbage, your base space is wrong
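
for step 2, one way to pin ocr, assuming tesseract via pytesseract. the exact flags are an assumption about your setup; the point is that every file in the corpus gets the same engine mode, language, and page segmentation.

import pytesseract
from PIL import Image

OCR_LANG = "eng"                # pinned language for this corpus
OCR_CONFIG = "--oem 1 --psm 6"  # fixed engine mode, single text block, no auto orientation

def ocr_page(image_path):
    return pytesseract.image_to_string(Image.open(image_path), lang=OCR_LANG, config=OCR_CONFIG)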

tiny example pre-filter

def strip_boiler(text):
    # crude prefix list. "page" catches "page 3 of 12" style footers; extend per corpus.
    bad = ("table of contents", "glossary", "all rights reserved",
           "page", "confidential", "copyright")
    keep = []
    for ln in text.splitlines():
        s = ln.strip().lower()
        if len(s) < 3:
            continue
        if any(s.startswith(b) for b in bad):
            continue
        keep.append(ln)
    return "\n".join(keep)
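
and for step 4, a small hygiene sketch: drop empty texts and zero vectors before indexing, and keep a content hash per chunk so answers can be traced back. the row layout is just an illustration, not any particular store's schema.

import hashlib
import numpy as np

def prepare_rows(chunks, vectors):
    rows = []
    for text, vec in zip(chunks, vectors):
        vec = np.asarray(vec, dtype="float32")
        if not text.strip() or not np.any(vec):   # skip empty text and zero vectors
            continue
        content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        rows.append({"hash": content_hash, "text": text, "vector": vec})
    return rows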

acceptance test

declare success only if all pass on three blind queries.

  • top-k neighbors contain no boilerplate lines
  • citation page spans land inside the correct section
  • duplication rate drops at least thirty percent from baseline
  • variance scan shows no near zero axes
  • rerank improves by a small delta, not a rescue
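
a small gate you can run against those criteria. it reuses duplication_rate from the diagnosis section and the boilerplate prefixes from strip_boiler; baseline_pages, cleaned_pages, and top_k_texts are placeholders for your own corpus and retrieval output.

BOILER = ("table of contents", "glossary", "all rights reserved", "confidential", "copyright")

def acceptance_gate(baseline_pages, cleaned_pages, top_k_texts):
    base = duplication_rate(baseline_pages)
    clean = duplication_rate(cleaned_pages)
    checks = {
        "no boilerplate in top-k": not any(b in t.lower() for t in top_k_texts for b in BOILER),
        "duplication down >= 30%": clean <= 0.7 * base,
    }
    for name, ok in checks.items():
        print("pass" if ok else "FAIL", name)
    return all(checks.values())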

the mistake i keep seeing

teams tweak retriever parameters and reranker weights first. do not do that. fix intake and chunk semantics before touching retrieval. correct pipeline order looks like this:

intake and structure
then cleaning and boilerplate removal
then chunking
then embedding with documented pooling and normalization
then index metric check
then retriever and rerank
then synthesis and guardrails

reproduce a semantic firewall overlay in about a minute

no infra changes. one file plus one prompt.

  1. open a fresh chat
  2. upload the neutral pdf engine file
  3. paste
Use WFGY to answer: <your question>.
First answer normally; then re-answer using WFGY.
Compare depth, accuracy, and understanding. 
Print a one-line trace with doc_id, section_id, page_span, neighbor_ids, scores.

you should see tighter adherence to constraints and a visible recovery step if the chain stalls. this works in gpt or claude style chats that let you attach a pdf as a knowledge file.

field checklist you can copy into your runbook

  • pdf parsing and header footer removal is applied before chunking
  • ocr engine and language are pinned per corpus
  • embeddings use a single pooling method and single normalization
  • index metric matches the embedding family
  • zero vectors removed, empty texts removed, content hash stored
  • every answer logs doc_id, section_id, page_span, neighbor_ids, scores
  • reranker is optional and used only after the space is clean

mapping to my rag problem map

  • No.1 hallucination and chunk drift
  • No.5 semantic not equal embedding
  • sometimes No.8 debugging is a black box when trace lines are missing

Reference : https://github.com/onestardao/WFGY/blob/main/ProblemMap/article/README.md
