day 1 of my rag problem map series. pdf parsing, header footer boilerplate, ocr noise, vectorstore hygiene, and a 60 second semantic firewall repro
most rag failures i get called into are not about the retriever or the model. they are about the embedding space getting warped by pdf artifacts. if your top-k neighbors look like cover pages or legal notices, this is for you.
who is this for
- anyone shipping retrieval augmented generation with pdf corpora
- folks using faiss, qdrant, elastic knn, pgvector, chroma, llamaindex, langchain
- teams seeing good neighbor scores and still getting off-by-one page citations or confident hallucinations
the symptoms you probably saw
- nearest neighbors are dominated by repeated header and footer lines
- answers cite a glossary or toc even when the query is specific
- reranker seems to save the day but only after a noisy top-k
- long context inputs collapse into generic answers when the pdf has heavy layout
- different ocr settings across files change the ranking with no code change
keywords that map here: rag pdf parsing, pdf header footer removal, ocr noise, cosine similarity drift, vector anisotropy, semantic firewall, embedding normalization, reranker overfitting, zero vector ingestion, faiss metric mismatch.
root cause in one screen
- boilerplate dominance. repeated strings appear in hundreds of chunks. cosine or dot products love them.
- layout flattening. naive pdf extractors break tables and detach captions. semantics gets split across chunks.
- ocr instability. engine swaps and language auto detect inject invisible tokens and mixed scripts.
- pooling inconsistency. cls vs mean pooling, normalization order, and truncation create a semantic not equal embedding gap.
this maps to Problem Map No.1 hallucination and chunk drift, and No.5 semantic not equal embedding.
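the pooling point is easy to see with synthetic data. a minimal sketch below, using random token vectors instead of a real model, shows that cls and mean pooling give very different sentence vectors for the same input, so mixing them across ingestion runs silently shifts the space.

```python
# minimal sketch with synthetic token embeddings, not a real model.
# cls pooling and mean pooling produce different vectors for the same text,
# so mixing them across runs warps neighbor rankings.
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(32, 384))   # stand-in for 32 token embeddings, 384 dims

cls_vec = tokens[0]                   # cls pooling: first token only
mean_vec = tokens.mean(axis=0)        # mean pooling: average over all tokens

cos = np.dot(cls_vec, mean_vec) / (np.linalg.norm(cls_vec) * np.linalg.norm(mean_vec))
print(round(float(cos), 3))           # nowhere near 1.0
```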
10 minute diagnosis
copy these into a notebook and keep the output in your incident doc. we are checking duplication, metric hygiene, and space health.
```python
# duplication across pages. high value implies header/footer pollution.
from collections import Counter

def duplication_rate(pages):
    all_lines = []
    for p in pages:
        all_lines += [ln.strip() for ln in p.splitlines() if ln.strip()]
    c = Counter(all_lines)
    dup = sum(v for _, v in c.items() if v > 1)
    return dup / max(1, len(all_lines))

# variance scan. near zero axes mean anisotropy or collapse.
import numpy as np

def variance_scan(emb):
    emb = np.array(emb)
    v = emb.var(axis=0)
    return float((v < 1e-6).mean()), float(v.mean())

# neighbor audit. print the raw strings, not just ids, for three blind queries.
def audit_neighbors(query, retriever, k=5):
    hits = retriever.search(query, k=k)
    for i, h in enumerate(hits):
        print(i, round(h.score, 3), h.doc_id, h.text[:160].replace("\n", " "))
```
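a quick way to wire the three checks together. `pages`, `embeddings`, and `retriever` are placeholders for your own extraction output, embedding matrix, and search client, and the three queries are just examples.

```python
# example harness, assuming you already have pages, embeddings, and a retriever
dup = duplication_rate(pages)
zero_axes, mean_var = variance_scan(embeddings)
print("duplication rate:", round(dup, 3))
print("near-zero axis fraction:", zero_axes, "mean variance:", round(mean_var, 6))

for q in ["refund window for enterprise plans",
          "data retention period",
          "who signs the dpa"]:
    print("\nquery:", q)
    audit_neighbors(q, retriever)
```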
quick sanity list
- duplication rate below 0.12 is usually safe for policy and legal pdfs
- zero axis fraction should be near 0 on modern embeddings
- verify index metric matches the model training assumption. cosine vs l2 should not be mixed
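if you are on faiss, the metric check is two lines. a minimal sketch, assuming your embeddings are meant for cosine similarity: normalize once, use an inner-product index, and assert the metric so a later refactor cannot silently swap it.

```python
# minimal sketch, assuming faiss and cosine-trained embeddings.
# cosine similarity == inner product on unit-length vectors.
import faiss
import numpy as np

emb = np.asarray(embeddings, dtype="float32")    # `embeddings` from your pipeline
faiss.normalize_L2(emb)                          # normalize once, in place
index = faiss.IndexFlatIP(emb.shape[1])          # inner product index
index.add(emb)

assert index.metric_type == faiss.METRIC_INNER_PRODUCT
```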
minimal fix you can ship today
keep it boring. boring is reliable.
- strip boilerplate before chunking
  - remove headers and footers
  - bind figure captions to their figures
  - keep paragraphs contiguous. do not chunk by fixed length first
- freeze ocr
  - fix engine and language per corpus
  - turn off auto rotation and auto script switching unless you audit outputs
- pooling and normalization
  - choose cls or mean pooling and document it
  - normalize once. do not normalize twice in different layers
- index hygiene
  - drop empty texts and zero vectors
  - confirm the distance metric
  - store a stable content hash per chunk and log it in answers
- light rerank after cleaning
  - reranker should adjust at the margin
  - if rerank rescues garbage, your base space is wrong
tiny pre filter example

```python
def strip_boiler(text):
    bad = ("table of contents", "glossary", "all rights reserved",
           "page", "confidential", "copyright")
    keep = []
    for ln in text.splitlines():
        s = ln.strip().lower()
        if len(s) < 3:
            continue
        if any(s.startswith(b) for b in bad):
            continue
        keep.append(ln)
    return "\n".join(keep)
```
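two more items from the list above, freeze ocr and index hygiene, are just as mechanical. first ocr: if your stack happens to be tesseract via pytesseract, pinning means fixing engine mode, page segmentation, and language per corpus. the values below are illustrative, not a recommendation.

```python
# minimal sketch, assuming pytesseract. pinned settings keep rankings stable
# when files re-enter the pipeline; values here are illustrative only.
import pytesseract
from PIL import Image

OCR_LANG = "eng"                  # fixed language, no auto detect
OCR_CONFIG = "--oem 1 --psm 6"    # fixed engine mode and page segmentation

def ocr_page(path):
    return pytesseract.image_to_string(Image.open(path), lang=OCR_LANG, config=OCR_CONFIG)
```

and a matching sketch for index hygiene: drop empty texts and zero vectors, store a content hash per chunk so every answer can log a stable id. `chunks` is a placeholder for your own (text, vector) pairs.

```python
# minimal index hygiene sketch, assuming `chunks` is a list of (text, vector) pairs
import hashlib
import numpy as np

def clean_for_index(chunks):
    records = []
    for text, vec in chunks:
        v = np.asarray(vec, dtype="float32")
        if not text.strip() or not np.any(v):       # empty text or zero vector
            continue
        chunk_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        records.append({"text": text, "vector": v, "hash": chunk_hash})
    return records
```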
acceptance test
declare success only if all pass on three blind queries.
- top-k neighbors contain no boilerplate lines
- citation page spans land inside the correct section
- duplication rate drops at least thirty percent from baseline
- variance scan shows no near zero axes
- rerank improves by a small delta, not a rescue
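if you prefer the acceptance test as code rather than a list, a minimal harness could look like this. the inputs are numbers you already collected with the diagnosis functions above, run before and after cleaning; citation spans and the rerank delta still get eyeballed on the three blind queries.

```python
# minimal acceptance sketch: pass/fail against the list above.
# citation spans and rerank delta are checked manually on the blind queries.
def acceptance(baseline_dup, cleaned_dup, zero_axis_fraction, boiler_hits_topk):
    checks = {
        "no boilerplate in top-k": boiler_hits_topk == 0,
        "duplication down >= 30%": cleaned_dup <= 0.7 * baseline_dup,
        "no near-zero axes": zero_axis_fraction == 0.0,
    }
    return all(checks.values()), checks
```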
the mistake i keep seeing
teams tweak retriever parameters and reranker weights first. do not do that. fix intake and chunk semantics before touching retrieval. correct pipeline order looks like this:
intake and structure
then cleaning and boilerplate removal
then chunking
then embedding with documented pooling and normalization
then index metric check
then retriever and rerank
then synthesis and guardrails
reproduce a semantic firewall overlay in about a minute
no infra changes. one file plus one prompt.
- open a fresh chat
- upload the neutral pdf engine file
- paste
```
Use WFGY to answer: <your question>.
First answer normally; then re-answer using WFGY.
Compare depth, accuracy, and understanding.
Print a one-line trace with doc_id, section_id, page_span, neighbor_ids, scores.
```
you should see tighter constraint keeping and a visible recovery step if the chain stalls. this works on gpt or claude style chats that let you attach a pdf as a knowledge file.
field checklist you can copy into your runbook
- pdf parsing and header footer removal is applied before chunking
- ocr engine and language are pinned per corpus
- embeddings use a single pooling method and single normalization
- index metric matches the embedding family
- zero vectors removed, empty texts removed, content hash stored
- every answer logs doc_id, section_id, page_span, neighbor_ids, scores (see the trace sketch after this list)
- reranker is optional and used only after the space is clean
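the logging item is the easiest win, and it is what keeps No.8 from biting later. a minimal sketch for one trace line per answer, field names straight from the checklist:

```python
# one structured trace line per answer, matching the checklist fields
import json

def trace_line(doc_id, section_id, page_span, neighbor_ids, scores):
    return json.dumps({
        "doc_id": doc_id,
        "section_id": section_id,
        "page_span": page_span,
        "neighbor_ids": neighbor_ids,
        "scores": [round(float(s), 3) for s in scores],
    })
```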
mapping to my rag problem map
- No.1 hallucination and chunk drift
- No.5 semantic not equal embedding
- sometimes No.8 debugging is a black box when trace lines are missing
Reference: https://github.com/onestardao/WFGY/blob/main/ProblemMap/article/README.md