- Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Most "RAG didn't work for us" stories are actually "we used RecursiveCharacterTextSplitter on PDFs" stories. The fix isn't a better model. It's four layers your pipeline doesn't have.
The 40% number: where signal goes to die
Take a typical SEC 10-K filing. Two columns of body text. A footer that repeats on every page. Footnotes at the bottom. A table that spans three pages. A figure with a caption that explains half the surrounding paragraph.
Run that through PyPDF and RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200). Here's what you get back: the left column of page 3 concatenated with the right column of page 3, the page footer spliced into the middle of a sentence, the table rendered as Q1 Q2 Q3 Q4 Revenue 12 14 11 16 Expenses 8 9 7 10 with no row alignment, and a figure caption stranded 800 tokens away from the paragraph it explains.
Then someone asks "what was Q3 revenue?" Your retriever returns the chunk with the mangled table. The model answers $11M because 11 is the third number after Revenue and nothing in the chunk says what unit it's in. It's actually $11B. The unit row got chopped off.
A published study on a corpus of 4,000 financial filings found that naive character-chunked retrieval missed the right span 38% of the time on table-grounded questions. Layout-aware chunking dropped that to 6%. Same embedding model. Same retriever. Same reranker.
PDF isn't a content format. It's a paint format: (x, y, glyph) tuples for a printer. Treating it like prose is the bug.
The fix is four orthogonal layers. None of them are "buy a better model".
Layer 1: Reading order detection
A PDF stores glyphs in draw order, not reading order. The renderer doesn't care which column comes first. Your chunker should.
Reading order means: given a page with N text blocks, return them in the order a human would read them. Two columns, then footnotes, then page footer. Not "top-left to bottom-right by glyph coordinate".
How tools do it:
- Heuristic: cluster blocks into columns by x-position, then sort top-to-bottom within each column. Works for 80% of multi-column documents. Breaks on rotated pages, sidebars, callout boxes.
- ML-based: run a layout detection model (LayoutLMv3, DiT, or Docling's layout parser) that classifies every block as `text`, `title`, `list`, `table`, `figure`, `footnote`, `page-header`, `page-footer`. Then sort by semantic role.
- Skip the broken paths: drop everything classified as `page-header` and `page-footer` before chunking. Those repeat on every page and pollute embeddings.
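If you don't have a layout model labelling blocks yet, the "repeats on every page" property is itself a usable signal. Here's a minimal sketch of that fallback, which is a swapped-in heuristic rather than what the ML-based tools do: it drops any line that shows up on most pages once digits are masked. The function names and the 0.6 threshold are assumptions for illustration.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    # Mask digits so "Page 3 of 90" and "Page 4 of 90" count as the same line.
    return re.sub(r"\d+", "#", line.strip().lower())


def drop_repeated_lines(pages: list[list[str]], min_share: float = 0.6) -> list[list[str]]:
    # Count on how many pages each normalized line appears (once per page).
    seen: Counter = Counter()
    for lines in pages:
        seen.update({normalize(line) for line in lines})
    # Require at least two pages so a single-page document is left untouched.
    threshold = max(2, min_share * len(pages))
    repeated = {text for text, n in seen.items() if n >= threshold}
    return [[line for line in lines if normalize(line) not in repeated] for lines in pages]


pages = [
    ["ACME Corp 10-K", "Revenue grew.", "Page 1 of 3"],
    ["ACME Corp 10-K", "Costs fell.", "Page 2 of 3"],
    ["ACME Corp 10-K", "Outlook is stable.", "Page 3 of 3"],
]
cleaned = drop_repeated_lines(pages)
print(cleaned)  # [['Revenue grew.'], ['Costs fell.'], ['Outlook is stable.']]
```

It will also eat body lines that legitimately repeat on most pages, so treat it as a stopgap until block roles are available.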
A small caveat that bites: some libraries call this "reading order" but actually return blocks sorted by (y, x). That's not reading order. That's raster order. Test on a two-column paper before trusting it.
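To make the difference concrete, here's a small sketch of raster order next to the column heuristic described above. The `Block` type, the `gap` threshold, and the coordinates are made up for illustration; real parsers give you richer bounding boxes.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float  # left edge of the block
    y0: float  # top edge of the block (0 = top of page)


def raster_order(blocks: list[Block]) -> list[Block]:
    # Sort by (y, x): what some libraries return and call "reading order".
    return sorted(blocks, key=lambda b: (b.y0, b.x0))


def column_order(blocks: list[Block], gap: float = 50.0) -> list[Block]:
    # Cluster by left edge: a new column starts when x0 jumps by more than `gap`.
    columns: list[list[Block]] = []
    for b in sorted(blocks, key=lambda b: b.x0):
        if columns and b.x0 - columns[-1][-1].x0 <= gap:
            columns[-1].append(b)
        else:
            columns.append([b])
    # Columns left to right, each column top to bottom.
    return [b for col in columns for b in sorted(col, key=lambda b: b.y0)]


page = [
    Block("L1", 50, 100), Block("R1", 320, 100),
    Block("L2", 50, 200), Block("R2", 320, 200),
]
print([b.text for b in raster_order(page)])  # ['L1', 'R1', 'L2', 'R2']
print([b.text for b in column_order(page)])  # ['L1', 'L2', 'R1', 'R2']
```

The two-column page is the whole test: if your library's output looks like the first line, it's raster order.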
Layer 2: Structural chunking by section, not token count
Once you have reading order plus block roles, stop chunking by tokens. Chunk by document structure.
The rule: a chunk is a section. A section ends at the next heading of equal-or-higher level, or at a table, or at a figure. Add the parent heading chain as metadata so retrieval gets context for free.
```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str
    section_path: list[str]      # ["3. Risk Factors", "3.2 Liquidity"]
    page_range: tuple[int, int]
    block_type: str              # "text" | "table" | "table_row" | "figure_caption"
    source_bbox: list[tuple]     # for citation
```
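The "next heading of equal-or-higher level" rule is easiest to see as a stack. Here's a minimal sketch, assuming your parser hands you headings as `(level, title)` pairs with level 1 as the top; the function name is hypothetical.

```python
def section_paths(headings: list[tuple[int, str]]) -> list[list[str]]:
    stack: list[str] = []
    paths: list[list[str]] = []
    for level, title in headings:
        # A heading at level N closes every open section at level N or deeper.
        stack = stack[: level - 1]
        stack.append(title)
        paths.append(list(stack))
    return paths


headings = [
    (1, "3. Risk Factors"),
    (2, "3.1 Market"),
    (2, "3.2 Liquidity"),
    (1, "4. Legal Proceedings"),
]
paths = section_paths(headings)
for path in paths:
    print(" > ".join(path))
# 3. Risk Factors
# 3. Risk Factors > 3.1 Market
# 3. Risk Factors > 3.2 Liquidity
# 4. Legal Proceedings
```

Each path is what goes into `section_path` for every chunk emitted until the next heading arrives.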
Why this beats fixed-size chunks:
- A query about Section 3.2 Liquidity Risk now hits a chunk that actually is that section, not "the last 700 tokens before we ran out of budget".
- The `section_path` becomes searchable metadata. You can pre-filter by section before vector search on long-form documents.
- You stop splitting sentences mid-clause. Token boundaries are arbitrary. Section boundaries aren't.
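The pre-filter from the second point might look like this. It's a toy sketch: chunks are plain dicts, the two-dimensional embeddings are invented, and in a real system the filter would be a metadata clause in your vector store rather than a list comprehension.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], chunks: list[dict], section_prefix: list[str], k: int = 3) -> list[dict]:
    # Keep only chunks whose section path starts with the requested prefix.
    n = len(section_prefix)
    candidates = [c for c in chunks if c["section_path"][:n] == section_prefix]
    # Rank the survivors by similarity to the query.
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:k]


chunks = [
    {"id": "a", "section_path": ["3. Risk Factors", "3.2 Liquidity"], "embedding": [1.0, 0.0]},
    {"id": "b", "section_path": ["3. Risk Factors", "3.1 Market"], "embedding": [0.9, 0.1]},
    {"id": "c", "section_path": ["7. MD&A"], "embedding": [1.0, 0.0]},
]
hits = search([1.0, 0.0], chunks, ["3. Risk Factors"])
print([c["id"] for c in hits])  # ['a', 'b']
```

Chunk `c` is a perfect vector match and still never gets scored, which is the point of filtering first.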
Sections that exceed your context budget still need to be split. Do it at paragraph boundaries, not character offsets, and keep the section path on every sub-chunk so they cluster together at retrieval time.
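A minimal sketch of that split, assuming paragraphs are separated by blank lines and using word count as a stand-in for your real token counter. The function name and the `part` field are hypothetical.

```python
def split_section(text: str, section_path: list[str], max_words: int = 200) -> list[dict]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    parts: list[str] = []
    current: list[str] = []
    count = 0
    for p in paragraphs:
        n = len(p.split())
        # Close the current sub-chunk before it would exceed the budget.
        if current and count + n > max_words:
            parts.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        parts.append("\n\n".join(current))
    # Every sub-chunk carries the same section path.
    return [
        {"text": t, "section_path": list(section_path), "part": i}
        for i, t in enumerate(parts)
    ]


section = "a b c\n\nd e f\n\ng h i"
subs = split_section(section, ["3. Risk Factors", "3.2 Liquidity"], max_words=6)
print([s["text"] for s in subs])  # ['a b c\n\nd e f', 'g h i']
```

A single paragraph longer than the budget is kept whole here rather than cut mid-sentence; whether that's acceptable depends on your embedding model's limit.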
One gotcha: when a section is shorter than your minimum chunk size (a 30-word footnote, for example), don't drop it. Don't merge it with an unrelated neighbour either. Keep it as a small chunk. Tiny chunks with strong relevance signal beat oversized chunks every time.
Layer 3: Table extraction as separate documents
Tables are the single biggest reason naive PDF RAG fails. They look like prose to a character chunker. They aren't.
Extract every table as a separate document with two representations:
- Markdown rendering of the full table, kept as one chunk. Good for "summarize this table" questions.
- Row-level documents, one per row, with the table caption and column headers prepended. Good for "what was Q3 revenue?" questions.
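```python
# From a Docling extraction. The attributes on `table` (caption, headers,
# rows, section_path, page, bbox, to_markdown) are simplified for
# readability; map them onto what your Docling version actually exposes.
for table in doc.tables:
    md = table.to_markdown()
    chunks.append(Chunk(
        doc_id=doc_id,
        text=f"Table: {table.caption}\n\n{md}",
        section_path=table.section_path,
        page_range=(table.page, table.page),
        block_type="table",
        source_bbox=[table.bbox],
    ))
    # row-level: each row carries header context
    headers = table.headers
    for row in table.rows:
        row_text = "\n".join(
            f"{h}: {v}" for h, v in zip(headers, row)
        )
        chunks.append(Chunk(
            doc_id=doc_id,
            text=f"Table: {table.caption}\n{row_text}",
            block_type="table_row",
            section_path=table.section_path,
            page_range=(table.page, table.page),
            source_bbox=[table.bbox],
        ))
```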
The duplication is fine. Both representations help. The retriever picks whichever the query wants.
Two things that always go wrong:
- Multi-page tables lose their headers on page 2+. Detect "table continued" by checking if a page starts mid-table (no preceding heading, matching column structure to the prior page) and propagate the headers forward.
- Merged cells wreck row alignment. Docling and LlamaParse handle this reasonably. PyMuPDF's table extraction does not; it'll silently shift cells one column over.
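The header propagation from the first point can be sketched like this. `TableFragment` is a hypothetical shape for "one table's worth of rows on one page"; the continuation test is the one described above (no preceding heading, no header row of its own, next page, same column count).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableFragment:
    page: int
    headers: Optional[list[str]]  # None when no header row was detected
    rows: list[list[str]]
    preceded_by_heading: bool = False


def propagate_headers(fragments: list[TableFragment]) -> list[TableFragment]:
    ordered = sorted(fragments, key=lambda f: f.page)
    prev: Optional[TableFragment] = None
    for frag in ordered:
        continues_prev = (
            prev is not None
            and prev.headers is not None
            and frag.headers is None
            and not frag.preceded_by_heading
            and frag.page == prev.page + 1
            and bool(frag.rows)
            and len(frag.rows[0]) == len(prev.headers)
        )
        if continues_prev:
            # Mutates the fragment in place: it inherits the prior page's headers.
            frag.headers = list(prev.headers)
        prev = frag
    return ordered


fragments = [
    TableFragment(3, ["Segment", "Q3 Revenue"], [["Americas", "12"]], preceded_by_heading=True),
    TableFragment(4, None, [["EMEA", "7"]]),
    TableFragment(5, None, [["APAC", "4"]]),
    TableFragment(9, None, [["a", "b", "c"]]),
]
fixed = propagate_headers(fragments)
print([f.headers for f in fixed])
# [['Segment', 'Q3 Revenue'], ['Segment', 'Q3 Revenue'], ['Segment', 'Q3 Revenue'], None]
```

The fragment on page 9 stays headerless: wrong page, wrong column count. Better to leave it flagged than to guess.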
Layer 4: Image OCR and figure captioning
Two cases. Native PDFs where text is selectable: skip OCR, you have the text already. Scanned PDFs where pages are images: every page needs OCR.
For figures inside native PDFs (charts, diagrams, screenshots), OCR alone isn't enough. The text in a bar chart's axis labels doesn't explain what the chart shows. Use a VLM (Claude, GPT-4o, Qwen-VL) to generate a caption from the image, then store the caption as a chunk linked to the figure's bbox.
```python
def caption_figure(image_bytes: bytes, surrounding: str) -> str:
    # surrounding = the paragraph immediately before/after
    # the figure. Gives the VLM context.
    # `vlm.describe` stands in for whatever VLM client you use.
    return vlm.describe(
        image_bytes,
        prompt=(
            "Describe this figure in 2-3 sentences. "
            "Include any visible numbers, axis labels, "
            "trend direction. Use surrounding text for "
            f"context:\n\n{surrounding}"
        ),
    )
```
A caption like "Bar chart showing Q3 revenue by region: Americas $12B, EMEA $7B, APAC $4B. Americas is up 18% year-over-year" is retrievable. The original PNG isn't.
Cost discipline matters here. Captioning every figure with GPT-4o on a 4,000-document corpus runs into the four-figure dollar range. Run captions once at ingestion. Cache them. Don't re-caption on re-index unless the figure changed.
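One way to get "unless the figure changed" for free is to key the cache on a hash of the image bytes. A minimal sketch, assuming a JSON file is an acceptable store; `CaptionCache` and `fake_vlm` are hypothetical names, and `fake_vlm` stands in for the real captioning call.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


class CaptionCache:
    def __init__(self, path: str):
        self.path = Path(path)
        # Load earlier captions if the cache file already exists.
        self.data: dict[str, str] = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def get_or_create(self, image_bytes: bytes, make_caption: Callable[[bytes], str]) -> str:
        # Same pixels, same key: an unchanged figure never hits the VLM again.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self.data:
            self.data[key] = make_caption(image_bytes)
            self.path.write_text(json.dumps(self.data))
        return self.data[key]


calls: list[bytes] = []


def fake_vlm(image_bytes: bytes) -> str:
    calls.append(image_bytes)
    return "Bar chart of revenue by region."


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "captions.json")
    cache = CaptionCache(cache_file)
    cache.get_or_create(b"png-bytes", fake_vlm)
    cache.get_or_create(b"png-bytes", fake_vlm)        # served from memory
    reloaded = CaptionCache(cache_file)                # simulates a re-index run
    caption = reloaded.get_or_create(b"png-bytes", fake_vlm)

print(len(calls))  # 1
```

Rewriting the whole file on every miss is fine for a sketch; at corpus scale you'd want a key-value store instead.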
Pick the right layers for the document type
You don't need all four for every document. The layers are orthogonal, but documents have different failure modes.
| Document type | Reading order | Structural | Tables | OCR / captions |
|---|---|---|---|---|
| Legal contracts (born-digital) | required | required | optional | skip |
| Scientific papers (2-column) | required | required | required | required (figures) |
| SEC filings (10-K, 10-Q) | required | required | required | optional |
| Scanned contracts / historical | required (post-OCR) | partial | partial | required (whole) |
| Slide decks exported to PDF | partial | skip | partial | required (figures) |
| Internal wikis exported as PDF | optional | required | optional | skip |
| Invoices and receipts | partial | skip | required | required if scanned |
A few honest observations from running this matrix in production:
- Contracts are 90% structural. Sections, sub-sections, definitions. Get clauses as discrete chunks with the full clause path as metadata and your recall jumps immediately.
- Papers are 50% table-and-figure value. If you skip Layer 3 and Layer 4 on a corpus of arXiv PDFs, you're throwing away most of the citable content.
- Slide decks are weird. Visual structure is the document. A title-and-three-bullets slide is one chunk. A diagram slide is one VLM caption. Forget reading order.
Library map: what does what in 2026
| Library | Reading order | Structural | Tables | OCR / VLM | Cost / speed | Notes |
|---|---|---|---|---|---|---|
| Docling (IBM, OSS) | Excellent | Excellent | Excellent | Built-in OCR, external VLM | Free, GPU helps | Best general-purpose OSS option. Outputs structured JSON with section tree and table rows. Default pick. |
| unstructured.io | Good | Good | Good for simple tables | OCR via Tesseract/PaddleOCR | Free OSS, paid API | Mature. Strong on heterogeneous corpora. Table quality lags Docling. |
| LlamaParse | Excellent | Excellent | Best-in-class | Built-in VLM captioning | Paid, ~$3/1k pages | If tables are the bottleneck and you can pay per page, this is the cleanest output. Vendor lock-in. |
| PyMuPDF | Manual | Manual | Manual | None native | Free, very fast | Low-level. Great as a parser primitive. Don't ship raw page.get_text() to RAG. |
| PyPDF / pdfplumber | Manual | Manual | pdfplumber: decent | None | Free, fast | Where most projects start. Where most projects get stuck. |
| Marker (OSS) | Excellent | Excellent | Good | Built-in OCR + math | Free, GPU needed | Strong on academic PDFs and math. Slower than Docling on CPU. |
| AWS Textract / Azure DI | Good | Partial | Excellent | Built-in OCR | Paid per page | Compliance-friendly. Black-box layout decisions. Good for scanned forms. |
Honest scoring: as of mid-2026, Docling is the default. LlamaParse wins on raw table quality if budget allows. PyMuPDF is a primitive, not a pipeline.
A reference pipeline in 80 lines
This is the shape every layout-aware ingestion pipeline ends up taking. It uses Docling for parsing and a VLM call for figure captions. The 80 lines aren't the whole production system (no retry, no batching, no cost accounting) but the structure is what matters.
```python
import base64
from dataclasses import dataclass
from typing import Iterator

from anthropic import Anthropic
from docling.document_converter import DocumentConverter

vlm = Anthropic()


@dataclass
class Chunk:
    doc_id: str
    text: str
    section_path: list[str]
    page_range: tuple[int, int]
    block_type: str
    bbox: list[tuple] | None


def ingest(path: str, doc_id: str) -> Iterator[Chunk]:
    result = DocumentConverter().convert(path)
    doc = result.document

    # NOTE: the item attributes used below (label strings, level, page, bbox,
    # caption, surrounding_text, image_bytes, export_rows_markdown) are
    # simplified. Check them against your installed Docling version; for
    # example, docling-core's iterate_items() has yielded (item, level) pairs.

    # drop page headers/footers up front
    blocks = [
        b for b in doc.iterate_items()
        if b.label not in {"page-header", "page-footer"}
    ]

    section_stack: list[str] = []
    buffer: list[str] = []
    buffer_pages: set[int] = set()

    def flush() -> Chunk | None:
        # Emit the buffered text blocks as one chunk under the current section.
        if not buffer:
            return None
        c = Chunk(
            doc_id=doc_id,
            text="\n\n".join(buffer),
            section_path=list(section_stack),
            page_range=(min(buffer_pages), max(buffer_pages)),
            block_type="text",
            bbox=None,
        )
        buffer.clear()
        buffer_pages.clear()
        return c

    for b in blocks:
        if b.label in {"title", "section-header"}:
            chunk = flush()
            if chunk:
                yield chunk
            # A heading closes every open section at its level or deeper.
            level = getattr(b, "level", 1)
            section_stack = section_stack[: level - 1]
            section_stack.append(b.text)
        elif b.label == "table":
            chunk = flush()
            if chunk:
                yield chunk
            md = b.export_to_markdown()
            yield Chunk(
                doc_id=doc_id,
                text=f"Table: {b.caption or ''}\n\n{md}",
                section_path=list(section_stack),
                page_range=(b.page, b.page),
                block_type="table",
                bbox=[b.bbox],
            )
            for row_md in b.export_rows_markdown():
                yield Chunk(
                    doc_id=doc_id,
                    text=f"{b.caption or ''}\n{row_md}",
                    section_path=list(section_stack),
                    page_range=(b.page, b.page),
                    block_type="table_row",
                    bbox=[b.bbox],
                )
        elif b.label == "figure":
            chunk = flush()
            if chunk:
                yield chunk
            surrounding = b.surrounding_text or ""
            caption = caption_figure(b.image_bytes, surrounding)
            yield Chunk(
                doc_id=doc_id,
                text=f"Figure: {caption}",
                section_path=list(section_stack),
                page_range=(b.page, b.page),
                block_type="figure_caption",
                bbox=[b.bbox],
            )
        else:
            buffer.append(b.text)
            buffer_pages.add(b.page)

    last = flush()
    if last:
        yield last


def caption_figure(image: bytes, surrounding: str) -> str:
    msg = vlm.messages.create(
        model="claude-opus-4-7",
        max_tokens=200,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {
                "type": "base64", "media_type": "image/png",
                # The API expects base64 text, not raw bytes.
                "data": base64.b64encode(image).decode("ascii"),
            }},
            {"type": "text", "text": (
                "Describe this figure in 2-3 sentences. "
                "Include visible numbers, labels, trends. "
                f"Context:\n{surrounding[:600]}"
            )},
        ]}],
    )
    return msg.content[0].text
```
Notice what's missing: no chunk_size. No chunk_overlap. Token budgets are a downstream concern handled at embedding time, and even then, only as a guard. The pipeline produces chunks by document structure first. If a section is too big, split it at paragraph boundaries inside this same loop. Don't outsource that to a generic character splitter.
Eval before and after
Numbers from a swap-out reported on a 4,000-document financial-filings corpus, comparing PyPDF + RecursiveCharacterTextSplitter(1000, 200) against the pipeline above (Docling plus section chunking plus table rows plus figure captions):
- Top-5 recall on table-grounded questions: 62% to 94%.
- Top-5 recall on section-anchored questions: 71% to 89%.
- Top-5 recall on free-text questions: 84% to 87% (small gain; character chunking already does fine here).
- Ingestion cost per document: ~3x higher. Storage cost: ~1.4x higher (more chunks).
- Answer correctness on a 200-question eval set: 68% to 91%.
The cost numbers matter. Layout-aware ingestion isn't free. You pay for it once, at ingestion. You collect on every query for the life of the corpus. For a corpus the user queries often, the math is obvious. For an archive nobody touches, it's wasted work.
Run the eval on your own corpus before deciding which layers you need. The matrix above is a starting point, not a verdict. A contract corpus might not need Layer 4 at all. A research paper corpus lives and dies on Layer 4. Measure.
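If you don't have an eval harness yet, top-k recall is a few lines. This sketch assumes "recall" means "at least one gold span appears in the top k", and it keys gold labels on `(doc_id, page)` so they stay valid when the chunking changes. The data below is toy data to show the shape, not results.

```python
Span = tuple[str, int]  # (doc_id, page): stable across chunking strategies


def recall_at_k(retrieved: dict[str, list[Span]], gold: dict[str, set[Span]], k: int = 5) -> float:
    if not gold:
        return 0.0
    # A question is a hit if any gold span shows up in its top-k results.
    hits = sum(
        1 for qid, relevant in gold.items()
        if relevant & set(retrieved.get(qid, [])[:k])
    )
    return hits / len(gold)


gold = {
    "q1": {("10k-2023", 41)},
    "q2": {("10k-2023", 12)},
    "q3": {("10k-2023", 57)},
}
baseline = {
    "q1": [("10k-2023", 40), ("10k-2023", 3)],
    "q2": [("10k-2023", 12)],
    "q3": [("10k-2023", 8)],
}
layout_aware = {
    "q1": [("10k-2023", 41)],
    "q2": [("10k-2023", 12)],
    "q3": [("10k-2023", 2), ("10k-2023", 57)],
}
print(f"{recall_at_k(baseline, gold):.2f}")      # 0.33
print(f"{recall_at_k(layout_aware, gold):.2f}")  # 1.00
```

Run it per question type (table-grounded, section-anchored, free-text) rather than as one blended number, or the table failures hide behind the free-text successes.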
The single biggest mistake at this stage is treating ingestion as a one-time decision. It's a versioned artifact. Tag every chunk with the pipeline version that produced it. When you change Layer 3 because your table extractor improved, you want to re-ingest only the table chunks, not the whole corpus.
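A sketch of what that tagging buys you, assuming one version string per layer stored on each chunk. The version numbers, field names, and `stale_chunks` helper are all hypothetical.

```python
# Hypothetical: one version per block type, bumped when that layer changes.
CURRENT_VERSIONS = {
    "text": "1.0",
    "table": "2.1",
    "table_row": "2.1",
    "figure_caption": "1.3",
}


def stale_chunks(chunks: list[dict], current: dict[str, str] = CURRENT_VERSIONS) -> list[dict]:
    # A chunk is stale when the layer that produced it has moved on.
    return [c for c in chunks if c["pipeline_version"] != current[c["block_type"]]]


stored = [
    {"id": "c1", "block_type": "text", "pipeline_version": "1.0"},
    {"id": "c2", "block_type": "table", "pipeline_version": "2.0"},
    {"id": "c3", "block_type": "table_row", "pipeline_version": "2.0"},
    {"id": "c4", "block_type": "figure_caption", "pipeline_version": "1.3"},
]
to_reingest = stale_chunks(stored)
print([c["id"] for c in to_reingest])  # ['c2', 'c3']
```

Only the table chunks come back, so only Layer 3 reruns. The text and the (expensive) figure captions stay put.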
What's the worst PDF your RAG pipeline has had to swallow? Multi-page tables, scanned contracts, slide decks, something stranger? Drop it in the comments. I'm curious which document type is breaking the most pipelines right now.
If this was useful
The four-layer model, the document-type matrix, and the cost-vs-recall tradeoffs in the reference pipeline above are pulled from the chunking chapter of the RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production. The book goes further on multi-modal retrieval, reranker selection per chunk type, and the eval methodology that tells you when ingestion changes actually moved the needle. If you've ever shipped RAG that worked on demo PDFs and collapsed on customer ones, the chunking chapter alone earns the book back.
