Gabriel Anhaia

Posted on May 23

PDF RAG Is Where Most Pipelines Die. Layout-Aware Chunking Is the Unlock.

#ai #rag #llm #python

Book: RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

Most "RAG didn't work for us" stories are actually "we used RecursiveCharacterTextSplitter on PDFs" stories. The fix isn't a better model. It's four layers your pipeline doesn't have.

The 40% number: where signal goes to die

Take a typical SEC 10-K filing. Two columns of body text. A footer that repeats on every page. Footnotes at the bottom. A table that spans three pages. A figure with a caption that explains half the surrounding paragraph.

Run that through PyPDF and RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200). Here's what you get back: the left column of page 3 concatenated with the right column of page 3, the page footer spliced into the middle of a sentence, the table rendered as "Q1 Q2 Q3 Q4 Revenue 12 14 11 16 Expenses 8 9 7 10" with no row alignment, and a figure caption stranded 800 tokens away from the paragraph it explains.

Then someone asks "what was Q3 revenue?" Your retriever returns the chunk with the mangled table. The model answers $11M because 11 is the third number after "Revenue" and nothing in the chunk says what unit it's in. It's actually $11B. The row that stated the units got chopped off.

A published study on a corpus of 4,000 financial filings found that naive character-chunked retrieval missed the right span 38% of the time on table-grounded questions. Layout-aware chunking dropped that to 6%. Same embedding model. Same retriever. Same reranker.

PDF isn't a content format. It's a paint format: (x, y, glyph) tuples for a printer. Treating it like prose is the bug.

The fix is four orthogonal layers. None of them are "buy a better model".

Layer 1: Reading order detection

A PDF stores glyphs in draw order, not reading order. The renderer doesn't care which column comes first. Your chunker should.

Reading order means: given a page with N text blocks, return them in the order a human would read them. Two columns, then footnotes, then page footer. Not "top-left to bottom-right by glyph coordinate".

How tools do it:

Heuristic: cluster blocks into columns by x-position, then sort top-to-bottom within each column. Works for 80% of multi-column documents. Breaks on rotated pages, sidebars, callout boxes.
ML-based: run a layout detection model (LayoutLMv3, DiT, or Docling's layout parser) that classifies every block as text, title, list, table, figure, footnote, page-header, page-footer. Then sort by semantic role.
Skip the broken paths: drop everything classified as page-header and page-footer before chunking. Those repeat on every page and pollute embeddings.

A small caveat that bites: some libraries call this "reading order" but actually return blocks sorted by (y, x). That's not reading order. That's raster order. Test on a two-column paper before trusting it.
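Here's a minimal sketch of the column-clustering heuristic next to raster order, so you can see the difference on a toy two-column page. The Block type, the gap threshold, and both function names are made up for illustration. Real pages need more care (full-width headings, sidebars, rotated text).

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float  # left edge
    y0: float  # top edge (smaller = higher on the page)
    x1: float  # right edge


def raster_order(blocks: list[Block]) -> list[Block]:
    """Sort by (y, x): what some libraries mislabel as reading order."""
    return sorted(blocks, key=lambda b: (b.y0, b.x0))


def column_order(blocks: list[Block], gap: float = 20.0) -> list[Block]:
    """Cluster blocks into columns by left edge, then read each column top-down."""
    columns: list[list[Block]] = []
    for b in sorted(blocks, key=lambda b: b.x0):
        # Same column if the left edge is within `gap` of the previous block's.
        if columns and b.x0 - columns[-1][-1].x0 <= gap:
            columns[-1].append(b)
        else:
            columns.append([b])
    ordered: list[Block] = []
    for col in columns:  # columns are already left-to-right
        ordered.extend(sorted(col, key=lambda b: b.y0))
    return ordered


page = [
    Block("L1", 50, 100, 290), Block("R1", 310, 100, 550),
    Block("L2", 50, 300, 290), Block("R2", 310, 300, 550),
]
print([b.text for b in raster_order(page)])  # ['L1', 'R1', 'L2', 'R2']
print([b.text for b in column_order(page)])  # ['L1', 'L2', 'R1', 'R2']
```

If a library hands you the first ordering on a two-column page, its "reading order" is raster order.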

Layer 2: Structural chunking by section, not token count

Once you have reading order plus block roles, stop chunking by tokens. Chunk by document structure.

The rule: a chunk is a section. A section ends at the next heading of equal-or-higher level, or at a table, or at a figure. Add the parent heading chain as metadata so retrieval gets context for free.

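from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    section_path: list[str]  # heading chain, e.g. ["3. Risk Factors", "3.2 Liquidity"]
    page_range: tuple[int, int]  # (first_page, last_page), inclusive
    block_type: str  # "text" | "table" | "table_row" | "figure_caption"
    source_bbox: list[tuple]  # one (x0, y0, x1, y1) per source block, for citation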

Why this beats fixed-size chunks:

A query about Section 3.2 Liquidity Risk now hits a chunk that actually is that section, not "the last 700 tokens before we ran out of budget".
The section_path becomes searchable metadata. You can pre-filter by section before vector search on long-form documents.
You stop splitting sentences mid-clause. Token boundaries are arbitrary. Section boundaries aren't.
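The pre-filter point is easy to sketch. This is a toy in-memory version, assuming chunks already carry an embedding. IndexedChunk, cosine, and search are illustrative names, not any vector store's API. Most real stores expose the same idea as a metadata filter.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    text: str
    section_path: list[str]
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks, query_vec, section_prefix=None, k=5):
    """Keep only chunks under `section_prefix`, then rank by cosine similarity."""
    def in_section(c: IndexedChunk) -> bool:
        if section_prefix is None:
            return True
        return c.section_path[: len(section_prefix)] == section_prefix

    candidates = [c for c in chunks if in_section(c)]
    candidates.sort(key=lambda c: cosine(c.embedding, query_vec), reverse=True)
    return candidates[:k]


index = [
    IndexedChunk("Liquidity risk text", ["3. Risk Factors", "3.2 Liquidity"], [0.9, 0.1]),
    IndexedChunk("Market risk text", ["3. Risk Factors", "3.1 Market"], [0.8, 0.2]),
    IndexedChunk("Executive pay text", ["5. Compensation"], [1.0, 0.0]),
]
hits = search(index, [1.0, 0.0], section_prefix=["3. Risk Factors"], k=2)
print([h.text for h in hits])  # ['Liquidity risk text', 'Market risk text']
```

Without the prefix, the compensation chunk wins on raw similarity. With it, the search never sees that chunk.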

Sections that exceed your context budget still need to be split. Do it at paragraph boundaries, not character offsets, and keep the section path on every sub-chunk so they cluster together at retrieval time.
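A rough sketch of that split, assuming paragraphs are separated by blank lines and using word count as a stand-in for tokens. split_section and max_words are hypothetical names. Swap in your tokenizer for a real budget.

```python
def split_section(text: str, section_path: list[str], max_words: int = 400) -> list[dict]:
    """Pack whole paragraphs into sub-chunks; never cut inside a paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    pieces: list[str] = []
    current: list[str] = []
    count = 0
    for p in paragraphs:
        n = len(p.split())
        # Close the current piece if this paragraph would push it over budget.
        if current and count + n > max_words:
            pieces.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += n
    if current:
        pieces.append("\n\n".join(current))
    # Every sub-chunk keeps the full section path so they cluster at retrieval time.
    return [
        {"text": t, "section_path": list(section_path), "part": i}
        for i, t in enumerate(pieces)
    ]
```

A single paragraph longer than the budget stays whole here. That's a deliberate choice for the sketch: handle it at embedding time as a guard rather than cutting mid-sentence.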

One gotcha: when a section is shorter than your minimum chunk size (a 30-word footnote, for example), don't drop it. Don't merge it with an unrelated neighbour either. Keep it as a small chunk. Tiny chunks with strong relevance signal beat oversized chunks every time.

Layer 3: Table extraction as separate documents

Tables are the single biggest reason naive PDF RAG fails. They look like prose to a character chunker. They aren't.

Extract every table as a separate document with two representations:

Markdown rendering of the full table, kept as one chunk. Good for "summarize this table" questions.
Row-level documents, one per row, with the table caption and column headers prepended. Good for "what was Q3 revenue?" questions.

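# from a Docling extraction (method names follow the 2.x API; check your version)
# assumes doc, doc_id, and chunks come from the surrounding pipeline
for table in doc.tables:
    caption = table.caption_text(doc)
    pages = [p.page_no for p in table.prov]
    span = (min(pages), max(pages))
    boxes = [p.bbox.as_tuple() for p in table.prov]
    # hypothetical helper: Docling tables don't carry their heading chain
    path = section_path_of(table)
    md = table.export_to_markdown(doc=doc)
    chunks.append(Chunk(
        doc_id=doc_id,
        text=f"Table: {caption}\n\n{md}",
        section_path=path,
        page_range=span,
        block_type="table",
        source_bbox=boxes,
    ))
    # row-level: each row carries header context
    for _, row in table.export_to_dataframe(doc=doc).iterrows():
        row_text = "\n".join(f"{h}: {v}" for h, v in row.items())
        chunks.append(Chunk(
            doc_id=doc_id,
            text=f"Table: {caption}\n{row_text}",
            section_path=path,
            page_range=span,
            block_type="table_row",
            source_bbox=boxes,
        ))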

The duplication is fine. Both representations help. The retriever picks whichever the query wants.

Two things that always go wrong:

Multi-page tables lose their headers on page 2+. Detect "table continued" by checking if a page starts mid-table (no preceding heading, matching column structure to the prior page) and propagate the headers forward.
Merged cells wreck row alignment. Docling and LlamaParse handle this reasonably. PyMuPDF's table extraction does not; it'll silently shift cells one column over.

Layer 4: Image OCR and figure captioning

Two cases. Native PDFs where text is selectable: skip OCR, you have the text already. Scanned PDFs where pages are images: every page needs OCR.

For figures inside native PDFs (charts, diagrams, screenshots), OCR alone isn't enough. The text in a bar chart's axis labels doesn't explain what the chart shows. Use a VLM (Claude, GPT-4o, Qwen-VL) to generate a caption from the image, then store the caption as a chunk linked to the figure's bbox.

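def caption_figure(image_bytes: bytes, surrounding: str) -> str:
    # `vlm` stands in for your own client wrapper; `describe` is
    # illustrative, not a real SDK method.
    # surrounding = the paragraph immediately before/after
    # the figure. Gives the VLM context.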
    return vlm.describe(
        image_bytes,
        prompt=(
            "Describe this figure in 2-3 sentences. "
            "Include any visible numbers, axis labels, "
            "trend direction. Use surrounding text for "
            f"context:\n\n{surrounding}"
        ),
    )

A caption like "Bar chart showing Q3 revenue by region: Americas $12B, EMEA $7B, APAC $4B. Americas is up 18% year-over-year" is retrievable. The original PNG isn't.

Cost discipline matters here. Captioning every figure with GPT-4o on a 4,000-document corpus runs into the four-figure dollar range. Run captions once at ingestion. Cache. Don't re-caption on re-index unless the figure changed.
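One way to get "caption once, cache, skip unchanged figures" is to key the cache on a hash of the image bytes. A minimal sketch, assuming a single JSON file is enough for your corpus. CaptionCache and its method are hypothetical, and make_caption is whatever function calls your VLM.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


class CaptionCache:
    """Content-addressed caption store: same image bytes, same caption, one paid call."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self._captions: dict[str, str] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def get_or_create(self, image_bytes: bytes, make_caption: Callable[[bytes], str]) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._captions:
            self._captions[key] = make_caption(image_bytes)  # the only paid call
            self.path.write_text(json.dumps(self._captions))
        return self._captions[key]
```

If you change the captioning prompt or model, fold a version string into the key. Otherwise the cache keeps serving captions from the old setup.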

Pick the right layers for the document type

You don't need all four for every document. The layers are orthogonal, but documents have different failure modes.

Document type	Reading order	Structural	Tables	OCR / captions
Legal contracts (born-digital)	required	required	optional	skip
Scientific papers (2-column)	required	required	required	required (figures)
SEC filings (10-K, 10-Q)	required	required	required	optional
Scanned contracts / historical	required (post-OCR)	partial	partial	required (whole)
Slide decks exported to PDF	partial	skip	partial	required (figures)
Internal wikis exported as PDF	optional	required	optional	skip
Invoices and receipts	partial	skip	required	required if scanned

A few honest observations from running this matrix in production:

Contracts are 90% structural. Sections, sub-sections, definitions. Get clauses as discrete chunks with the full clause path as metadata and your recall jumps immediately.
Papers are 50% table-and-figure value. If you skip Layer 3 and Layer 4 on a corpus of arXiv PDFs, you're throwing away most of the citable content.
Slide decks are weird. Visual structure is the document. A title-and-three-bullets slide is one chunk. A diagram slide is one VLM caption. Forget reading order.

Library map: what does what in 2026

Library	Reading order	Structural	Tables	OCR / VLM	Cost / speed	Notes
Docling (IBM, OSS)	Excellent	Excellent	Excellent	Built-in OCR, external VLM	Free, GPU helps	Best general-purpose OSS option. Outputs structured JSON with section tree and table rows. Default pick.
unstructured.io	Good	Good	Good for simple tables	OCR via Tesseract/PaddleOCR	Free OSS, paid API	Mature. Strong on heterogeneous corpora. Table quality lags Docling.
LlamaParse	Excellent	Excellent	Best-in-class	Built-in VLM captioning	Paid, ~$3/1k pages	If tables are the bottleneck and you can pay per page, this is the cleanest output. Vendor lock-in.
PyMuPDF	Manual	Manual	Manual	None native	Free, very fast	Low-level. Great as a parser primitive. Don't ship raw `page.get_text()` to RAG.
PyPDF / pdfplumber	Manual	Manual	pdfplumber: decent	None	Free, fast	Where most projects start. Where most projects get stuck.
Marker (OSS)	Excellent	Excellent	Good	Built-in OCR + math	Free, GPU needed	Strong on academic PDFs and math. Slower than Docling on CPU.
AWS Textract / Azure DI	Good	Partial	Excellent	Built-in OCR	Paid per page	Compliance-friendly. Black-box layout decisions. Good for scanned forms.

Honest scoring: as of mid-2026, Docling is the default. LlamaParse wins on raw table quality if budget allows. PyMuPDF is a primitive, not a pipeline.

A reference pipeline in about 100 lines

This is the shape every layout-aware ingestion pipeline ends up taking. It uses Docling for parsing and a VLM call for figure captions. These lines aren't the whole production system (no retry, no batching, no cost accounting) but the structure is what matters.

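import base64
import io
from dataclasses import dataclass
from typing import Iterator

from anthropic import Anthropic
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import DocItemLabel, PictureItem, TableItem

# Docling names below follow the 2.x API; check them against your installed version.
vlm = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
FURNITURE = {DocItemLabel.PAGE_HEADER, DocItemLabel.PAGE_FOOTER}
HEADINGS = {DocItemLabel.TITLE, DocItemLabel.SECTION_HEADER}

@dataclass
class Chunk:
    doc_id: str
    text: str
    section_path: list[str]
    page_range: tuple[int, int]
    block_type: str
    bbox: list[tuple] | None

def ingest(path: str, doc_id: str) -> Iterator[Chunk]:
    # picture crops are off by default; figure captioning needs them
    opts = PdfPipelineOptions()
    opts.generate_picture_images = True
    converter = DocumentConverter(format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=opts),
    })
    doc = converter.convert(path).document

    section_stack: list[str] = []
    buffer: list[str] = []
    buffer_pages: set[int] = set()

    def flush() -> Iterator[Chunk]:
        # emit buffered body text as one chunk under the current section
        if not buffer:
            return
        chunk = Chunk(
            doc_id=doc_id,
            text="\n\n".join(buffer),
            section_path=list(section_stack),
            page_range=(min(buffer_pages, default=0), max(buffer_pages, default=0)),
            block_type="text",
            bbox=None,
        )
        buffer.clear()
        buffer_pages.clear()
        yield chunk

    # iterate_items() yields (item, depth) pairs in reading order
    for item, _depth in doc.iterate_items():
        if item.label in FURNITURE:
            continue  # repeated headers/footers pollute embeddings
        pages = [p.page_no for p in item.prov]
        span = (min(pages), max(pages)) if pages else (0, 0)
        common = dict(doc_id=doc_id, page_range=span,
                      bbox=[p.bbox.as_tuple() for p in item.prov])

        if item.label in HEADINGS:
            yield from flush()
            level = getattr(item, "level", 1)  # titles carry no level
            section_stack = section_stack[: level - 1]
            section_stack.append(item.text)
        elif isinstance(item, TableItem):
            yield from flush()
            caption = item.caption_text(doc)
            md = item.export_to_markdown(doc=doc)
            yield Chunk(text=f"Table: {caption}\n\n{md}", block_type="table",
                        section_path=list(section_stack), **common)
            # row-level chunks: each row carries its column headers
            for _, row in item.export_to_dataframe(doc=doc).iterrows():
                row_text = "\n".join(f"{h}: {v}" for h, v in row.items())
                yield Chunk(text=f"Table: {caption}\n{row_text}", block_type="table_row",
                            section_path=list(section_stack), **common)
        elif isinstance(item, PictureItem):
            surrounding = buffer[-1] if buffer else ""  # paragraph just before the figure
            yield from flush()
            image = item.get_image(doc)  # PIL image, or None if no crop was generated
            if image is None:
                continue
            png = io.BytesIO()
            image.save(png, format="PNG")
            caption = caption_figure(png.getvalue(), surrounding)
            yield Chunk(text=f"Figure: {caption}", block_type="figure_caption",
                        section_path=list(section_stack), **common)
        else:
            text = getattr(item, "text", "")
            if text:
                buffer.append(text)
                buffer_pages.update(pages)

    yield from flush()

def caption_figure(image: bytes, surrounding: str) -> str:
    msg = vlm.messages.create(
        model="claude-opus-4-7",
        max_tokens=200,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {
                "type": "base64", "media_type": "image/png",
                # the API expects a base64 string, not raw bytes
                "data": base64.standard_b64encode(image).decode("ascii"),
            }},
            {"type": "text", "text": (
                "Describe this figure in 2-3 sentences. "
                "Include visible numbers, labels, trends. "
                f"Context:\n{surrounding[:600]}"
            )},
        ]}],
    )
    return msg.content[0].text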

Notice what's missing: no chunk_size. No chunk_overlap. Token budgets are a downstream concern handled at embedding time, and even then, only as a guard. The pipeline produces chunks by document structure first. If a section is too big, split it at paragraph boundaries inside this same loop. Don't outsource that to a generic character splitter.

Eval before and after

Numbers from a swap-out reported on a 4,000-document financial-filings corpus, comparing PyPDF + RecursiveCharacterTextSplitter(1000, 200) against the pipeline above (Docling plus section chunking plus table rows plus figure captions):

Top-5 recall on table-grounded questions: 62% to 94%.
Top-5 recall on section-anchored questions: 71% to 89%.
Top-5 recall on free-text questions: 84% to 87% (small gain; character chunking already does fine here).
Ingestion cost per document: ~3x higher. Storage cost: ~1.4x higher (more chunks).
Answer correctness on a 200-question eval set: 68% to 91%.

The cost numbers matter. Layout-aware ingestion isn't free. You pay for it once, at ingestion. You collect on every query for the life of the corpus. For a corpus the user queries often, the math is obvious. For an archive nobody touches, it's wasted work.

Run the eval on your own corpus before deciding which layers you need. The matrix above is a starting point, not a verdict. A contract corpus might not need Layer 4 at all. A research paper corpus lives and dies on Layer 4. Measure.
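A small harness is enough to get the per-question-type numbers. This sketch assumes each eval question carries a type label and a gold answer span, and counts a hit when any top-k chunk contains that span. That keeps the eval independent of chunk IDs, which change every time you change the chunker. The field names and recall_at_k are assumptions, not a standard.

```python
from collections import defaultdict
from typing import Callable


def recall_at_k(
    questions: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Top-k recall per question type; `retrieve` returns ranked chunk texts."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for q in questions:
        totals[q["type"]] += 1
        top_k = retrieve(q["question"])[:k]
        if any(q["answer_span"] in chunk_text for chunk_text in top_k):
            hits[q["type"]] += 1
    return {t: hits[t] / totals[t] for t in totals}
```

Run it once against the old pipeline's retriever and once against the new one, with the same question set. Substring matching is crude (it misses reformatted numbers), so spot-check the misses by hand.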

The single biggest mistake at this stage is treating ingestion as a one-time decision. It's a versioned artifact. Tag every chunk with the pipeline version that produced it. When you change Layer 3 because your table extractor improved, you want to re-ingest only the table chunks, not the whole corpus.
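In practice that can be as simple as a version string per layer, stamped on every chunk at ingestion. A sketch under those assumptions: LAYER_OF, StoredChunk, and the version strings are all made up for illustration.

```python
from dataclasses import dataclass

# Which layer produced each block type (hypothetical mapping).
LAYER_OF = {
    "text": "structural",
    "table": "tables",
    "table_row": "tables",
    "figure_caption": "captions",
}


@dataclass
class StoredChunk:
    chunk_id: str
    block_type: str
    layer_version: str  # version of the layer that produced this chunk


def stale_chunks(chunks: list[StoredChunk], current: dict[str, str]) -> list[StoredChunk]:
    """Chunks whose producing layer has moved on since they were ingested."""
    return [c for c in chunks if c.layer_version != current[LAYER_OF[c.block_type]]]


store = [
    StoredChunk("c1", "text", "v3"),
    StoredChunk("c2", "table", "v1"),
    StoredChunk("c3", "table_row", "v1"),
    StoredChunk("c4", "figure_caption", "v1"),
]
current = {"structural": "v3", "tables": "v2", "captions": "v1"}
print([c.chunk_id for c in stale_chunks(store, current)])  # ['c2', 'c3']
```

Bumping the tables layer marks only table and table-row chunks for re-ingestion. Text and caption chunks stay put.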

What's the worst PDF your RAG pipeline has had to swallow? Multi-page tables, scanned contracts, slide decks, something stranger? Drop it in the comments. I'm curious which document type is breaking the most pipelines right now.

If this was useful

The four-layer model, the document-type matrix, and the cost-vs-recall tradeoffs in the reference pipeline above are pulled from the chunking chapter of the RAG Pocket Guide: Retrieval, Chunking, and Reranking Patterns for Production. The book goes further on multi-modal retrieval, reranker selection per chunk type, and the eval methodology that tells you when ingestion changes actually moved the needle. If you've ever shipped RAG that worked on demo PDFs and collapsed on customer ones, the chunking chapter alone earns the book back.