When people hear “PDF to Markdown,” it often sounds like a simple text conversion task.
In reality, working with PDFs — especially if you care about structure — is one of the trickiest parsing problems any developer tool can encounter.
I ran into this repeatedly in documentation and LLM workflows, so I built a tool to tackle it. In this post, I’ll dig into why this problem is hard, what usually goes wrong, and how a structure-aware pipeline can make Markdown outputs much more usable.
PDFs Are Not Semantic Documents — They’re Drawing Instructions
A PDF file does not encode paragraphs, headers, or tables as high-level concepts the way HTML or Markdown does.
Instead it contains:
- Instructions to draw text at specific (x, y) coordinates
- Drawing commands for images, shapes, paths
- Transform matrices
- Optional metadata
There is no “paragraph” object in the format. All structure must be inferred from:
- Geometric proximity
- Font size and style
- Alignment and grouping
This makes the transition from PDF → Markdown fundamentally different from “text extraction.”
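To see this concretely, here is a minimal sketch that dumps a page's raw content stream with PyMuPDF so you can look at the drawing operators directly. The file path is a placeholder.

```python
# Peek at the raw drawing instructions of page 1, using PyMuPDF ("fitz").
# "example.pdf" is a placeholder path.
import fitz  # PyMuPDF

doc = fitz.open("example.pdf")
page = doc[0]

# read_contents() returns the page's content stream(s) as bytes.
# Expect positioning and text-showing operators (Tm/Td/Tj/TJ),
# not paragraphs or headings.
print(page.read_contents().decode("latin-1")[:500])
```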
Two Very Different Extraction Paths
Before thinking about Markdown, we must decide which kind of PDF we’re dealing with.
Native PDFs (Text Layer Exists)
Many PDFs contain real text objects. These can be read natively:
- Extracted via PyMuPDF / pdf.js
- Include per-span positions (bboxes)
- Preserve font, glyph, and layout ordering
This is the best case for structural analysis.
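A minimal sketch of native extraction, assuming PyMuPDF is installed; the path is a placeholder. Each span comes back with its bounding box, font, and size, which is exactly the metadata later stages need.

```python
# Extract text spans with positions and fonts from a native PDF.
import fitz  # PyMuPDF

doc = fitz.open("report.pdf")  # placeholder path
for page_number, page in enumerate(doc):
    for block in page.get_text("dict")["blocks"]:
        if block["type"] != 0:      # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                print(page_number, span["bbox"], span["size"], span["font"], span["text"])
```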
Scanned PDFs (Image-Only Pages)
Some PDFs are nothing but a stack of raster images (e.g., scans):
- No text objects at all
- Everything must come from OCR
- No layout metadata remains
These fundamentally lack block information, so document structure must be reconstructed from visual cues.
Detecting which path to take is an essential first step. Treating scanned and native documents identically leads to poor outputs.
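One possible detection heuristic, sketched below with PyMuPDF: if a document exposes almost no text objects, route it to OCR. The threshold and path are illustrative, not a definitive rule.

```python
# Route a document to the native or OCR pipeline based on how much
# extractable text it contains. The threshold is an assumption.
import fitz  # PyMuPDF

def is_scanned(doc, min_chars_per_page=25):
    text_chars = sum(len(page.get_text("text").strip()) for page in doc)
    return text_chars < min_chars_per_page * doc.page_count

doc = fitz.open("input.pdf")  # placeholder path
route = "ocr" if is_scanned(doc) else "native"
print(f"Routing document through the {route} pipeline")
```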
Why Most Tools Produce Low-Quality Markdown
Here are common failure modes in existing solutions:
Flattened Text
Many PDF → Markdown tools simply dump text in whatever order the file stores it. That yields:
- Line breaks in the wrong places
- Lost paragraph boundaries
- Broken lists
- Missing semantic grouping
This may produce Markdown, but rarely Markdown that’s easy to work with.
Over-reliance on OCR
OCR is critical for scans, but applying it to native text PDFs:
- Introduces noise
- Loses formatting
- Adds unnecessary preprocessing
The correct pipeline is to detect first, then decide.
Images With No Context
Extracting images without knowing where they belong in the flow is of little use.
In Markdown, image placement matters: a raw image file without an insertion point has lost its connection to the surrounding text.
A layout-aware pipeline sorts text and image blocks together to decide the right placement.
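A sketch of collecting image blocks with their bounding boxes so they can later be interleaved with text by position; the output file naming is illustrative.

```python
# Collect image blocks and their bounding boxes for later placement.
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder path
placeholders = []
for page_number, page in enumerate(doc):
    for block in page.get_text("dict")["blocks"]:
        if block["type"] == 1:  # image block
            x0, y0, x1, y1 = block["bbox"]
            placeholders.append({
                "page": page_number,
                "bbox": (x0, y0, x1, y1),
                # Hypothetical file name for the extracted image.
                "markdown": f"![figure](page{page_number}_img{len(placeholders)}.png)",
            })
print(placeholders)
```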
A Block-Based Approach
The key realization is to treat PDFs as a set of layout blocks, each with:
- Bounding box
- Page number
- Content type (text / image / table / code)
Then, as sketched in the code below:
- Sort all blocks by ascending (page, y, x)
- Merge spans into paragraphs and paragraphs into higher-level structures
- Reconstruct lists and tables based on geometric heuristics
- Insert images where they best fit relative to text blocks
This approach doesn’t magically discover hidden semantics. But it creates Markdown that:
- Is readable
- Doesn’t require hours of cleanup
- Respects structural relationships better than flat extraction
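Here is a minimal sketch of that sort-and-merge step. The Block class, the paragraph-gap threshold, and the merge rule are assumptions chosen for illustration, not a full implementation.

```python
# Sort layout blocks into reading order and join them with paragraph breaks.
from dataclasses import dataclass

@dataclass
class Block:
    page: int
    bbox: tuple          # (x0, y0, x1, y1) in page coordinates
    kind: str            # "text", "image", "table", "code"
    content: str         # already-rendered Markdown for this block

def to_markdown(blocks, paragraph_gap=12.0):
    # Reading order: page first, then top edge (y0), then left edge (x0).
    ordered = sorted(blocks, key=lambda b: (b.page, b.bbox[1], b.bbox[0]))
    out, prev = [], None
    for b in ordered:
        # Insert a blank line when the vertical gap suggests a new paragraph.
        if prev and (b.page != prev.page or b.bbox[1] - prev.bbox[3] > paragraph_gap):
            out.append("")
        out.append(b.content)
        prev = b
    return "\n".join(out)
```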
Scanned PDFs Are Image-First
When native text blocks are absent, all blocks must be derived from visual content.
In a scanned PDF:
- Layout info is lost
- Text must come from OCR
- Blocks must be built from visual region detection
This is a fundamentally different process from native parsing, and must be treated as such.
In tools like https://pdftomarkdown.pro, scanned PDFs are automatically detected and routed to OCR-based extraction. While OCR results are inherently noisier than native text extraction, this still provides usable Markdown where naive parsing would fail.
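A sketch of the OCR path: render each page to an image and run it through Tesseract. This assumes PyMuPDF, Pillow, and pytesseract are available; the DPI and path are illustrative.

```python
# Render scanned pages to images and OCR them with Tesseract.
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("scan.pdf")  # placeholder path
zoom = 300 / 72              # render at ~300 DPI for better OCR accuracy
for page in doc:
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    print(pytesseract.image_to_string(img))
```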
Handling Complex Cases
Tables
PDFs don’t represent tables explicitly. You infer structure from:
- Column alignment
- Row proximity
- Grid lines if present
Standard Markdown tables cannot express rowspan/colspan. For complex layouts, an HTML table fallback is often preferable.
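To show what that fallback looks like, here is an illustrative snippet that emits an inferred grid with merged cells as HTML. The cell helper and the sample values are made up for the example, not a parser output format.

```python
# Emit a grid with rowspan/colspan as an HTML table instead of pipe Markdown.
def cell(text, rowspan=1, colspan=1):
    attrs = ""
    if rowspan > 1:
        attrs += f' rowspan="{rowspan}"'
    if colspan > 1:
        attrs += f' colspan="{colspan}"'
    return f"<td{attrs}>{text}</td>"

rows = [
    [cell("Region", rowspan=2), cell("Sales", colspan=2)],
    [cell("2023"), cell("2024")],
    [cell("EMEA"), cell("1.2M"), cell("1.4M")],
]
html = "<table>\n" + "\n".join(f"  <tr>{''.join(r)}</tr>" for r in rows) + "\n</table>"
print(html)
```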
Nested Lists
Bullets and indentation are visual cues only. Reconstructing nested lists requires:
- Bullet pattern detection
- Relative indentation comparison
- Grouping across lines
This is heuristic, but works reasonably well when implemented carefully.
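A heuristic sketch of the idea: detect bullet markers with a pattern, then map each line's left x-offset to a nesting depth. The bullet characters and the indent step are assumptions.

```python
# Map bullet lines and their left offsets to Markdown nesting levels.
import re

BULLET = re.compile(r"^\s*([-*\u2022\u25E6\u25AA]|\d+[.)])\s+")

def list_depth(x_offset, base_x, indent_step=18.0):
    # Treat each ~18pt of extra left offset as one nesting level.
    return max(0, round((x_offset - base_x) / indent_step))

def to_markdown_item(text, x_offset, base_x):
    match = BULLET.match(text)
    if not match:
        return text
    content = text[match.end():]
    return "  " * list_depth(x_offset, base_x) + "- " + content

print(to_markdown_item("• nested point", x_offset=90.0, base_x=72.0))
```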
Code Blocks
Code is often recognizable by:
- Monospaced fonts
- Consistent vertical spacing
- Absence of list/table markers
Distinguishing them accurately improves readability of outputs for technical docs.
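A small sketch of the font-based signal, assuming blocks in the PyMuPDF `get_text("dict")` shape used earlier; the font-name hints and the 0.8 threshold are heuristics, not fixed rules.

```python
# Flag a text block as code when most of its spans use a monospaced font.
MONO_HINTS = ("mono", "courier", "consolas", "menlo", "code")

def looks_like_code(block):
    spans = [s for line in block.get("lines", []) for s in line["spans"]]
    if not spans:
        return False
    mono = sum(1 for s in spans if any(h in s["font"].lower() for h in MONO_HINTS))
    return mono / len(spans) >= 0.8
```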
What “Good Enough” Really Means
A perfect round-trip from PDF to Markdown is impossible in the strict sense:
- PDF has no semantic document model
- OCR has inherent error rates
- Layout inference is heuristic
But a “good enough” solution is one where:
- The Markdown is readable
- Structural elements aren’t mangled
- Images and tables aren’t orphaned
- Minimal manual cleanup is needed
For documentation, note-taking, or LLM workflows, this is far more important than pixel-perfect fidelity.
Final Thoughts
PDF was designed for printing and visual fidelity, not semantic reuse.
Converting it to Markdown is inherently a translation problem — from geometry to structure.
A structure-aware pipeline makes this translation far more reliable than naive extraction, and handling both native and scanned PDFs robustly is essential for real-world use.
If you’d like to see a practical implementation of these ideas in action, check out https://pdftomarkdown.pro/.
Feedback and edge-case examples are always welcome.