When people hear “PDF to Markdown,” it often sounds like a simple text conversion task.
In reality, working with PDFs — especially if you care about structure — is one of the trickiest parsing problems any developer tool can encounter.
I ran into this repeatedly in documentation and LLM workflows, so I built a tool to tackle it. In this post, I’ll dig into why this problem is hard, what usually goes wrong, and how a structure-aware pipeline can make Markdown outputs much more usable.
PDFs Are Not Semantic Documents — They’re Drawing Instructions
A PDF file does not encode paragraphs, headers, or tables as high-level concepts the way HTML or Markdown does.
Instead it contains:
- Instructions to draw text at specific (x, y) coordinates
- Drawing commands for images, shapes, paths
- Transform matrices
- Optional metadata
There is no “paragraph” object in the format. All structure must be inferred from:
- Geometric proximity
- Font size and style
- Alignment and grouping
This makes the transition from PDF → Markdown fundamentally different from “text extraction.”
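To see this concretely, here is a minimal sketch that dumps a page's raw content stream with PyMuPDF so you can look at the drawing operators directly. The file path is a placeholder.

```python
# Peek at the raw drawing instructions of page 1, using PyMuPDF ("fitz").
# "example.pdf" is a placeholder path.
import fitz  # PyMuPDF

doc = fitz.open("example.pdf")
page = doc[0]

# read_contents() returns the page's content stream(s) as bytes.
# Expect positioning and text-showing operators (Tm/Td/Tj/TJ),
# not paragraphs or headings.
print(page.read_contents().decode("latin-1")[:500])
```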
Two Very Different Extraction Paths
Before thinking about Markdown, we must decide which kind of PDF we’re dealing with.
Native PDFs (Text Layer Exists)
Many PDFs contain real text objects. These can be read natively:
- Extracted via PyMuPDF / pdf.js
- Include per-span positions (bboxes)
- Preserve font, glyph, and layout ordering
This is the best case for structural analysis.
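A minimal sketch of native extraction, assuming PyMuPDF is installed; the path is a placeholder. Each span comes back with its bounding box, font, and size, which is exactly the metadata later stages need.

```python
# Extract text spans with positions and fonts from a native PDF.
import fitz  # PyMuPDF

doc = fitz.open("report.pdf")  # placeholder path
for page_number, page in enumerate(doc):
    for block in page.get_text("dict")["blocks"]:
        if block["type"] != 0:      # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                print(page_number, span["bbox"], span["size"], span["font"], span["text"])
```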
Scanned PDFs (Image-Only Pages)
Some PDFs are nothing but a stack of raster images (e.g., scans):
- No text objects at all
- Everything must come from OCR
- No layout metadata remains
These fundamentally lack block information, so document structure must be reconstructed from visual cues.
Detecting which path to take is an essential first step. Treating scanned and native documents identically leads to poor outputs.
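One possible detection heuristic, sketched below with PyMuPDF: if a document exposes almost no text objects, route it to OCR. The threshold and path are illustrative, not a definitive rule.

```python
# Route a document to the native or OCR pipeline based on how much
# extractable text it contains. The threshold is an assumption.
import fitz  # PyMuPDF

def is_scanned(doc, min_chars_per_page=25):
    text_chars = sum(len(page.get_text("text").strip()) for page in doc)
    return text_chars < min_chars_per_page * doc.page_count

doc = fitz.open("input.pdf")  # placeholder path
route = "ocr" if is_scanned(doc) else "native"
print(f"Routing document through the {route} pipeline")
```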
Why Most Tools Produce Low-Quality Markdown
Here are common failure modes in existing solutions:
Flattened Text
Many PDF → Markdown tools simply dump text in whatever order the file stores it. That yields:
- Line breaks in the wrong places
- Lost paragraph boundaries
- Broken lists
- Missing semantic grouping
This may produce Markdown, but rarely Markdown that’s easy to work with.
Over-reliance on OCR
OCR is critical for scans, but applying it to native text PDFs:
- Introduces noise
- Loses formatting
- Adds unnecessary preprocessing
The correct pipeline is to detect first, then decide.
Images With No Context
Extracting images without knowing where they belong in the flow is of little use.
In Markdown, image placement matters: a raw image file without an insertion point has lost its connection to the surrounding text.
A layout-aware pipeline sorts text and image blocks together to decide the right placement.
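A sketch of collecting image blocks with their bounding boxes so they can later be interleaved with text by position; the output file naming is illustrative.

```python
# Collect image blocks and their bounding boxes for later placement.
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder path
placeholders = []
for page_number, page in enumerate(doc):
    for block in page.get_text("dict")["blocks"]:
        if block["type"] == 1:  # image block
            x0, y0, x1, y1 = block["bbox"]
            placeholders.append({
                "page": page_number,
                "bbox": (x0, y0, x1, y1),
                # Hypothetical file name for the extracted image.
                "markdown": f"![figure](page{page_number}_img{len(placeholders)}.png)",
            })
print(placeholders)
```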
A Block-Based Approach
The key realization is to treat PDFs as a set of layout blocks, each with:
- Bounding box
- Page number
- Content type (text / image / table / code)
Then, as sketched in the code below:
- Sort all blocks by ascending (page, y, x)
- Merge spans into paragraphs and paragraphs into higher-level structures
- Reconstruct lists and tables based on geometric heuristics
- Insert images where they best fit relative to text blocks
This approach doesn’t magically discover hidden semantics. But it creates Markdown that:
- Is readable
- Doesn’t require hours of cleanup
- Respects structural relationships better than flat extraction
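Here is a minimal sketch of that sort-and-merge step. The Block class, the paragraph-gap threshold, and the merge rule are assumptions chosen for illustration, not a full implementation.

```python
# Sort layout blocks into reading order and join them with paragraph breaks.
from dataclasses import dataclass

@dataclass
class Block:
    page: int
    bbox: tuple          # (x0, y0, x1, y1) in page coordinates
    kind: str            # "text", "image", "table", "code"
    content: str         # already-rendered Markdown for this block

def to_markdown(blocks, paragraph_gap=12.0):
    # Reading order: page first, then top edge (y0), then left edge (x0).
    ordered = sorted(blocks, key=lambda b: (b.page, b.bbox[1], b.bbox[0]))
    out, prev = [], None
    for b in ordered:
        # Insert a blank line when the vertical gap suggests a new paragraph.
        if prev and (b.page != prev.page or b.bbox[1] - prev.bbox[3] > paragraph_gap):
            out.append("")
        out.append(b.content)
        prev = b
    return "\n".join(out)
```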
Scanned PDFs Are Image-First
When native text blocks are absent, all blocks must be derived from visual content.
In a scanned PDF:
- Layout info is lost
- Text must come from OCR
- Blocks must be built from visual region detection
This is a fundamentally different process from native parsing, and must be treated as such.
In tools like https://pdftomarkdown.pro, scanned PDFs are automatically detected and routed to OCR-based extraction. While OCR results are inherently noisier than native text extraction, this still provides usable Markdown where naive parsing would fail.
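A sketch of the OCR path: render each page to an image and run it through Tesseract. This assumes PyMuPDF, Pillow, and pytesseract are available; the DPI and path are illustrative.

```python
# Render scanned pages to images and OCR them with Tesseract.
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("scan.pdf")  # placeholder path
zoom = 300 / 72              # render at ~300 DPI for better OCR accuracy
for page in doc:
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    print(pytesseract.image_to_string(img))
```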
Handling Complex Cases
Tables
PDFs don’t represent tables explicitly. You infer structure from:
- Column alignment
- Row proximity
- Grid lines if present
Standard Markdown tables cannot express rowspan/colspan. For complex layouts, an HTML table fallback is often preferable.
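To show what that fallback looks like, here is an illustrative snippet that emits an inferred grid with merged cells as HTML. The cell helper and the sample values are made up for the example, not a parser output format.

```python
# Emit a grid with rowspan/colspan as an HTML table instead of pipe Markdown.
def cell(text, rowspan=1, colspan=1):
    attrs = ""
    if rowspan > 1:
        attrs += f' rowspan="{rowspan}"'
    if colspan > 1:
        attrs += f' colspan="{colspan}"'
    return f"<td{attrs}>{text}</td>"

rows = [
    [cell("Region", rowspan=2), cell("Sales", colspan=2)],
    [cell("2023"), cell("2024")],
    [cell("EMEA"), cell("1.2M"), cell("1.4M")],
]
html = "<table>\n" + "\n".join(f"  <tr>{''.join(r)}</tr>" for r in rows) + "\n</table>"
print(html)
```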
Nested Lists
Bullets and indentation are visual cues only. Reconstructing nested lists requires:
- Bullet pattern detection
- Relative indentation comparison
- Grouping across lines
This is heuristic, but works reasonably well when implemented carefully.
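A heuristic sketch of the idea: detect bullet markers with a pattern, then map each line's left x-offset to a nesting depth. The bullet characters and the indent step are assumptions.

```python
# Map bullet lines and their left offsets to Markdown nesting levels.
import re

BULLET = re.compile(r"^\s*([-*\u2022\u25E6\u25AA]|\d+[.)])\s+")

def list_depth(x_offset, base_x, indent_step=18.0):
    # Treat each ~18pt of extra left offset as one nesting level.
    return max(0, round((x_offset - base_x) / indent_step))

def to_markdown_item(text, x_offset, base_x):
    match = BULLET.match(text)
    if not match:
        return text
    content = text[match.end():]
    return "  " * list_depth(x_offset, base_x) + "- " + content

print(to_markdown_item("• nested point", x_offset=90.0, base_x=72.0))
```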
Code Blocks
Code is often recognizable by:
- Monospaced fonts
- Consistent vertical spacing
- Absence of list/table markers
Distinguishing them accurately improves readability of outputs for technical docs.
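A small sketch of the font-based signal, assuming blocks in the PyMuPDF `get_text("dict")` shape used earlier; the font-name hints and the 0.8 threshold are heuristics, not fixed rules.

```python
# Flag a text block as code when most of its spans use a monospaced font.
MONO_HINTS = ("mono", "courier", "consolas", "menlo", "code")

def looks_like_code(block):
    spans = [s for line in block.get("lines", []) for s in line["spans"]]
    if not spans:
        return False
    mono = sum(1 for s in spans if any(h in s["font"].lower() for h in MONO_HINTS))
    return mono / len(spans) >= 0.8
```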
What “Good Enough” Really Means
A perfect round-trip from PDF to Markdown is impossible in the strict sense:
- PDF has no semantic document model
- OCR has inherent error rates
- Layout inference is heuristic
But a “good enough” solution is one where:
- The Markdown is readable
- Structural elements aren’t mangled
- Images and tables aren’t orphaned
- Minimal manual cleanup is needed
For documentation, note-taking, or LLM workflows, this is far more important than pixel-perfect fidelity.
Final Thoughts
PDF was designed for printing and visual fidelity, not semantic reuse.
Converting it to Markdown is inherently a translation problem — from geometry to structure.
A structure-aware pipeline makes this translation far more reliable than naive extraction, and handling both native and scanned PDFs robustly is essential for real-world use.
If you’d like to see a practical implementation of these ideas in action, check out https://pdftomarkdown.pro/.
Feedback and edge-case examples are always welcome.