What actually breaks when you turn PDFs into Markdown

Dmitry Petrakov — Mon, 29 Jun 2026 12:05:08 +0000

"Convert a PDF to Markdown" sounds like a solved problem. Take the text out, turn headings into #, turn tables into pipes, done.

After building a converter for it, I have a less satisfying answer: the easy cases are easy, and the hard cases are not edge cases. They are the documents people actually care about – research papers, annual reports, invoices, scanned contracts, specs, and the table-heavy PDFs someone wants to feed into an LLM.

Disclosure: I build pdf2md.dev, so I have skin in this game. This is not a benchmark claiming "we're the best." It is a breakdown of the failure modes I had to handle and the trade-offs I made – written so it's useful even if you never touch my tool and just want to evaluate your own.

A PDF is not a document

The core problem is that a PDF does not usually contain "a document" the way Markdown, HTML, or DOCX does.

It contains drawing instructions: put these glyphs at these coordinates, draw this line here, place this image there. The structure you see as a reader is reconstructed by your eyes:

that larger bold text is probably a heading
those aligned numbers are probably a table
that block on the left should be read before the block on the right
that superscript belongs to a formula
that scanned page has no text layer at all

A converter has to rebuild all of that from layout, geometry, OCR, and heuristics. If it only "extracts text," it will work on the demo PDF and fall apart on the first real report.
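
To make "drawing instructions" concrete, here is a minimal sketch of the raw material a converter starts from and one heuristic applied to it. The `Span` record and the 1.2 size ratio are assumptions made up for illustration; they are not how pdf2md.dev or any particular library models a page.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Span:
    """Hypothetical positioned text run: text plus geometry, no structure."""
    text: str
    x: float
    y: float
    font_size: float
    bold: bool = False


def guess_headings(spans, ratio=1.2):
    """Flag spans noticeably larger than the median font size as headings."""
    body_size = median(s.font_size for s in spans)
    return [s.text for s in spans if s.font_size >= body_size * ratio]


spans = [
    Span("Results", 72, 700, 16, bold=True),
    Span("Revenue grew in both quarters.", 72, 680, 10),
    Span("Growth was driven by new customers.", 72, 666, 10),
    Span("Margins stayed flat.", 72, 652, 10),
]
print(guess_headings(spans))
```

A rule this simple breaks on pull quotes, large table captions, and documents whose headings are bold but not larger, which is the point: every piece of structure is a guess from geometry.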

Here are the main things that break, roughly ordered by how often they bite.

1. Tables are not one problem

Simple tables are fine. If the PDF has a clean grid and each cell maps to one row and one column, Markdown is a good target:

| Quarter | Revenue | Growth |
|---------|---------|--------|
| Q1      | $1.2M   | +8%    |
| Q2      | $1.4M   | +17%   |

Renders to:

Quarter	Revenue	Growth
Q1	$1.2M	+8%
Q2	$1.4M	+17%

The trouble starts when the table stops being a simple grid.

Merged cells do not map cleanly to GitHub-flavored Markdown. Nested headers have a hierarchy Markdown tables cannot represent. Rotated tables add a reading-order problem before you even get to cell detection. Borderless tables are worst of all, because the grid exists only as alignment.

This is where converters quietly get dishonest. They return a table that looks tidy but has shifted columns, duplicated headers, or numbers attached to the wrong labels. That is more dangerous than an obvious failure – especially if the output flows into an LLM or a RAG index, where nobody re-reads it.

My rule for this class of problem: preserve as much structure as the target format can honestly express, and don't pretend Markdown can encode everything a PDF table visually implies. Straight grids come out ready to use. Complex financial or scientific tables may still need a visual check. That is less magical, but it is the difference between saving time and silently corrupting data.
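
One way to express that rule in code is to emit a Markdown table only when the detected grid is a plain rectangle, and to return nothing otherwise so the caller can flag the table for review. This is a sketch under assumptions: the `(text, colspan, rowspan)` cell shape is hypothetical and stands in for whatever a table detector produces.

```python
def to_markdown_table(rows):
    """Return a Markdown table, or None if the grid is not a plain rectangle.

    `rows` is a list of rows; each cell is a (text, colspan, rowspan) tuple.
    """
    if not rows or not rows[0]:
        return None
    width = len(rows[0])
    for row in rows:
        if len(row) != width:
            return None  # ragged grid: columns would silently shift
        if any(colspan != 1 or rowspan != 1 for _, colspan, rowspan in row):
            return None  # merged cell: Markdown cannot express it

    def line(cells):
        # Escape literal pipes so cell text cannot create extra columns.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    header = [text for text, _, _ in rows[0]]
    out = [line(header), line(["---"] * width)]
    out += [line([text for text, _, _ in row]) for row in rows[1:]]
    return "\n".join(out)


simple = [
    [("Quarter", 1, 1), ("Revenue", 1, 1)],
    [("Q1", 1, 1), ("$1.2M", 1, 1)],
]
merged = [
    [("Region", 2, 1)],
    [("Q1", 1, 1)],
]
print(to_markdown_table(simple))
print(to_markdown_table(merged))  # None: needs a fallback or a visual check
```

Returning `None` instead of a best-effort table is the design choice here: the caller has to decide what to do with a table Markdown cannot represent, rather than receiving something tidy and wrong.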

2. Reading order is layout analysis, not text extraction

Academic papers, magazines, datasheets, and many reports use two or three columns. A naive extractor reads across the page by x/y position and produces nonsense:

First line of column one first line of column two
second line of column one second line of column two

The right behavior is to detect column boundaries, read each column top-to-bottom, then move on. That requires layout analysis – the text stream alone is not enough.
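
A minimal sketch of that behavior, assuming text blocks arrive as hypothetical `(x0, x1, y, text)` tuples with `y` growing down the page. Columns are found by looking for a horizontal gap of at least `min_gap` points between blocks; the 20-point default is an arbitrary assumption.

```python
def split_columns(blocks, min_gap=20.0):
    """Group blocks into columns separated by a horizontal gap of at least min_gap."""
    columns = []
    right_edge = None
    for block in sorted(blocks, key=lambda b: b[0]):  # sweep left to right
        x0, x1 = block[0], block[1]
        if right_edge is None or x0 - right_edge >= min_gap:
            columns.append([])  # gap found: start a new column
            right_edge = x1
        else:
            right_edge = max(right_edge, x1)
        columns[-1].append(block)
    return columns


def reading_order(blocks, min_gap=20.0):
    """Read each column top to bottom, then move to the next column."""
    ordered = []
    for column in split_columns(blocks, min_gap):
        ordered.extend(b[3] for b in sorted(column, key=lambda b: b[2]))
    return ordered


blocks = [
    (50, 290, 100, "First line of column one"),
    (310, 550, 100, "first line of column two"),
    (50, 290, 114, "second line of column one"),
    (310, 550, 114, "second line of column two"),
]
for text in reading_order(blocks):
    print(text)
```

The sketch deliberately ignores the hard part: a full-width title or figure bridges the gap and collapses everything into one column, so a real layout pass has to segment the page vertically first.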

The same problem hits sidebars, captions, footnotes, running headers, and page numbers. A human ignores a repeated header automatically; a converter has to decide whether those fragments are content, metadata, or noise. Get it wrong and you don't just produce ugly Markdown – you change the meaning.
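
Running headers and page numbers are the most mechanical of these cases. As an illustration, the sketch below treats the first and last line of each page as candidates and drops those that repeat on most pages once digits are masked. The 60% threshold and the first-and-last-line assumption are both simplifications chosen for the example.

```python
import re
from collections import Counter


def normalize(line):
    """Mask digits so 'Page 3' and 'Page 4' count as the same line."""
    return re.sub(r"\d+", "#", line.strip().lower())


def find_running_lines(pages, min_share=0.6):
    """Return normalized edge lines that repeat on at least min_share of pages."""
    counts = Counter()
    for lines in pages:
        if lines:
            # A set avoids double counting when a page has a single line.
            counts.update({normalize(lines[0]), normalize(lines[-1])})
    threshold = min_share * len(pages)
    return {text for text, n in counts.items() if n >= threshold}


def strip_running_lines(pages, min_share=0.6):
    """Remove repeated first/last lines; body lines are never touched."""
    running = find_running_lines(pages, min_share)
    cleaned = []
    for lines in pages:
        last = len(lines) - 1
        cleaned.append([
            line for i, line in enumerate(lines)
            if not (i in (0, last) and normalize(line) in running)
        ])
    return cleaned


pages = [
    ["Annual Report 2025", "Revenue grew 8%.", "Page 1"],
    ["Annual Report 2025", "Costs were flat.", "Page 2"],
    ["Annual Report 2025", "Outlook is stable.", "Page 3"],
]
print(strip_running_lines(pages))
```

Masking digits is what lets page numbers match across pages, and it is also how a heuristic like this can delete a real line such as a numbered section title that happens to open several pages.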

3. Formulas are reconstructed, not copied

Mathematical notation is a layout problem too. In a PDF, a formula is a set of glyphs placed carefully on the page: ∑, √, superscripts, subscripts, fraction bars, Greek letters, spacing. Turning that back into something usable means producing LaTeX-like text:

$$
\sum_{i=1}^{n} x_i
$$

If the converter only sees characters in approximate order, an equation becomes a line of floating symbols – useless for documentation, search, or LLM context. This is why I don't trust regex-only PDF pipelines for technical documents. They're fine for plain text; formulas need the converter to understand visual structure.
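
To show what "visual structure" means at the smallest scale, the toy below turns glyphs into LaTeX-like text using only their vertical offset from the baseline. The `(text, baseline_offset)` input and the 2-point tolerance are assumptions for the example; fractions, radicals, and nested scripts need real two-dimensional parsing that this does not attempt.

```python
def to_latex(glyphs, tolerance=2.0):
    """Convert (text, baseline_offset) glyphs to LaTeX-like text.

    A positive offset means the glyph sits above the baseline (superscript),
    a negative offset means it sits below (subscript).
    """
    out = []
    for text, offset in glyphs:
        if offset > tolerance:
            out.append("^{" + text + "}")
        elif offset < -tolerance:
            out.append("_{" + text + "}")
        else:
            out.append(text)
    return "".join(out)


glyphs = [("x", 0.0), ("2", 4.0), ("+", 0.0), ("y", 0.0), ("i", -3.0)]
print(to_latex(glyphs))
```

A text-only pipeline would see the same glyphs as `x2+yi`, which is the "floating symbols" failure in miniature.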

4. Scanned PDFs change the entire pipeline

A scanned PDF may have no embedded text at all – it's just images of pages. Now the problem is OCR, with its own failure modes:

scan quality dominates everything
skewed or low-contrast pages hurt recognition
tiny text and dense tables are slow
handwriting is not reliably recognized
OCR produces plausible-looking mistakes, which are the worst kind

For printed or typeset text, good scans convert well. A sharp 300 DPI page with high contrast is a completely different input from a crooked phone photo of a faded fax.
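
Before any of that matters, the pipeline has to decide whether a page needs OCR at all. A rough sketch of that decision, assuming the caller already knows how many characters the text layer holds and how much of the page is covered by images; the function name and both thresholds are hypothetical.

```python
def needs_ocr(char_count, image_area, page_area, min_chars=20, min_coverage=0.8):
    """Guess whether a page is a scan: almost no text layer, image covers the page."""
    if page_area <= 0:
        raise ValueError("page_area must be positive")
    coverage = image_area / page_area
    return char_count < min_chars and coverage >= min_coverage


print(needs_ocr(char_count=0, image_area=500_000, page_area=520_000))
print(needs_ocr(char_count=3200, image_area=40_000, page_area=520_000))
```

Mixed pages, such as a typeset report with one scanned appendix table, fall between these two cases and usually need a per-region decision rather than a per-page one.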

There's also a product decision every converter has to make: what happens when a long scan exceeds the processing budget? Failing the whole job is simple to implement and a terrible experience. The behavior I chose is to return the Markdown produced within the budget and mark the job as truncated – a partial result with an explicit signal, instead of losing everything. The signal is the important part. A partial result without a truncation marker is just another form of silent data loss.
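
The shape of that behavior can be sketched in a few lines. This is an illustration, not the service's actual code: `convert_page` is a hypothetical per-page callback, and the budget is only checked between pages, so a single slow page can still overrun it.

```python
import time
from dataclasses import dataclass


@dataclass
class Result:
    markdown: str
    truncated: bool  # the explicit signal: True if any page was skipped
    pages_done: int
    pages_total: int


def convert_with_budget(pages, convert_page, budget_seconds, clock=time.monotonic):
    """Convert pages in order until the time budget runs out."""
    deadline = clock() + budget_seconds
    parts = []
    for page in pages:
        if clock() >= deadline:
            break  # keep what was produced instead of failing the whole job
        parts.append(convert_page(page))
    done = len(parts)
    return Result(
        markdown="\n\n".join(parts),
        truncated=done < len(pages),
        pages_done=done,
        pages_total=len(pages),
    )


result = convert_with_budget(["page one", "page two"], str.upper, budget_seconds=60)
print(result.truncated, result.pages_done, result.pages_total)
```

Carrying `pages_done` and `pages_total` alongside the flag lets the caller say how much is missing, not just that something is.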

5. Images are either content or noise

Images in PDFs are ambiguous. Sometimes they're essential – diagrams, charts, stamps, signatures. Sometimes they're decorative backgrounds. Sometimes the whole page is an image but the user wants text, not embedded base64.

So "include images" is not one setting. The practical version is three different intents:

embed images when the Markdown should be self-contained
use placeholders when the user wants clean text output
run OCR on scanned pages when text needs to be recovered

There is no universal best choice. A Markdown file headed for a knowledge base, an LLM prompt, or a legal archive each wants different output.
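
In an API, those three intents fit an explicit mode better than a boolean flag. A small sketch, with the enum values, placeholder text, and function signature all invented for illustration:

```python
from enum import Enum


class ImageMode(Enum):
    EMBED = "embed"              # self-contained Markdown
    PLACEHOLDER = "placeholder"  # clean text output
    OCR = "ocr"                  # recover text from the image


def render_image(mode, index, data_uri="", ocr_text=""):
    """Return the Markdown fragment for one image under the chosen mode."""
    if mode is ImageMode.EMBED:
        return f"![image {index}]({data_uri})"
    if mode is ImageMode.PLACEHOLDER:
        return f"[image {index} omitted]"
    if mode is ImageMode.OCR:
        return ocr_text
    raise ValueError(f"unknown mode: {mode!r}")


print(render_image(ImageMode.PLACEHOLDER, 1))
print(render_image(ImageMode.EMBED, 2, data_uri="data:image/png;base64,AAAA"))
```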

6. The converter itself can fail before the Markdown does

The visible part of a converter is the Markdown. The part that decides whether you can trust it is job reliability, and those failures are boring:

a conversion hangs forever
a heavy OCR job runs out of memory
a worker dies halfway through
a job gets retried too aggressively
one large file blocks everyone else
the user closes the tab before the result is ready

This is where a weekend script and a service diverge. My implementation ended up with a real job lifecycle.

The system tracks each job, retries bounded failures, applies time budgets, deletes input files after processing, and keeps results only for a short retention window. Those limits are not glamorous, but they are part of trust. A converter that accepts anything, promises instant results, and never explains retention isn't more user-friendly – it's just hiding the operational reality.
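
A stripped-down version of such a lifecycle, to show what "bounded" means in practice. The state names, the `TransientError` marker, and the limit of three attempts are assumptions for the sketch, not a description of the production system.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a worker restart."""


@dataclass
class Job:
    job_id: str
    state: State = State.QUEUED
    attempts: int = 0
    error: str = ""


def run_job(job, work, max_attempts=3):
    """Run `work`, retrying transient failures up to max_attempts times."""
    while job.attempts < max_attempts:
        job.attempts += 1
        job.state = State.RUNNING
        try:
            result = work()
        except TransientError as exc:
            job.error = str(exc)  # remember why, then try again
            continue
        except Exception as exc:
            job.error = str(exc)  # not retryable: fail right away
            break
        job.state = State.DONE
        return result
    job.state = State.FAILED
    return None


job = Job("demo")
print(run_job(job, lambda: "# Converted"), job.state, job.attempts)
```

Separating retryable from non-retryable errors is what keeps "retried too aggressively" off the failure list: a corrupt PDF fails once, while a crashed worker gets another chance.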

Why two engines instead of one

There is no single engine that wins on every document, so I run two.

MinerU is the default. It holds up better on dense documents, heavy OCR, Cyrillic content, and table-heavy scans, and it's the safer choice under memory pressure. Docling is opt-in: faster and cleaner on simple, well-structured text PDFs, but less forgiving on heavy full-OCR workloads.

So the question isn't "which engine is best?" – it's "which engine is best for this document?" That's an unsatisfying marketing answer and a useful engineering one.
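
If you wanted to encode that choice rather than leave it to the user, it might look like the sketch below. This is purely hypothetical: the function, its inputs, and the 20% scanned-page threshold are invented, and nothing here implies the service selects engines automatically. The only facts taken from above are that MinerU is the default and Docling suits simple, well-structured text PDFs.

```python
def pick_engine(scanned_share, has_dense_tables, prefer_speed=False):
    """Pick an engine name from rough document traits (hypothetical rule)."""
    if not 0.0 <= scanned_share <= 1.0:
        raise ValueError("scanned_share must be between 0 and 1")
    if scanned_share > 0.2 or has_dense_tables:
        return "mineru"  # heavy OCR or dense tables: stay on the default
    return "docling" if prefer_speed else "mineru"


print(pick_engine(scanned_share=0.9, has_dense_tables=False, prefer_speed=True))
print(pick_engine(scanned_share=0.0, has_dense_tables=False, prefer_speed=True))
```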

How I'd evaluate any PDF-to-Markdown tool

If you're picking a converter, mine or anyone's, don't start with the landing page. Test it with documents that expose different failure modes:

a simple text PDF with headings and lists
a two-column paper with footnotes
a table with merged headers
a scanned invoice or contract
a technical paper with formulas
a document with screenshots or diagrams
a long PDF that might hit a time budget

As you read the output, the useful questions split in two. First, did the structure survive: does the reading order match the original, are the tables actually correct rather than merely tidy, do the formulas come back usable, and does the OCR admit when it can't read handwriting instead of inventing words? Check for the truncation marker too, because a partial result that isn't labelled as partial is a quiet failure.
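
Part of the table check can be automated. The sketch below scans converter output and reports table rows whose cell count differs from the first row of the same table, which catches shifted or dropped columns. It is a structural check only: it cannot tell whether a number sits under the right label, and it does not handle escaped pipes inside cells.

```python
def table_column_mismatches(markdown):
    """Return 1-based line numbers of table rows with an unexpected cell count."""
    issues = []
    expected = None
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if len(stripped) < 2 or not (stripped.startswith("|") and stripped.endswith("|")):
            expected = None  # not a table row: the current table has ended
            continue
        count = len(stripped[1:-1].split("|"))
        if expected is None:
            expected = count  # first row of a table sets the width
        elif count != expected:
            issues.append(number)
    return issues


output = """\
| Quarter | Revenue | Growth |
|---------|---------|--------|
| Q1      | $1.2M   | +8%    |
| Q2      | $1.4M   |
"""
print(table_column_mismatches(output))
```

An empty result means the tables are well-formed, not that they are right, so the visual comparison against the original PDF still applies.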

The second group is the one people skip, and it's the one that matters most: can you find the file-size limit, the retention window and the privacy policy without digging, can you delete a job yourself, and does the tool explain its limits instead of hiding them? PDF conversion usually touches private documents, so a converter has to earn trust before output quality even comes up.

Privacy, in plain language

Here's the model I wanted, stated the way I think every converter should state it:

you can convert without creating an account
uploaded PDFs are deleted after processing
results are kept only for a short download window, then removed automatically
you can delete a job manually
documents are never used to train models
documents are not sold or used for advertising

The full privacy notice spells it out, and the developer docs cover the API and a hosted MCP endpoint for agent workflows. I'm putting this in the article because retention and training policy are product features when you're asking people to upload contracts and reports, not fine print to bury three clicks deep.

Try your worst PDF

The best test isn't a clean sample document. It's the PDF that already broke your previous workflow – a dense financial table, a two-column paper full of formulas, a long scanned report, something you want to feed into an LLM without losing structure.

Throw it at the web app and see how far the honest 90% gets you. No signup, files auto-deleted.

If it works, great. If it breaks, I genuinely want to know which document exposed the failure – drop the kind of PDF you're fighting in the comments. The last 10% of this problem isn't one bug; it's a long list of document-specific edge cases, and real examples are how converters get better.

Written from first-hand work on the project; I used an AI assistant to tighten the structure, not to invent the technical claims.

DEV Community: Dmitry Petrakov