DEV Community: Jerome

Plenty of people talk about RAG, but the first principle of RAG quality that most of them skip is garbage in, garbage out. I wrote a deep dive on 5 open-source PDF to Markdown converters, which may save you some time when you build a RAG pipeline.

Jerome — Thu, 09 Jul 2026 04:23:59 +0000

I benchmarked 5 open-source PDF to Markdown tools for RAG, on real documents (2026)

#opensource #ai #rag #python

Jerome — Thu, 09 Jul 2026 04:03:14 +0000

If you feed PDFs to an LLM or a RAG pipeline, the PDF-to-Markdown step is where quality quietly dies. The text comes out fine, then a table collapses, a formula shatters, and a number changes to something that looks right but isn't.

So I benchmarked the open-source converters people actually reach for: MinerU, Marker, Docling, PyMuPDF4LLM, MarkItDown. I ran them on five deliberately hard documents (a research paper, a Japanese financial report, a US legal brief, a page of nothing but data tables, and a 1973 scan) and scored what an LLM actually needs. I also threw a few hosted tools (Mathpix, CloudConvert, pdf2md, and PDF·Markdown App) in for comparison, since sometimes you just want it done without a Python setup.

Two of the nine silently changed numbers. More on that below.

It's all reproducible. Every source document, all nine tools' raw outputs, the answer keys, and the scoring scripts are open:

→ Benchmark repo on GitHub · Full methodology & per-document scores

Pick one in ten seconds

TL;DR

Open source (run it yourself):

Strongest overall: MinerU (top-end VLM). Excellent tables and formulas, but it deletes footnote text and turns contents lines into fake H1s. The VLM wants a GPU.
Safe, balanced default: Marker. Consistent, good on formulas and figures; stumbles on multi-column layout and reference lists. (Check its commercial-use license.)
RAG pipelines: MinerU, Marker and Docling all keep headings.
Skip MarkItDown: it emits zero Markdown headings, which kills heading-based chunking.
Speed on simple PDFs: PyMuPDF4LLM, until a figure breaks the reading order.

Just want it done, no setup:

Mathpix has the cleanest formulas, but it silently drops numbers in dense tables.
CloudConvert is fine for a one-off, but collapses dense tables into a single cell.
pdfmarkdown.app runs in the browser with nothing uploaded, and stayed the most consistent across every document type.

Capability profiles

Here's how the nine compare across five capabilities. A bigger, rounder shape is a better all-rounder; a dent is a weak spot. It's five hard documents, not a representative sample, so read the shapes, not fine rankings. (The full per-document scores are in the original article.)

The open-source tools

MinerU: strongest overall, two blind spots

Highest recovery of any open-source tool here.

Strong: perfect tables (including a brutal 26-row one) and perfect formulas.
Deletes footnote text: on the legal brief it kept every footnote marker but dropped the text, so 10 of 11 citations point at nothing.
Fake headings: on the financial report it promoted about 13 contents lines into H1s.
Setup: I ran the top-end VLM, a few points above the lighter pipeline on MinerU's own OmniDocBench numbers (about 91 vs 86); it needs a GPU.

MinerU nails the hard tables, including Toyota's dense metrics grid.

Formulas come through intact.

The original brief: its legal citations sit in page-bottom footnotes.

MinerU keeps the superscript markers but drops the footnote text (red box), so the references point at nothing.

Marker: the balanced default

Consistently good, rarely the winner.

Strong: formulas and figures, and it never fell apart on any document.
Scrambles multi-column blocks: it interleaved a two-column attorney signature block into one address.
Breaks reference lists: a Table of Authorities came out with each case name split from its page number.
Detaches row labels: it stripped the labels off a US-GAAP table, leaving 29,929,992 with nothing to say it's net sales.
License: commercial-use conditions worth checking before you ship it.

Tables render cleanly, and the math survives too, which are Marker's real strengths.

The original: a two-column block, left column then right column.

Marker interleaves the two columns line by line, scrambling the reading order (red box).

The original US-GAAP table, with every row labelled: net sales, total assets, and so on.

Marker keeps the numbers but drops the row labels (red box): 29,929,992 with nothing to say it is net sales.

Docling: honest, and allergic to math

Clean text, and it fails out loud instead of faking.

Strong: clean tables, and it marks anything it can't handle as undecoded rather than inventing it. Zero silent errors in the whole test.
Drops every formula and image.
Mangled a table of contents into a duplicated table.
Good fit for text- and table-heavy RAG, poor for anything with math or figures.

Docling's tables are clean and well aligned.

It drops the formulas and images, marking them as undecoded rather than inventing something wrong (red box). Honest, but a gap.

PyMuPDF4LLM: fast, until layout carries meaning

A quick Python library, clean and very fast on simple single-column PDFs.

Strong: speed and dead-simple integration.
Reading order breaks at figures: on the paper it collapsed at the Figure 1 boundary and severed whole sections.
Drops most display formulas.

On the paper, the reading order breaks around the figures and the display formulas and images fall out (red box).

MarkItDown: no headings, anywhere

Handles many formats with a simple API, but it disqualifies itself for RAG.

Zero Markdown headings on every document I tested; section titles become plain text, so heading-based chunking loses its outline.
Tables and formulas shatter.
Drops images.
No OCR: scans need an external OCR or LLM you wire in yourself.
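To make the heading problem concrete, here is a minimal sketch of heading-based chunking. It is an illustration, not the chunker of any particular RAG framework, and it assumes ATX-style headings (`#` to `######`); it does not skip `#` lines inside fenced code blocks.

```python
import re

# ATX heading: 1-6 '#' characters, whitespace, then the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) chunks at ATX headings.

    Text before the first heading is returned under an empty heading.
    """
    chunks = []
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the chunk collected so far, unless it is completely empty.
            if title or any(part.strip() for part in body):
                chunks.append((title, "\n".join(body).strip()))
            title, body = match.group(2), []
        else:
            body.append(line)
    if title or any(part.strip() for part in body):
        chunks.append((title, "\n".join(body).strip()))
    return chunks
```

With real headings the document splits into one chunk per section. When section titles come out as plain text, the same function returns a single untitled chunk holding the whole file, which is the lost outline described above.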

Section titles come out as plain text: not a single Markdown heading in the whole file (red box), so the outline is gone.

Tables and formulas break down as well.

The hosted & browser tools

These need no Python setup, which is the whole appeal when you just want a file converted. The fidelity trade-offs are real, though, especially the silent ones.

Mathpix: beautiful output, with a knife in it

The single cleanest result in the test, with a perfect score on the paper. But that clean output hides two silent errors:

Dropped dividends: on the financial report it kept only the parenthetical interim, so an AI reads 100 where the real number is 220.
Swapped a total: on the data tables it replaced 87,416 with 100.0%.
Both look completely normal. Cloud, paid, with a roughly 10-page free tier.

On the research paper, Mathpix is the cleanest of all nine: formulas and figures come through beautifully.

The original dividend row: the annual figure (220, 240, 260, 300, 375) with the interim dividend in parentheses.

Mathpix keeps only the parenthetical figure (red box): 100, 105, 120, and so on. An AI now reads 100 as the dividend when the real number is 220. The table looks perfectly normal.

CloudConvert: valid tables that are secretly unusable

Emits proper Markdown tables, which sounds good, until a dense one.

Strong: easy, no install, many formats.
Crams rows into one cell: the Markdown is valid but useless as data.
Formulas mangled, figures dropped.

On dense tables it crams several rows into a single cell (red box): valid Markdown, but you can't tell which number belongs to which row.

Formulas come out badly mangled.

And figures are dropped entirely.

pdf2md: the one that quietly cuts your numbers off

The scariest visible failure.

Truncates numbers at the thousands separator (458,614 becomes 458,). At least the dangling comma is greppable.
No real tables, every formula shatters, images dropped.
Invents headings: it promoted a copyright notice and formula fragments into 30+ headings.
One genuine silent swap on the financial report, on top of the visible damage.
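Since the dangling comma is greppable, a rough check can be sketched in a few lines. This is a heuristic I am assuming here, not part of the benchmark's scoring scripts: it flags a digit group followed by a comma with no digit after it.

```python
import re

# A number whose last character is a comma with no digit after it,
# e.g. "458," where the source had "458,614".
DANGLING = re.compile(r"\b\d{1,3}(?:,\d{3})*,(?!\d)")


def find_truncated_numbers(markdown: str) -> list[tuple[int, str]]:
    """Return (line number, match) for every number that ends in a bare comma."""
    hits = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        for match in DANGLING.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

It will also flag ordinary prose such as "100, 105, 120", so treat the hits as lines to inspect by hand rather than as confirmed truncations.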

The original land-use data table.

pdf2md: no table at all, and numbers cut off at the thousands separator (red box), so 546,030 and 87,416 lose their tails.

The Attention formula shattered across headings and code blocks: the fraction is gone.

pdfmarkdown.app: mine, including where it loses

I'll be straight, since it's mine.

Strong: top tier across all five documents with zero silent errors. It kept Toyota's minus signs, bracketed counts and dividends, got all five reading-order seams right on the legal brief, and produced the most complete figure crops of anything I tested.
Where it loses: MinerU and Mathpix edged it on the pure paper; it kept the scan's cover as an image but didn't OCR the cursive title; and on the nastiest multi-level tables (the Attention paper's Table 4) it crammed parser rows into stacked cells, where MinerU reads the layout better.
It's a browser tool, not a scriptable library, so for automated batch pipelines an open-source option fits better.

The Toyota table comes through with the minus signs and bracketed temp-worker counts intact, where Mathpix and pdf2md broke.

Figure 2 comes out whole, both the Scaled Dot-Product and Multi-Head Attention panels with their caption, and the equation renders as clean LaTeX.

The original: Table 4 from the Attention paper, where each parser is its own row.

pdfmarkdown.app on the same table: the four cited-parser rows get crammed into single stacked cells (red box). The values survive but the row structure blurs, and this is the kind of dense table where MinerU is sharper.

How it was scored, and the raw data

Two axes:

Recovery score: the share of a fixed per-document checklist that came through correctly.
Silent-error count: plausible-but-wrong values, kept separate and never averaged in.
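As a rough illustration of how the two axes stay separate, here is a minimal sketch. The `Check` structure and the example checklist in the test are hypothetical, not the benchmark's actual scripts or results; those live in the repo.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """One item on a per-document checklist (hypothetical structure)."""
    name: str
    passed: bool
    silent_error: bool = False  # plausible-looking but wrong value


def score(checks: list[Check]) -> tuple[float, int]:
    """Return (recovery score between 0 and 1, silent-error count)."""
    if not checks:
        raise ValueError("checklist is empty")
    recovery = sum(c.passed for c in checks) / len(checks)
    silent = sum(c.silent_error for c in checks)
    # The two numbers are returned side by side and never averaged together.
    return recovery, silent
```

Keeping the silent-error count out of the average means a tool cannot buy back a changed number with a few extra clean paragraphs.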

I graded one criterion at a time across all nine tools, checked it by hand against the source PDFs, and ran an adversarial second pass. The full writeup, with the methodology, the capability-score derivation, and the reproducible test set and raw outputs, lives in the original article:

→ Best PDF to Markdown Tools in 2026: 9 Converters Tested on Real Documents

I'm Jerome, builder of pdfmarkdown.app. I included direct competitors and tried to credit each fairly. If you think I got a call wrong, tell me.

How to extract figures from a PDF without breaking them

Jerome — Sat, 20 Jun 2026 00:08:00 +0000

Almost everyone who runs a PDF through our converter does the same thing next: they hand the result to an AI. They paste it into ChatGPT, drop it into NotebookLM, or save it to a notes app that reads and summarizes for them. So that is the job I hold us to. Not "produce a Markdown file." Produce a Markdown file an AI can actually read without losing the plot.

Disclosure: I build pdfmarkdown.app, an in-browser PDF→Markdown converter, so I have a horse in this race. The examples below are real, run on real papers, and the claims are checkable. Test anything here yourself.

That goal gives us three rules we try hard not to break:

AI-friendly first. The output is going to be read by a machine, so anything that carries meaning has to survive. If a person glances at it and it looks roughly right, that is not enough. The AI reads what is actually there, not what you assume is there.
Lose nothing. Keep the original information, including the small stuff. The tick numbers on a chart. The fact that two pieces are really one figure. The caption that belongs to it.
Keep it checkable. You should be able to look at the result and trust it, or spot quickly where it went wrong.

Easy to write down. The place those rules get tested hardest is figures, especially in academic papers. So let me show you what goes wrong, using a paper you probably know, and then how to get it right.

A chart that looks fine and isn't

I ran the ResNet paper through a converter recently, to read it alongside an AI. One of its charts came out looking fine at a glance. Then I went to read the actual numbers off it, and there weren't any. The plotted lines were there, the legend was there, but every number on both axes had quietly vanished. No error rate up the side, no iteration count along the bottom.

In the paper, the chart has a scale and units. After a typical extraction, the lines survive and everything that tells you what they mean is gone.

To your eye, skimming, you might not even notice. To an AI, that chart is now close to noise. It sees a few lines drifting downward and has no idea down to what, or over how long. The information that made it a chart is gone, and nothing flagged that it left. That is the worst kind of error: the silent one.

Figures break in two ways

Once we started pulling on this, figures turned out to break in two distinct ways when a PDF becomes text. They have different causes, so they are worth naming separately.

One, information goes missing. The figure is there, but text inside it disappears: axis numbers, axis names, labels on a diagram. That is the chart above.

Two, the layout falls apart. What you see as one figure comes out as several disconnected pieces, often stacked in the wrong order, with the caption attached to only one of them. Here is the attention diagram from the Transformer paper. You know it as a single figure. A typical extraction returns it as two separate images, one piled on top of the other, because the file never said they were one picture.

One figure, stored as two images with nothing linking them, put back together as one.

Why it happens

None of this is the paper's fault, and it is not really the converter being lazy either. It comes down to what a PDF actually is.

A PDF is a picture of a page, not the page's meaning. It records where every mark sits, and that is mostly all it records. It does not say "this is a figure," or "these digits are the scale on an axis," or "these two images are one picture." Put plainly: a PDF doesn't know it has a chart. We checked both of these papers, and none of that meaning is written down anywhere in the file. So whatever rebuilds the page has to work it out from a pile of marks.

That explains both failures.

The missing labels are a font story. Most of the text on that ResNet page survives fine, because the file carries the fonts it needs. But the axis numbers happen to use one of a small set of "standard" fonts that the PDF format assumes every reader already has, so the file does not bother to include it. We rebuild each figure privately, right inside your browser, so your document is never uploaded anywhere. In that private setting there is no copy of that one standard font to draw from, and those particular characters come out blank. It is like a recipe that says "add the house spice blend." Fine in the kitchen that mixed it, useless to you at home with a different set of jars. Everything else on the page kept its own fonts, which is exactly why only the axis numbers went missing.
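The "standard" fonts here are most likely the 14 standard Type 1 fonts from the PDF specification. A minimal sketch of a name-based check follows, assuming you already have each font's base name from whatever PDF library you use. It only looks at names, so it cannot prove a font is missing from the file; it flags candidates worth checking.

```python
import re

# The 14 standard fonts a PDF viewer is traditionally expected to supply.
STANDARD_14 = frozenset({
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
})

# Embedded font subsets carry a six-letter tag, e.g. "ABCDEF+Helvetica".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def relies_on_viewer_font(basefont: str) -> bool:
    """True if the name is a standard-14 font with no subset tag."""
    name = basefont.lstrip("/")
    if SUBSET_PREFIX.match(name):
        return False  # a subset tag means glyph data travels with the file
    return name in STANDARD_14
```

If a figure's labels use a font this flags, those labels are the ones at risk when the renderer has no copy of that font to fall back on.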

The split figure is simpler. The file stored that one diagram as two separate images and never linked them. A quick pass treats them as two figures, stacks them, and pins the caption to one. It is a jigsaw tipped out with no picture on the lid and no hint about which pieces make which image.

How to get figures out whole

So the fix is to do the extra work the file skips. For the missing text, we catch the labels that are about to disappear and draw them back in, in the right place and at the right size, before the figure is saved. For the split figure, we look at where the pieces sit, work out that they are one picture, put them back together, and reattach the caption. The fixed chart at the top of this page and the merged figure above are both the real results, run on the real papers.
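The "look at where the pieces sit" step can be sketched as merging bounding boxes that nearly touch. This is a generic proximity heuristic under my own assumptions (the `Box` type and the `max_gap` threshold are hypothetical), not a description of how pdfmarkdown.app does it.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def gap(a: Box, b: Box) -> float:
    """Largest axis-aligned gap between two boxes (0 if they touch or overlap)."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return max(dx, dy)


def union(a: Box, b: Box) -> Box:
    return Box(min(a.x0, b.x0), min(a.y0, b.y0), max(a.x1, b.x1), max(a.y1, b.y1))


def merge_nearby(boxes: list[Box], max_gap: float) -> list[Box]:
    """Repeatedly merge any two boxes closer than max_gap into their union."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if gap(merged[i], merged[j]) <= max_gap:
                    merged[i] = union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```

Two panels sitting a few points apart collapse into one box, while an unrelated image further down the page stays separate. Choosing `max_gap` is the hard part in practice, and a real implementation would also need to find the caption.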

One thing I care about: we only fix what is broken. For figures that were already complete, we change nothing at all. They come out identical to before, down to the byte. The job is to restore what the PDF dropped, not to repaint things that were already right.

And if you do not even want Markdown, if you just want the figures themselves as clean images, you can take those too. They come out whole, labels and all.

How to tell if your converter is breaking figures

You do not need to take my word for it. Whatever tool you use, run this quick check on a PDF that has charts or multi-part diagrams:

Open a chart in the result and look for the numbers. Are the axis labels and tick numbers still there? If the lines survived but the scale did not, the chart is decorative now, not data.
Find a figure that is really two pictures side by side. Did it come out as one figure with its caption, or did it fall apart into separate images?
Check a diagram's inner labels. Boxes and arrows with no words are just boxes and arrows.

That one look tells you more about a tool's fidelity than any feature list. (If you want a starting point, I keep a running comparison of PDF-to-Markdown tools.)

This is what high fidelity means to us in practice. Not a slogan about quality, but a stubborn refusal to let your document quietly lose pieces of itself on the way to an AI. Charts keep their axes. Figures stay whole. The caption stays with its figure.

You can try ours on your own file right now. It runs entirely in your browser, your document is never uploaded, and your figures keep their labels and their shape.

How to Convert a Markdown File to PDF (Pandoc, VS Code, or Just Your Browser)

Jerome — Fri, 19 Jun 2026 06:57:17 +0000

Originally published at pdfmarkdown.app.

Markdown quietly became the default writing format of the AI era. ChatGPT and Claude answer in it, every README and wiki is written in it, Obsidian and Notion notes live in it. The search numbers say people have noticed: Google searches for "markdown to pdf" are up roughly 10× over the past year, and "md file to pdf" more than 20×.

The funny part is what happens when that Markdown has to leave your ecosystem. Send a raw .md file to a client or a manager and what they see is programmer scribbles, asterisks and pound signs included. For all its ubiquity, Markdown still has no good way to just share a document and trust it will look right on the other person's screen, especially a phone. So we do what people have always done: flatten it into a PDF, the one format that renders the same everywhere.

And then finding a tool for that turns out to be its own little ordeal, which is why you're reading this. There are three good ways to do the conversion, and the right one depends on how often you do it and how much you like terminals. I'll walk through all three, including the parts that bite.

Disclosure: the third option is mine. I build pdfmarkdown.app, which includes a browser-based Markdown to PDF converter. I've tried to be fair to the other two; both are genuinely good at what they do, and I use Pandoc myself.

The short version

You script things and want full control? Pandoc. The most powerful option and the only sane one for converting hundreds of files, at the cost of a LaTeX install measured in gigabytes.
You live in VS Code and convert occasionally? The Markdown PDF extension. It's right there in your editor.
You just want a clean PDF now, with nothing to install? A browser tool. Mine is a free, browser-based Markdown to PDF tool: paste your Markdown and download the PDF when it looks right.

Option 1: Pandoc, the command-line workhorse

Pandoc converts basically any document format into any other. For Markdown to PDF, the basic command is one line:

pandoc notes.md -o notes.pdf

If that worked on the first try on a fresh machine, you got lucky. The usual greeting is:

'pdflatex' not found. Please select a different --pdf-engine or install 'pdflatex'

This is the catch nobody mentions up front: Pandoc doesn't make PDFs by itself. By default it hands the work to LaTeX, which you install separately, and the full distributions (TeX Live on Linux, MacTeX on macOS) run to several gigabytes. If that sounds absurd for converting some notes, TinyTeX is a much smaller distribution built for exactly this situation.
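If you call Pandoc from a script, you can check for an engine before the error shows up. A minimal sketch, assuming the candidate list and its order below (both are my choice, and only a subset of the engines Pandoc accepts for `--pdf-engine`):

```python
import shutil
from typing import Callable, Optional

# A subset of engines Pandoc accepts; the preference order is an assumption.
CANDIDATES = ("pdflatex", "xelatex", "lualatex", "tectonic", "weasyprint", "typst")


def pick_pdf_engine(
    candidates: tuple[str, ...] = CANDIDATES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first candidate engine found on PATH, or None."""
    for name in candidates:
        if which(name):
            return name
    return None


def pandoc_command(src: str, dst: str, engine: Optional[str]) -> list[str]:
    """Build the pandoc argument list; pdflatex is the default, so it needs no flag."""
    cmd = ["pandoc", src, "-o", dst]
    if engine and engine != "pdflatex":
        cmd.append(f"--pdf-engine={engine}")
    return cmd
```

When `pick_pdf_engine()` returns `None`, nothing usable is installed and that is the moment to reach for TinyTeX. The `which` parameter exists only so the lookup can be swapped out in tests.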

Once it runs, the default look is distinctly academic: the Computer Modern serif of a classic LaTeX paper. It's not ugly (that's a respected, very readable typeface), just formal in a way that can feel out of place in a quick note to a non-technical colleague. The tables, for the record, come out clean. A few flags steer it toward something more everyday:

pandoc notes.md -o notes.pdf -V geometry:margin=1in -V fontsize=12pt --toc

-V sets layout variables like margins and font size, and --toc adds a table of contents.

The second classic trap is any character beyond plain English. Feed the default engine CJK text (Chinese, Japanese, Korean) or Cyrillic and it doesn't quietly drop it, it halts outright with Unicode character 中 (U+4E2D) not set up for use with LaTeX. The fix is to switch to the xelatex engine and name a font that contains your glyphs:

pandoc notes.md -o notes.pdf --pdf-engine=xelatex -V CJKmainfont="Songti SC"

Two gotchas I hit running exactly this on a fresh TinyTeX. First, you need the CJK package: tlmgr install xecjk. Second, not every font name resolves. macOS's own PingFang SC would not load for me (xelatex couldn't find it), while Songti SC worked; on Windows try Microsoft YaHei, on Linux Noto Sans CJK SC. And emoji? In my testing they vanish even after all of this, so don't count on them.
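In a script, the engine and font flags can be added only when the text needs them. A rough sketch, assuming the per-platform fonts named above and a deliberately simple CJK test (it checks a few major Unicode blocks and ignores Cyrillic, which would need its own handling):

```python
import platform
from typing import Optional

# Fonts taken from the notes above; keys are platform.system() values.
FONT_BY_SYSTEM = {
    "Darwin": "Songti SC",
    "Windows": "Microsoft YaHei",
    "Linux": "Noto Sans CJK SC",
}


def contains_cjk(text: str) -> bool:
    """Rough check: any character in the main ideograph, kana or Hangul blocks."""
    return any(
        "\u4e00" <= ch <= "\u9fff"      # CJK Unified Ideographs
        or "\u3040" <= ch <= "\u30ff"   # Hiragana and Katakana
        or "\uac00" <= ch <= "\ud7a3"   # Hangul syllables
        for ch in text
    )


def cjk_flags(text: str, system: Optional[str] = None) -> list[str]:
    """Extra pandoc arguments for CJK text, or an empty list if none is needed."""
    if not contains_cjk(text):
        return []
    flags = ["--pdf-engine=xelatex"]
    font = FONT_BY_SYSTEM.get(system or platform.system())
    if font:
        flags += ["-V", f"CJKmainfont={font}"]
    return flags
```

These fonts are Simplified Chinese faces, so Japanese or Korean documents may want a different `CJKmainfont`, and the font still has to be installed for xelatex to find it.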

The upside to all this fiddling is that it's front-loaded. Once the install and the font flags are sorted, the setup keeps working: the same command converts the same way next week and next month, so you pay the tax once and then mostly forget it's there. And once configured, Pandoc is unbeatable for repetition. This converts a whole folder:

for f in docs/*.md; do pandoc "$f" -o "${f%.md}.pdf"; done

If Markdown to PDF is part of a build pipeline or a nightly job, learn Pandoc and don't look back.
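For a nightly job, the same loop in Python makes it easier to collect failures instead of stopping at the first one. A minimal sketch, assuming Pandoc is on your PATH; the function names are mine.

```python
import subprocess
from pathlib import Path


def batch_commands(folder: str) -> list[list[str]]:
    """One pandoc command per .md file in folder, in sorted order."""
    return [
        ["pandoc", str(src), "-o", str(src.with_suffix(".pdf"))]
        for src in sorted(Path(folder).glob("*.md"))
    ]


def convert_folder(folder: str) -> list[Path]:
    """Run every command and return the source files whose conversion failed."""
    failed = []
    for cmd in batch_commands(folder):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(Path(cmd[1]))
    return failed
```

Calling `convert_folder("docs")` mirrors the shell loop, and a non-empty return value is the list of files to look at in the morning.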

Option 2: VS Code, if you're already sitting in it

Install the Markdown PDF extension (the popular one is by yzane), open your file, right-click in the editor, and pick "Markdown PDF: Export (pdf)". That's the whole workflow, which is exactly the appeal.

Under the hood it prints the page with a headless Chromium browser, which the extension downloads on first use, so expect the first export to take a while. The browser engine is good news for output quality though: you get familiar GitHub-style rendering and code highlighting without configuring anything.

The friction shows up when you want it to look different. Custom styling means writing CSS files and pointing the markdown-pdf.styles setting at them, and controlling where pages break means adding CSS rules like page-break-after to your document. Converting a pile of files is also awkward, since everything is built around the editor's one-file-at-a-time flow.

For the occasional "send this doc to someone" moment while you're coding anyway, it's the path of least resistance.

Option 3: your browser, when you just want the PDF

This is the one I built. pdfmarkdown.app/markdown-to-pdf runs in your browser: paste your Markdown (or drop a .md file, or a .zip of Markdown plus its images) and the pages build live in front of you, exactly as they'll export. Free, no signup.

The part I obsessed over is page breaks. The classic failure of quick converters is a table sliced in half across a page edge, or a heading stranded alone at the bottom of page 3 while its section starts on page 4. Here the layout keeps tables, code blocks and figures whole, and because the preview is the actual paginated document, you see any problem before you download rather than after. Long code lines wrap inside the block instead of running off the right edge (a place Pandoc's defaults will spill on you), math renders properly, and there are five themes (Clean, Editorial, Academic, Compact, Technical) to match the document to its reader.

Smart page breaks in action: a block that won't fit is moved whole to the next page rather than sliced across the page edge. Tables, code and figures stay intact.

Exported straight from the browser: math typeset properly, and a long code line wrapped inside the block instead of spilling off the page edge.

One thing I built specifically for the share-it-on-a-phone case from the top of this post: a Phone page size. Most PDFs are A4 or Letter, which on a phone means tiny pinch-to-zoom text. The Phone size lays the page out tall and narrow so the text comes out big and readable on a phone screen with no zooming, which is often exactly the device the person you're sending it to is holding.

Switch the page size to Phone for a tall, narrow PDF that reads on a phone without zooming. Whatever the script (CJK, Cyrillic, accented Latin) plus emoji, it renders with no font setup.

Plenty of other web converters do Markdown to PDF, from the long-running markdowntopdf.com to a steady stream of newer ones. If you go that route, one piece of advice from reading a year's worth of user threads while researching this space: judge the exported file, not the preview. The most common complaint about web converters, by far, is a beautiful preview that exports to a broken PDF, with bold text gone, links dead, or CJK and emoji missing. That's exactly why the preview here is the paginated document: what you see is what downloads.

Honest boundaries: it's a web page, not a pipeline. If you need two hundred files converted on a schedule, that's Pandoc territory today. Browser-based batch is on my mind, though, and so are other gaps (Mermaid diagrams, say). If there's something you'd use that it doesn't do yet, tell me what you need; real use cases are what move a feature up the list. And if you're going the other direction, turning a PDF into Markdown, that's the main thing pdfmarkdown.app does.

Already writing in Obsidian or Typora?

Then you may not need a converter at all. Both can export the current document to PDF directly (in Obsidian it's the "Export to PDF" command), and for a quick whole-document export that's usually enough. The ceiling is control: Obsidian exports the entire note whether you want all of it or not, and in both, fine-tuning the look or the page breaks means digging into custom CSS. When you hit that ceiling, the three routes above give you more room.

Turning a README (or any GitHub doc) into a PDF

This one comes up constantly: a README.md or a docs folder has to go to a client or an auditor who would be confused by a GitHub link. GitHub has no export-to-PDF button, so you have two options.

With Pandoc, tell it the input is GitHub-flavored Markdown so tables and task lists survive:

pandoc README.md -f gfm -o README.pdf

In the browser, paste the raw file into pdfmarkdown.app/markdown-to-pdf. If the README references local images, zip the folder and drop the zip in so the images resolve.

Either way, consider deleting the badge row first (the little build-status shields at the top). Badges are made for repo pages and rarely make sense in a document.
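If you do this often, the badge row can be stripped automatically. A rough sketch follows; the list of URL hints is my own guess at common badge hosts and will miss others, so check the result before converting.

```python
import re

# Markdown image, optionally wrapped in a link:
#   ![alt](url)   or   [![alt](url)](target)
IMAGE = re.compile(r"\[?!\[[^\]]*\]\(([^)\s]+)[^)]*\)(?:\]\([^)]*\))?")

# Substrings that suggest a badge image; an assumption, not a complete list.
BADGE_HINTS = ("img.shields.io", "badge.fury.io", "/badge.svg", "/badges/")


def is_badge_row(line: str) -> bool:
    """True if the line holds only images and every image URL looks like a badge."""
    stripped = line.strip()
    urls = IMAGE.findall(stripped)
    leftover = IMAGE.sub("", stripped).strip()
    return (
        bool(urls)
        and not leftover
        and all(any(hint in url for hint in BADGE_HINTS) for url in urls)
    )


def strip_badge_rows(markdown: str) -> str:
    """Remove badge-only lines and keep everything else, including real images."""
    return "\n".join(
        line for line in markdown.splitlines() if not is_badge_row(line)
    )
```

Ordinary images, such as an architecture diagram, are left alone because their URLs carry none of the hints.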

Frequently asked questions

How do I convert a Markdown file to PDF without installing anything?
Use a free online converter that runs in your browser. pdfmarkdown.app/markdown-to-pdf needs no signup, and shows you the paginated result live before you download it.

How do I convert a Markdown table to PDF without it breaking?
Tables are where most converters stumble: wide ones get their right edge cut off, or the rows collapse into a mess on export. Pandoc handles them well if you pass -f gfm; in the browser, pdfmarkdown.app/markdown-to-pdf keeps each table whole and won't slice one across a page edge. Whatever you use, judge the downloaded file, not the on-screen preview.

Why does Pandoc fail with "pdflatex not found"?
Pandoc delegates PDF generation to a LaTeX engine that isn't installed yet. Install a TeX distribution (TinyTeX if you want small, TeX Live or MacTeX if you want complete), or point --pdf-engine at an engine you already have.

How do I convert a GitHub README to PDF?
GitHub itself can't do it. Either run pandoc README.md -f gfm -o README.pdf on the command line (the -f gfm flag keeps GitHub-style tables intact), or paste the raw Markdown into a browser converter.

What's the best way to batch convert many Markdown files to PDF?
Pandoc in a shell loop: for f in *.md; do pandoc "$f" -o "${f%.md}.pdf"; done. Browser tools and editor extensions are built around one document at a time.

I'm Jerome, the builder of pdfmarkdown.app, a free, browser-based PDF↔Markdown tool. Two of the three options above aren't mine, and I genuinely reach for Pandoc when I'm batch-converting. If I got something wrong, tell me at hey@pdfmarkdown.app.

Your AI agent can't grep a PDF, and it's burning your tokens 🔥

Jerome — Fri, 12 Jun 2026 14:41:17 +0000

Your coding agent can grep your whole repo in milliseconds. It can't treat a PDF the same way.

A PDF is not AI-friendly by default. Even when it contains selectable text, the structure that matters to an agent often gets lost or has to be guessed back: reading order, tables, formulas, captions, and figures. That extraction is lossy, and it is not free.

Here's what's going on under the hood, and why converting once to clean Markdown is the fix.

Disclosure: I build pdfmarkdown.app, an in-browser PDF→Markdown converter, so weigh that accordingly. I've kept the claims checkable, so test them yourself.

A PDF is a picture, not text

Most PDFs don't store your sentences. They store where each glyph sits on the page. "Married" might be saved as a run of positioned glyphs with no record that they form a word, that the word belongs to that paragraph, or that the left column should be read before the right one. (Tagged PDFs can carry logical structure and reading order, but in the wild they're rare or unreliable, so tools can't count on them.)

A human eye reassembles all that instantly. Software has to guess it back, and that guessing is where things break.

A PDF knows where each glyph sits, not the order it should be read in. Markdown stores the order and the structure, which is exactly what a model needs.

The five places it breaks

When a converter (or a model) takes that guess, five things tend to fall apart, and they're the parts that carry the actual meaning:

The five breakpoints: scanned pages, multi-column order, tables, formulas, and images.

Scanned pages are just images. No text layer at all. Without OCR, the model "sees" a photo and quietly makes things up.
Multi-column pages read in the wrong order. A two-column paper gets stitched left-half-line then right-half-line, so sentences interleave into nonsense.
Tables collapse. Rows and columns flatten into one run-on line. The number that was under "2024" ends up floating next to a label from a different row.
Formulas turn to gibberish. E = mc² becomes E mc2, subscripts and superscripts drift, and an equation the paper is about becomes unreadable.
Figures lose their meaning. A chart gets dropped, or at best pulled out as a bare image. In a text or Markdown pipeline (RAG, search, an agent grepping over text), that image carries no meaning. A vision model could look at it, but your retrieval index and your grep can't.
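The multi-column failure is easy to reproduce with a toy example. In this sketch the spans, their coordinates and the `split_x` column boundary are all made up for illustration; real extractors have to infer the boundary, which is where they go wrong.

```python
from typing import NamedTuple


class Span(NamedTuple):
    x: float   # left edge on the page
    y: float   # top edge, growing downward
    text: str


def naive_order(spans: list[Span]) -> list[str]:
    """Read strictly top to bottom, then left to right: interleaves columns."""
    return [s.text for s in sorted(spans, key=lambda s: (s.y, s.x))]


def column_order(spans: list[Span], split_x: float) -> list[str]:
    """Read the whole left column first, then the right column."""
    left = sorted((s for s in spans if s.x < split_x), key=lambda s: s.y)
    right = sorted((s for s in spans if s.x >= split_x), key=lambda s: s.y)
    return [s.text for s in left + right]


page = [
    Span(50, 100, "The model"),
    Span(320, 100, "Results improve"),
    Span(50, 120, "uses attention."),
    Span(320, 120, "on both tasks."),
]
print(" ".join(naive_order(page)))
print(" ".join(column_order(page, split_x=300)))
```

The first line printed mixes the two columns sentence by sentence; the second reads each column through before moving on.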

The fix: clean Markdown is the format AI actually reads well

Markdown is plain text with light, explicit structure: # for headings, real rows and columns for tables, fenced blocks for code. The plainness is the whole point:

The structure is stated, not guessed. The reading order, the table shape and the hierarchy are all written down.
It's greppable and token-cheap. It's plain text, so an agent can search it line by line, and there's no binary cruft for a model to wade through.
Models were trained on mountains of it (every README, every wiki, every docs site), so they parse it natively.
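"Greppable" here means nothing more exotic than a line-by-line search. A minimal sketch of what an agent effectively does with a Markdown file (the sample table in the test is invented):

```python
def grep(markdown: str, needle: str) -> list[tuple[int, str]]:
    """Return (line number, line) for each line containing needle, ignoring case."""
    needle = needle.lower()
    return [
        (lineno, line)
        for lineno, line in enumerate(markdown.splitlines(), start=1)
        if needle in line.lower()
    ]
```

Because a Markdown table keeps a label and its value on the same line, one hit returns both, and only that line has to go into the model's context.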

Convert the PDF once into clean Markdown and you've done the hard, lossy extraction a single time, deliberately, instead of making every tool redo it (badly) on every query.

Where llms.txt fits in

This is the same idea behind llms.txt, an emerging convention where a site publishes a plain-Markdown map of its important content so AI tools can read it directly, instead of fighting through rendered HTML or PDFs. If you want AI to read something, hand it clean Markdown. A PDF on your disk and a webpage an AI crawls have the exact same problem, and the exact same fix.

Turning a PDF into AI-ready Markdown: what to watch

If you convert a PDF, judge the result on the parts that actually break, not on whether the first paragraph looks fine. Check four things:

Did the tables survive as real rows and columns?
Did the formulas survive as readable math?
Were scanned pages recognized, or silently handed back as garbage?
Did the figures make it into the output at all?
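If you convert in bulk, the four checks above can be roughed out as a script. This is a heuristic sketch, not a validator: the function name, the regexes and the 20-word threshold are my assumptions, and a pass only tells you the element exists, not that it's correct. (A line like "$5 and $10" will fool the math check, for example.)

```python
import re

# A GFM table delimiter row, e.g. |------|---------| (two or more columns).
TABLE_DELIMITER = re.compile(r"\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)+\|?\s*")


def check_markdown(markdown, min_words=20):
    """Flag which of the four fragile elements show up in converted Markdown."""
    lines = markdown.splitlines()
    has_table = any(TABLE_DELIMITER.fullmatch(line) for line in lines)
    # Math that survived usually appears as $...$ or $$...$$ LaTeX.
    has_math = bool(re.search(r"\$[^$\n]+\$|\$\$", markdown))
    # Figures that survived appear as image references: ![alt](path).
    has_images = bool(re.search(r"!\[[^\]]*\]\([^)]+\)", markdown))
    # A scan with no text layer tends to come back nearly empty.
    word_count = len(re.findall(r"\w+", markdown))
    return {
        "tables": has_table,
        "math": has_math,
        "images": has_images,
        "has_text": word_count >= min_words,
    }
```

Treat a `False` as a prompt to open the page and look, since the script can't tell a PDF with no tables from a PDF whose tables collapsed.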

This is the bar I hold pdfmarkdown.app to: it runs in your browser, shows you the original PDF and the Markdown side by side so you can check those four things before you trust the output, and when a page is genuinely hard (a scan with no text layer) it says so up front instead of faking it. It's a floor I can show you, not a "perfect conversion" promise, because nobody can honestly make that one.

Original PDF on the left, generated Markdown on the right. This is the Attention Is All You Need paper: the figure keeps its caption, and the equation comes through as real math.

"But models keep getting smarter, won't this just go away?"

Maybe accuracy improves. Two things don't go away, and they matter more as agents take over, not less:

1. Tokens. A PDF has to be parsed into text before a model can do anything with it. In the naive pattern (attach the PDF to each chat) you re-pay that parse on every turn. Prompt caching and RAG soften it, but they're working around the same root cause: the PDF was never text to begin with. Convert it once to Markdown and the parse is done for good: cheap to embed, cheap to search, cheap to ask about.

2. Agents read on demand. Claude Code and Codex don't slurp whole files into context; they grep and search for the few lines they need, when they need them. A PDF can't be searched that way without first extracting it to text, which is exactly "convert it to Markdown." Do it once and your agent treats it like any other file in the repo.

How an agent actually reads: Markdown lets it pull the three lines it needs. A PDF has to be decoded whole before it can search at all.
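As a rough illustration of on-demand reading, here is a grep-style lookup over Markdown text. It is a simplified stand-in, not how any particular agent is implemented, and the report contents are invented sample data.

```python
import re


def grep_markdown(markdown, pattern, context=1):
    """Return (line_number, line) pairs for matches plus nearby context lines."""
    lines = markdown.splitlines()
    regex = re.compile(pattern, re.IGNORECASE)
    keep = set()
    for i, line in enumerate(lines):
        if regex.search(line):
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    return [(i + 1, lines[i]) for i in sorted(keep)]


# Invented sample document, standing in for a converted PDF.
doc = """# Annual report
## Revenue
| Year | Revenue |
|------|---------|
| 2023 | 8.1 |
| 2024 | 9.4 |
## Outlook
Growth is expected to continue."""

for number, line in grep_markdown(doc, r"2024", context=0):
    print(number, line)
# 6 | 2024 | 9.4 |
```

The lookup touches only the lines it needs. Nothing like this works on the raw bytes of a PDF until someone has extracted the text first.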

So the trend runs opposite to the intuition. As AI shifts from chatting with one document to agents navigating a whole library of code and docs, the PDF becomes a bigger bottleneck, not a smaller one. Better models make the agent pattern more common, which makes clean Markdown more necessary, not less.

"I just keep my PDFs in Obsidian, do I still need this?"

Especially then. A vault lives or dies on what you can search, link and fold into other notes, and a raw PDF sitting in it is a dead end: you can't [[link]] to a heading inside it, can't pull one paragraph into a daily note, can't grep it. Convert it to Markdown and the PDF becomes a first-class note like everything else, readable by you and by any AI you point at your vault.

The short version

Most PDFs store where glyphs sit, not the order they should be read in, so anything reading one has to guess, and it guesses worst on tables, formulas, multi-column pages, scans and figures.
You can't grep or embed a PDF until it's been extracted to text. Clean Markdown is that text, with the structure intact: greppable, token-cheap, and what models read natively. llms.txt is the same idea for the web.
Smarter models don't retire the problem. Token cost and agent-style on-demand reading make converting-once-to-Markdown more valuable over time.

Convert a PDF to clean Markdown once, glance over it to confirm the tables and formulas came through, and from then on every tool, model and agent you hand it to reads the real thing instead of guessing at the original.

The Best PDF to Markdown Tools in 2026 (Honestly Compared)

Jerome — Wed, 10 Jun 2026 14:23:24 +0000

Turning a PDF into Markdown sounds simple until you try it on a real document. The text comes out fine. Then the tables collapse into mush, the formulas turn to gibberish, the figures vanish, and a two-column research paper reads in the wrong order. Markdown is how documents get fed to AI tools, pasted into notes, and stored in wikis, so "mostly right" usually isn't good enough.

I compared the tools people actually reach for, judged on the parts that break: tables, formulas, images, scanned pages, reading order, and how much setup it takes to get there.

Upfront disclosure: I'm the maker of pdfmarkdown.app, one of the tools below, so factor that in. I've tried hard to be fair; every other tool here is genuinely good at something, and I say so. Check the claims yourself; tools change.

The short version

Just want clean Markdown without installing anything? Use a browser tool like pdfmarkdown.app: private, no signup, and you can see what you're getting before you trust it.
A developer building a RAG or document pipeline? Reach for an open-source library: Marker, Docling, or MarkItDown.
Mostly heavy math, scientific papers, or handwriting? Mathpix is the specialist.
An occasional, mixed-format conversion? A general converter like CloudConvert is fine.

There's no single winner. The right pick depends on whether you live in a terminal, and what's actually in your PDFs.

pdfmarkdown.app: best for non-developers who want it clean and private

Best for: anyone who wants clean Markdown in seconds, without a command line or an upload.

This is mine, so weigh it accordingly. The idea is to do the hard parts (tables, formulas rendered with real math typesetting, images, stripping page headers and footers) entirely in your browser, so the file never leaves your device. The part I care most about: you see the original PDF and the Markdown side by side, and when a page is hard to read cleanly, like a scanned page with no real text layer, it tells you up front rather than quietly handing you garbage. So you can check it before you paste it somewhere.

▶ Try it live at pdfmarkdown.app — drop in a PDF and watch it turn into Markdown side by side: the original on the left, the generated Markdown on the right.

Strengths: runs in the browser (private, no signup, free), keeps tables and formulas readable, shows you the result side-by-side, honest about scanned / hard pages instead of faking them.

Weaknesses: it's a web app, not a scriptable library; if you want to batch thousands of files in a pipeline, an open-source tool fits better. Formulas mostly come through as real math, but the occasional one still trips it up. And very hard scanned documents are hard for everyone, me included.

MarkItDown: best free tool for developers prepping files for an LLM

Best for: developers who want a quick, free way to turn many file types into Markdown for an LLM.

Microsoft's open-source MarkItDown is a Python library and CLI that converts PDFs (plus Office files, images, audio and more) into Markdown aimed squarely at language models. It's fast, free, and trivial to drop into a script.
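A minimal sketch of what "drop into a script" looks like. `MarkItDown().convert(...)` and `.text_content` are the library's own API as far as I know; the wrapper name `pdf_to_markdown`, the optional `converter` argument and the `report.pdf` path are mine. Check the current README for install extras, which have changed between versions.

```python
def pdf_to_markdown(pdf_path, converter=None):
    """Convert one file to Markdown text using MarkItDown."""
    if converter is None:
        # Imported lazily so this module loads even without markitdown installed.
        from markitdown import MarkItDown  # pip install "markitdown[all]"
        converter = MarkItDown()
    result = converter.convert(str(pdf_path))
    return result.text_content


# Usage (hypothetical path):
# print(pdf_to_markdown("report.pdf"))
```

Passing the converter in lets you reuse one instance across many files, and makes the function easy to exercise without the library.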

Strengths: open-source, handles many formats, made for LLM input, easy to automate.

Weaknesses: it's a library, so there's no UI and no preview; you don't see problems until later. Its handling of complex tables, dense math and scanned pages is basic compared with the heavier extractors below.

Marker: best open-source quality for complex PDFs

Best for: developers who want the highest-fidelity open-source conversion and can run Python.

Marker is one of the strongest open-source PDF→Markdown converters: it handles tables, equations and images well, restores reading order, and can optionally use an LLM to boost accuracy.

Strengths: excellent extraction quality, good with equations and tables, actively developed.

Weaknesses: it takes real setup (Python, and ideally a GPU for speed). It's a developer tool, not something you'd hand a non-technical colleague.

Docling: best for RAG and document pipelines

Best for: teams building retrieval-augmented generation (RAG) or structured document workflows.

IBM's open-source Docling focuses on document understanding: clean structure, solid tables, and exports designed to feed downstream AI pipelines. If your endpoint is a vector database rather than a human reader, it's a strong fit.
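For reference, the basic Docling call is short. `DocumentConverter`, `convert` and `export_to_markdown` come from Docling's documented quick start as I understand it; the wrapper name `docling_to_markdown`, the optional `converter` argument and the sample path are my own additions.

```python
def docling_to_markdown(source, converter=None):
    """Convert a PDF (path or URL) to Markdown using Docling."""
    if converter is None:
        # Imported lazily so this module loads even without docling installed.
        from docling.document_converter import DocumentConverter  # pip install docling
        converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()


# Usage (hypothetical path):
# markdown = docling_to_markdown("annual_report.pdf")
```

The `result.document` object is the structured representation, so a pipeline can work from it directly rather than from the Markdown string.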

Strengths: structured output, good tables, pipeline- and RAG-oriented, open-source.

Weaknesses: developer-oriented; overkill if you just want to read one PDF as Markdown.

Mathpix: best for heavy math and scientific papers

Best for: scientific and technical documents that are mostly equations, or even handwriting.

Mathpix is the specialist for math. Its OCR for formulas, including handwritten ones, is best in class, which makes it the go-to for STEM papers and problem sets.

Strengths: outstanding formula and scientific OCR, handles handwriting, polished.

Weaknesses: commercial and paid, with usage limits on the free tier; narrower than a general converter if your documents are mostly prose and tables.

CloudConvert & general web converters: best for the occasional job

Best for: a one-off conversion where you don't need perfect fidelity.

General converters like CloudConvert handle dozens of formats including PDF→Markdown. They're convenient when you already use them for other conversions.

Strengths: convenient, many formats, no install.

Weaknesses: these tools are built for shuffling file formats, not for document fidelity. In my testing, images were dropped entirely and most tables and formulas came out garbled. Files are also uploaded to a server (a privacy consideration for sensitive documents), and volume is gated by credits or limits.

A note on Pandoc, Adobe, and heavier tools

A few names that come up a lot:

Pandoc is the universal document converter, but it goes from Markdown to other formats far better than the reverse; it isn't really built to read an arbitrary PDF into clean Markdown. For Markdown → PDF it's excellent; for PDF → Markdown, look elsewhere.
Adobe (Acrobat and the PDF Services API) extracts accurately and is built for enterprises. The API has a free tier, but it's developer- and business-oriented, aimed at production workflows rather than a quick one-off conversion.
The developer heavyweights (MinerU, LlamaParse and Mistral OCR) are increasingly used in serious RAG and document pipelines. I didn't make them main picks because this guide leans toward simpler, no-setup options, but if you're building a production pipeline they're worth evaluating.

How to choose

A quick decision guide:

If you are…	Start with
A non-developer who wants it clean, private and fast	pdfmarkdown.app or a general web tool
A developer prepping files for an LLM, fast	MarkItDown
A developer who needs the best open-source quality	Marker
Building a RAG / document pipeline	Docling
Working mostly with heavy math or handwriting	Mathpix
Doing a one-off, mixed-format conversion	CloudConvert

Frequently asked questions

What's the best free PDF to Markdown tool?
For non-developers, a browser-based tool like pdfmarkdown.app is free and needs no signup. For developers, MarkItDown, Marker and Docling are all free and open-source, though Marker's license carries some commercial-use conditions worth checking before you ship it in a product.

Which PDF to Markdown tool keeps tables and formulas intact?
Tables and formulas are exactly where most tools fail. Among open-source options, Marker handles them best; for browser use, pdfmarkdown.app renders real math and keeps tables readable; for math-heavy documents specifically, Mathpix leads.

Is it safe to convert a confidential PDF online?
It depends on the tool. Most web converters upload your file to a server. Browser-based tools like pdfmarkdown.app do the work on your own device, so the file never leaves it. That's the safer choice for sensitive documents.

What's the best PDF to Markdown tool for RAG?
For retrieval-augmented generation, Docling and Marker are built for structured, pipeline-friendly output. MarkItDown is a lighter, faster option when you just need usable Markdown quickly.

I'm Jerome, the builder of pdfmarkdown.app, a free, browser-based PDF↔Markdown tool. I included direct competitors and tried to credit each one fairly. If you think I got a call wrong, tell me at hey@pdfmarkdown.app.