<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jerome</title>
    <description>The latest articles on DEV Community by Jerome (@jeromebuilds).</description>
    <link>https://dev.to/jeromebuilds</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3975376%2Fb318fdf9-ab53-4402-9e90-936f6344423e.png</url>
      <title>DEV Community: Jerome</title>
      <link>https://dev.to/jeromebuilds</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jeromebuilds"/>
    <language>en</language>
    <item>
      <title>How to extract figures from a PDF without breaking them</title>
      <dc:creator>Jerome</dc:creator>
      <pubDate>Sat, 20 Jun 2026 00:08:00 +0000</pubDate>
      <link>https://dev.to/jeromebuilds/how-to-extract-figures-from-a-pdf-without-breaking-them-g8l</link>
      <guid>https://dev.to/jeromebuilds/how-to-extract-figures-from-a-pdf-without-breaking-them-g8l</guid>
      <description>&lt;p&gt;Almost everyone who runs a PDF through our converter does the same thing next: they hand the result to an AI. They paste it into ChatGPT, drop it into NotebookLM, or save it to a notes app that reads and summarizes for them. So that is the job I hold us to. Not "produce a Markdown file." Produce a Markdown file an AI can actually read without losing the plot.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; I build &lt;a href="https://pdfmarkdown.app?utm_source=devto" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt;, an in-browser PDF→Markdown converter, so I have a horse in this race. The examples below are real, run on real papers, and the claims are checkable. Test anything here yourself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That goal gives us three rules we try hard not to break:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI-friendly first.&lt;/strong&gt; The output is going to be read by a machine, so anything that carries meaning has to survive. If a person glances at it and it looks roughly right, that is not enough. The AI reads what is actually there, not what you assume is there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lose nothing.&lt;/strong&gt; Keep the original information, including the small stuff. The tick numbers on a chart. The fact that two pieces are really one figure. The caption that belongs to it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep it checkable.&lt;/strong&gt; You should be able to look at the result and trust it, or spot quickly where it went wrong.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Easy to write down. The place those rules get tested hardest is figures, especially in academic papers. So let me show you what goes wrong, using a paper you probably know, and then how to get it right.&lt;/p&gt;

&lt;h2&gt;
  
  
  A chart that looks fine and isn't
&lt;/h2&gt;

&lt;p&gt;I ran the ResNet paper through a converter recently, to read it alongside an AI. One of its charts came out looking fine at a glance. Then I went to read the actual numbers off it, and there weren't any. The plotted lines were there, the legend was there, but every number on both axes had quietly vanished. No error rate up the side, no iteration count along the bottom.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0y5nbu795jmuf6y1h6an.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0y5nbu795jmuf6y1h6an.png" alt="Top: the ResNet chart as it appears in the paper, with axis numbers and labels. Bottom: the same chart after a typical extraction, with all axis labels gone." width="799" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In the paper, the chart has a scale and units. After a typical extraction, the lines survive and everything that tells you what they mean is gone.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To your eye, skimming, you might not even notice. To an AI, that chart is now close to noise. It sees a few lines drifting downward and has no idea down to what, or over how long. The information that made it a chart is gone, and nothing flagged that it left. That is the worst kind of error: the silent one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Figures break in two ways
&lt;/h2&gt;

&lt;p&gt;Once we started pulling on this, figures turned out to break in two distinct ways when a PDF becomes text. They have different causes, so they are worth naming separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One, information goes missing.&lt;/strong&gt; The figure is there, but text inside it disappears: axis numbers, axis names, labels on a diagram. That is the chart above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two, the layout falls apart.&lt;/strong&gt; What you see as one figure comes out as several disconnected pieces, often stacked in the wrong order, with the caption attached to only one of them. Here is the attention diagram from the Transformers paper. You know it as a single figure. A typical extraction returns it as two separate images, one piled on top of the other, because the file never said they were one picture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F881gepex4lyafo1r8ai0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F881gepex4lyafo1r8ai0.png" alt="Before: one figure returned as two separate stacked images. After: the same figure kept whole, with both panels and labels." width="792" height="1609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;One figure, stored as two images with nothing linking them, put back together as one.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it happens
&lt;/h2&gt;

&lt;p&gt;None of this is the paper's fault, and it is not really the converter being lazy either. It comes down to what a PDF actually is.&lt;/p&gt;

&lt;p&gt;A PDF is &lt;a href="https://pdfmarkdown.app/blog/why-ai-cant-read-your-pdfs?utm_source=devto" rel="noopener noreferrer"&gt;a picture of a page, not the page's meaning&lt;/a&gt;. It records where every mark sits, and that is mostly all it records. It does not say "this is a figure," or "these digits are the scale on an axis," or "these two images are one picture." Put plainly: a PDF doesn't know it has a chart. We checked both of these papers, and none of that meaning is written down anywhere in either file. So whatever rebuilds the page has to work it out from a pile of marks.&lt;/p&gt;

&lt;p&gt;That explains both failures.&lt;/p&gt;

&lt;p&gt;The missing labels are a font story. Most of the text on that ResNet page survives fine, because the file carries the fonts it needs. But the axis numbers happen to use one of a small set of "standard" fonts that the PDF format assumes every reader already has, so the file does not bother to include it. We rebuild each figure privately, right inside your browser, so your document is never uploaded anywhere. In that private setting there is no copy of that one standard font to draw from, and those particular characters come out blank. It is like a recipe that says "add the house spice blend." Fine in the kitchen that mixed it, useless to you at home with a different set of jars. Everything else on the page kept its own fonts, which is exactly why only the axis numbers went missing.&lt;/p&gt;

&lt;p&gt;The split figure is simpler. The file stored that one diagram as two separate images and never linked them. A quick pass treats them as two figures, stacks them, and pins the caption to one. It is a jigsaw tipped out with no picture on the lid and no hint about which pieces make which image.&lt;/p&gt;
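
&lt;p&gt;If you want to see the font half of that story in your own file, here is a minimal sketch. It assumes Poppler's &lt;code&gt;pdffonts&lt;/code&gt; is installed and that its table ends with the &lt;code&gt;emb sub uni object ID&lt;/code&gt; columns; &lt;code&gt;list_unembedded_fonts&lt;/code&gt; and &lt;code&gt;paper.pdf&lt;/code&gt; are names I made up for the example. It is not how our converter works, only a quick way to list fonts the file refers to but does not carry.&lt;/p&gt;

```shell
# Reads `pdffonts` output on stdin and prints the name of every font whose
# "emb" (embedded) column says "no".
# Assumption: Poppler's column layout, where the last five fields of a row
# are emb, sub, uni and the two-part object ID. Counting from the right
# sidesteps font types that contain spaces, such as "Type 1".
list_unembedded_fonts() {
  awk 'NR > 2 { if ($(NF - 4) == "no") print $1 }'
}

# Usage (paper.pdf is a placeholder file name):
#   pdffonts paper.pdf | list_unembedded_fonts
```

&lt;p&gt;Any name it prints is a font the reader is expected to supply on its own, which is where labels are most at risk.&lt;/p&gt;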

&lt;h2&gt;
  
  
  How to get figures out whole
&lt;/h2&gt;

&lt;p&gt;So the fix is to do the extra work the file skips. For the missing text, we catch the labels that are about to disappear and draw them back in, in the right place and at the right size, before the figure is saved. For the split figure, we look at where the pieces sit, work out that they are one picture, put them back together, and reattach the caption. The fixed chart at the top of this page and the merged figure above are both the real results, run on the real papers.&lt;/p&gt;

&lt;p&gt;One thing I care about: we only fix what is broken. For figures that were already complete, we change nothing at all. They come out identical to before, down to the byte. The job is to restore what the PDF dropped, not to repaint things that were already right.&lt;/p&gt;

&lt;p&gt;And if you do not even want Markdown, if you just want the figures themselves as clean images, you can take those too. They come out whole, labels and all.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to tell if your converter is breaking figures
&lt;/h2&gt;

&lt;p&gt;You do not need to take my word for it. Whatever tool you use, run this quick check on a PDF that has charts or multi-part diagrams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Open a chart in the result and look for the numbers.&lt;/strong&gt; Are the axis labels and tick numbers still there? If the lines survived but the scale did not, the chart is decorative now, not data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Find a figure that is really two pictures side by side.&lt;/strong&gt; Did it come out as one figure with its caption, or did it fall apart into separate images?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check a diagram's inner labels.&lt;/strong&gt; Boxes and arrows with no words are just boxes and arrows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That quick check tells you more about a tool's fidelity than any feature list. (If you want a starting point, I keep a &lt;a href="https://pdfmarkdown.app/blog/best-pdf-to-markdown-tools?utm_source=devto" rel="noopener noreferrer"&gt;running comparison of PDF-to-Markdown tools&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;This is what high fidelity means to us in practice. Not a slogan about quality, but a stubborn refusal to let your document quietly lose pieces of itself on the way to an AI. Charts keep their axes. Figures stay whole. The caption stays with its figure.&lt;/p&gt;

&lt;p&gt;You can &lt;a href="https://pdfmarkdown.app?utm_source=devto" rel="noopener noreferrer"&gt;try ours&lt;/a&gt; on your own file right now. It runs entirely in your browser, your document is never uploaded, and your figures keep their labels and their shape.&lt;/p&gt;

</description>
      <category>pdf</category>
      <category>ai</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Convert a Markdown File to PDF (Pandoc, VS Code, or Just Your Browser)</title>
      <dc:creator>Jerome</dc:creator>
      <pubDate>Fri, 19 Jun 2026 06:57:17 +0000</pubDate>
      <link>https://dev.to/jeromebuilds/how-to-convert-a-markdown-file-to-pdf-pandoc-vs-code-or-just-your-browser-19ni</link>
      <guid>https://dev.to/jeromebuilds/how-to-convert-a-markdown-file-to-pdf-pandoc-vs-code-or-just-your-browser-19ni</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://pdfmarkdown.app/blog/how-to-convert-markdown-to-pdf" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Markdown quietly became the default writing format of the AI era. ChatGPT and Claude answer in it, every README and wiki is written in it, Obsidian and Notion notes live in it. The search numbers say people have noticed: Google searches for "markdown to pdf" are up roughly 10× over the past year, and "md file to pdf" more than 20×.&lt;/p&gt;

&lt;p&gt;The funny part is what happens when that Markdown has to leave your ecosystem. Send a raw &lt;code&gt;.md&lt;/code&gt; file to a client or a manager and what they see is programmer scribbles, asterisks and pound signs included. For all its ubiquity, Markdown still has no good way to just &lt;em&gt;share&lt;/em&gt; a document and trust it will look right on the other person's screen, especially a phone. So we do what people have always done: flatten it into a PDF, the one format that renders the same everywhere.&lt;/p&gt;

&lt;p&gt;And then finding a tool for &lt;em&gt;that&lt;/em&gt; turns out to be its own little ordeal, which is why you're reading this. There are three good ways to do the conversion, and the right one depends on how often you do it and how much you like terminals. I'll walk through all three, including the parts that bite.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; the third option is mine. I build &lt;a href="https://pdfmarkdown.app" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt;, which includes a browser-based &lt;a href="https://pdfmarkdown.app/markdown-to-pdf" rel="noopener noreferrer"&gt;Markdown to PDF converter&lt;/a&gt;. I've tried to be fair to the other two; both are genuinely good at what they do, and I use Pandoc myself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;You script things and want full control?&lt;/strong&gt; Pandoc. The most powerful option and the only sane one for converting hundreds of files, at the cost of a LaTeX install measured in gigabytes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You live in VS Code and convert occasionally?&lt;/strong&gt; The Markdown PDF extension. It's right there in your editor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You just want a clean PDF now, with nothing to install?&lt;/strong&gt; A browser tool. Mine is a free, &lt;a href="https://pdfmarkdown.app/markdown-to-pdf" rel="noopener noreferrer"&gt;browser-based Markdown to PDF tool&lt;/a&gt;: paste your Markdown and download the PDF when it looks right.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Option 1: Pandoc, the command-line workhorse
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://pandoc.org/" rel="noopener noreferrer"&gt;Pandoc&lt;/a&gt; converts basically any document format into any other. For Markdown to PDF, the basic command is one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pandoc notes.md &lt;span class="nt"&gt;-o&lt;/span&gt; notes.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that worked on the first try on a fresh machine, you got lucky. The usual greeting is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;'pdflatex' not found. Please select a different --pdf-engine or install 'pdflatex'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the catch nobody mentions up front: Pandoc doesn't make PDFs by itself. By default it hands the work to LaTeX, which you install separately, and the full distributions (TeX Live on Linux, MacTeX on macOS) run to several gigabytes. If that sounds absurd for converting some notes, &lt;a href="https://yihui.org/tinytex/" rel="noopener noreferrer"&gt;TinyTeX&lt;/a&gt; is a much smaller distribution built for exactly this situation.&lt;/p&gt;
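
&lt;p&gt;Before installing anything, it can be worth checking whether a usable engine is already on the machine. This is a rough sketch, not part of Pandoc: &lt;code&gt;first_available&lt;/code&gt; is a helper name I made up, and the engine list in the usage comment is only a subset of what &lt;code&gt;--pdf-engine&lt;/code&gt; accepts.&lt;/p&gt;

```shell
# Prints the first of the given commands that exists on PATH.
# Prints nothing and returns non-zero if none of them is installed.
first_available() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Usage:
#   engine=$(first_available pdflatex xelatex lualatex tectonic) || echo "no PDF engine found"
#   pandoc notes.md -o notes.pdf --pdf-engine="$engine"
```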

&lt;p&gt;Once it runs, the default look is distinctly academic: the Computer Modern serif of a classic LaTeX paper. It's not ugly (that's a respected, very readable typeface), just formal in a way that can feel out of place in a quick note to a non-technical colleague. The tables, for the record, come out clean. A few flags steer it toward something more everyday:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pandoc notes.md &lt;span class="nt"&gt;-o&lt;/span&gt; notes.pdf &lt;span class="nt"&gt;-V&lt;/span&gt; geometry:margin&lt;span class="o"&gt;=&lt;/span&gt;1in &lt;span class="nt"&gt;-V&lt;/span&gt; &lt;span class="nv"&gt;fontsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;12pt &lt;span class="nt"&gt;--toc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;-V&lt;/code&gt; sets layout variables like margins and font size, and &lt;code&gt;--toc&lt;/code&gt; adds a table of contents.&lt;/p&gt;

&lt;p&gt;The second classic trap is any character beyond plain English. Feed the default engine CJK text (Chinese, Japanese, Korean) or Cyrillic and it doesn't quietly drop it, it halts outright with &lt;code&gt;Unicode character 中 (U+4E2D) not set up for use with LaTeX&lt;/code&gt;. The fix is to switch to the &lt;code&gt;xelatex&lt;/code&gt; engine and name a font that contains your glyphs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pandoc notes.md &lt;span class="nt"&gt;-o&lt;/span&gt; notes.pdf &lt;span class="nt"&gt;--pdf-engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;xelatex &lt;span class="nt"&gt;-V&lt;/span&gt; &lt;span class="nv"&gt;CJKmainfont&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Songti SC"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two gotchas I hit running exactly this on a fresh TinyTeX. First, you need the CJK package: &lt;code&gt;tlmgr install xecjk&lt;/code&gt;. Second, not every font name resolves. macOS's own &lt;code&gt;PingFang SC&lt;/code&gt; would not load for me (xelatex couldn't find it), while &lt;code&gt;Songti SC&lt;/code&gt; worked; on Windows try &lt;code&gt;Microsoft YaHei&lt;/code&gt;, on Linux &lt;code&gt;Noto Sans CJK SC&lt;/code&gt;. And emoji? In my testing they vanish even after all of this, so don't count on them.&lt;/p&gt;
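
&lt;p&gt;If the same command has to run on more than one operating system, the font choice can be wrapped in a small helper. A sketch under assumptions: &lt;code&gt;cjk_font_for&lt;/code&gt; is a made-up name, the font names are simply the ones mentioned above (only &lt;code&gt;Songti SC&lt;/code&gt; was actually tested), and the Windows branch assumes a Git Bash, MSYS or Cygwin shell.&lt;/p&gt;

```shell
# Maps a kernel name (as printed by `uname -s`) to a CJK font to try.
# Songti SC is the one reported working above; the other two are the
# suggestions from the same paragraph and are untested here.
cjk_font_for() {
  case "$1" in
    Darwin) printf '%s\n' "Songti SC" ;;
    Linux) printf '%s\n' "Noto Sans CJK SC" ;;
    MINGW*|MSYS*|CYGWIN*) printf '%s\n' "Microsoft YaHei" ;;
    *) return 1 ;;
  esac
}

# Usage:
#   pandoc notes.md -o notes.pdf --pdf-engine=xelatex \
#     -V CJKmainfont="$(cjk_font_for "$(uname -s)")"
```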

&lt;p&gt;The upside to all this fiddling is that it's front-loaded. Once the install and the font flags are sorted, the setup keeps working: the same command converts the same way next week and next month, so you pay the tax once and then mostly forget it's there. And once configured, Pandoc is unbeatable for repetition. This converts a whole folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;f &lt;span class="k"&gt;in &lt;/span&gt;docs/&lt;span class="k"&gt;*&lt;/span&gt;.md&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do &lt;/span&gt;pandoc &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;f&lt;/span&gt;&lt;span class="p"&gt;%.md&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.pdf"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Markdown to PDF is part of a build pipeline or a nightly job, learn Pandoc and don't look back.&lt;/p&gt;
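
&lt;p&gt;For a nightly job you usually want to know which files failed rather than have the loop carry on silently. Here is one way the folder loop could be hardened; &lt;code&gt;pdf_path_for&lt;/code&gt; and &lt;code&gt;convert_folder&lt;/code&gt; are illustrative names of mine, and the only Pandoc call is the same &lt;code&gt;pandoc input -o output&lt;/code&gt; form used above.&lt;/p&gt;

```shell
# Turns "docs/notes.md" into "docs/notes.pdf".
pdf_path_for() {
  printf '%s\n' "${1%.md}.pdf"
}

# Converts every .md file directly inside the given folder.
# Prints a line for each file Pandoc could not convert and returns
# non-zero if there was at least one failure.
convert_folder() {
  dir=$1
  failed=0
  for f in "$dir"/*.md; do
    # With no matches the glob stays literal, so skip non-existent paths.
    [ -e "$f" ] || continue
    if ! pandoc "$f" -o "$(pdf_path_for "$f")"; then
      printf 'failed: %s\n' "$f"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Usage:
#   convert_folder docs
```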

&lt;h2&gt;
  
  
  Option 2: VS Code, if you're already sitting in it
&lt;/h2&gt;

&lt;p&gt;Install the &lt;strong&gt;Markdown PDF&lt;/strong&gt; extension (the popular one is by yzane), open your file, right-click in the editor, and pick "Markdown PDF: Export (pdf)". That's the whole workflow, which is exactly the appeal.&lt;/p&gt;

&lt;p&gt;Under the hood it prints the page with a headless Chromium browser, which the extension downloads on first use, so expect the first export to take a while. The browser engine is good news for output quality though: you get familiar GitHub-style rendering and code highlighting without configuring anything.&lt;/p&gt;

&lt;p&gt;The friction shows up when you want it to look different. Custom styling means writing CSS files and pointing the &lt;code&gt;markdown-pdf.styles&lt;/code&gt; setting at them, and controlling where pages break means adding CSS rules like &lt;code&gt;page-break-after&lt;/code&gt; to your document. Converting a pile of files is also awkward, since everything is built around the editor's one-file-at-a-time flow.&lt;/p&gt;

&lt;p&gt;For the occasional "send this doc to someone" moment while you're coding anyway, it's the path of least resistance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Option 3: your browser, when you just want the PDF
&lt;/h2&gt;

&lt;p&gt;This is the one I built. &lt;a href="https://pdfmarkdown.app/markdown-to-pdf" rel="noopener noreferrer"&gt;pdfmarkdown.app/markdown-to-pdf&lt;/a&gt; runs in your browser: paste your Markdown (or drop a &lt;code&gt;.md&lt;/code&gt; file, or a &lt;code&gt;.zip&lt;/code&gt; of Markdown plus its images) and the pages build live in front of you, exactly as they'll export. Free, no signup.&lt;/p&gt;

&lt;p&gt;The part I obsessed over is page breaks. The classic failure of quick converters is a table sliced in half across a page edge, or a heading stranded alone at the bottom of page 3 while its section starts on page 4. Here the layout keeps tables, code blocks and figures whole, and because the preview is the actual paginated document, you see any problem before you download rather than after. Long code lines wrap inside the block instead of running off the right edge (a place Pandoc's defaults will spill on you), math renders properly, and there are five themes (Clean, Editorial, Academic, Compact, Technical) to match the document to its reader.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fn87suzfzcipe4uaa1azt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fn87suzfzcipe4uaa1azt.png" alt="Two pages from the browser tool side by side: a code block that doesn't fit at the bottom of page 1 is left out and moved whole to the top of page 2, instead of being split across the page edge." width="800" height="578"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Smart page breaks in action: a block that won't fit is moved whole to the next page rather than sliced across the page edge. Tables, code and figures stay intact.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fecvx74291q9r7etngj0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fecvx74291q9r7etngj0p.png" alt="A PDF exported from the browser tool: a typeset integral equation above a dark code block where a long comment line wraps inside the block instead of overflowing the page." width="800" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Exported straight from the browser: math typeset properly, and a long code line wrapped inside the block instead of spilling off the page edge.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One thing I built specifically for the share-it-on-a-phone case from the top of this post: a &lt;strong&gt;Phone page size&lt;/strong&gt;. Most PDFs are A4 or Letter, which on a phone means tiny pinch-to-zoom text. The Phone size lays the page out tall and narrow so the text comes out big and readable on a phone screen with no zooming, which is often exactly the device the person you're sending it to is holding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0pswcvutpzgtuhjhv5in.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0pswcvutpzgtuhjhv5in.png" alt="The Markdown to PDF tool with the page-size selector set to Phone, and a live preview rendering Chinese, Japanese, Russian, German and French text with color emoji on a narrow, phone-shaped page." width="800" height="435"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Switch the page size to Phone for a tall, narrow PDF that reads on a phone without zooming. Whatever the script (CJK, Cyrillic, accented Latin) plus emoji, it renders with no font setup.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Plenty of other web converters do Markdown to PDF, from the long-running &lt;a href="https://www.markdowntopdf.com/" rel="noopener noreferrer"&gt;markdowntopdf.com&lt;/a&gt; to a steady stream of newer ones. If you go that route, one piece of advice from reading a year's worth of user threads while researching this space: judge the exported file, not the preview. The most common complaint about web converters, by far, is a beautiful preview that exports to a broken PDF, with bold text gone, links dead, or CJK and emoji missing. That's exactly why the preview here &lt;em&gt;is&lt;/em&gt; the paginated document: what you see is what downloads.&lt;/p&gt;

&lt;p&gt;Honest boundaries: it's a web page, not a pipeline. If you need two hundred files converted on a schedule, that's Pandoc territory today. Browser-based batch is on my mind, though, and so are other gaps (Mermaid diagrams, say). If there's something you'd use that it doesn't do yet, &lt;a href="mailto:hey@pdfmarkdown.app?subject=Markdown%20to%20PDF"&gt;tell me what you need&lt;/a&gt;; real use cases are what move a feature up the list. And if you're going the &lt;em&gt;other&lt;/em&gt; direction, turning a PDF into Markdown, that's the main thing &lt;a href="https://pdfmarkdown.app" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt; does.&lt;/p&gt;
&lt;h2&gt;
  
  
  Already writing in Obsidian or Typora?
&lt;/h2&gt;

&lt;p&gt;Then you may not need a converter at all. Both can export the current document to PDF directly (in Obsidian it's the "Export to PDF" command), and for a quick whole-document export that's usually enough. The ceiling is control: Obsidian exports the entire note whether you want all of it or not, and in both, fine-tuning the look or the page breaks means digging into custom CSS. When you hit that ceiling, the three routes above give you more room.&lt;/p&gt;
&lt;h2&gt;
  
  
  Turning a README (or any GitHub doc) into a PDF
&lt;/h2&gt;

&lt;p&gt;This one comes up constantly: a &lt;code&gt;README.md&lt;/code&gt; or a docs folder has to go to a client or an auditor who would be confused by a GitHub link. GitHub has no export-to-PDF button, so you have two options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With Pandoc&lt;/strong&gt;, tell it the input is GitHub-flavored Markdown so tables and task lists survive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pandoc README.md &lt;span class="nt"&gt;-f&lt;/span&gt; gfm &lt;span class="nt"&gt;-o&lt;/span&gt; README.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In the browser&lt;/strong&gt;, paste the raw file into &lt;a href="https://pdfmarkdown.app/markdown-to-pdf" rel="noopener noreferrer"&gt;pdfmarkdown.app/markdown-to-pdf&lt;/a&gt;. If the README references local images, zip the folder and drop the zip in so the images resolve.&lt;/p&gt;

&lt;p&gt;Either way, consider deleting the badge row first (the little build-status shields at the top). Badges are made for repo pages and rarely make sense in a document.&lt;/p&gt;
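
&lt;p&gt;On the Pandoc route, the badge row can be dropped on the way in instead of editing the file. This sketch assumes the badges are served from &lt;code&gt;img.shields.io&lt;/code&gt;, which is common but not universal, and &lt;code&gt;strip_badges&lt;/code&gt; is a name I made up; adjust the pattern to whatever your README actually uses.&lt;/p&gt;

```shell
# Deletes every line that mentions img.shields.io, on the assumption that
# such lines are badge rows. Reads Markdown on stdin, writes to stdout.
strip_badges() {
  sed '/img\.shields\.io/d'
}

# Usage (Pandoc reads standard input when no input file is named):
#   cat README.md | strip_badges | pandoc -f gfm -o README.pdf
```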

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How do I convert a Markdown file to PDF without installing anything?&lt;/strong&gt;&lt;br&gt;
Use a free online converter that runs in your browser. &lt;a href="https://pdfmarkdown.app/markdown-to-pdf" rel="noopener noreferrer"&gt;pdfmarkdown.app/markdown-to-pdf&lt;/a&gt; needs no signup, and shows you the paginated result live before you download it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I convert a Markdown table to PDF without it breaking?&lt;/strong&gt;&lt;br&gt;
Tables are where most converters stumble: wide ones get their right edge cut off, or the rows collapse into a mess on export. Pandoc handles them well if you pass &lt;code&gt;-f gfm&lt;/code&gt;; in the browser, &lt;a href="https://pdfmarkdown.app/markdown-to-pdf" rel="noopener noreferrer"&gt;pdfmarkdown.app/markdown-to-pdf&lt;/a&gt; keeps each table whole and won't slice one across a page edge. Whatever you use, judge the downloaded file, not the on-screen preview.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does Pandoc fail with "pdflatex not found"?&lt;/strong&gt;&lt;br&gt;
Pandoc delegates PDF generation to a LaTeX engine that isn't installed yet. Install a TeX distribution (TinyTeX if you want small, TeX Live or MacTeX if you want complete), or point &lt;code&gt;--pdf-engine&lt;/code&gt; at an engine you already have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I convert a GitHub README to PDF?&lt;/strong&gt;&lt;br&gt;
GitHub itself can't do it. Either run &lt;code&gt;pandoc README.md -f gfm -o README.pdf&lt;/code&gt; on the command line (the &lt;code&gt;-f gfm&lt;/code&gt; flag keeps GitHub-style tables intact), or paste the raw Markdown into a browser converter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the best way to batch convert many Markdown files to PDF?&lt;/strong&gt;&lt;br&gt;
Pandoc in a shell loop: &lt;code&gt;for f in *.md; do pandoc "$f" -o "${f%.md}.pdf"; done&lt;/code&gt;. Browser tools and editor extensions are built around one document at a time.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Jerome, the builder of &lt;a href="https://pdfmarkdown.app" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt;, a free, browser-based PDF↔Markdown tool. Two of the three options above aren't mine, and I genuinely reach for Pandoc when I'm batch-converting. If I got something wrong, tell me at &lt;a href="mailto:hey@pdfmarkdown.app"&gt;hey@pdfmarkdown.app&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>markdown</category>
      <category>pdf</category>
      <category>pandoc</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Your AI agent can't grep a PDF, and it's burning your tokens 🔥</title>
      <dc:creator>Jerome</dc:creator>
      <pubDate>Fri, 12 Jun 2026 14:41:17 +0000</pubDate>
      <link>https://dev.to/jeromebuilds/your-ai-agent-cant-grep-a-pdf-and-its-burning-your-tokens-1532</link>
      <guid>https://dev.to/jeromebuilds/your-ai-agent-cant-grep-a-pdf-and-its-burning-your-tokens-1532</guid>
      <description>&lt;p&gt;Your coding agent can &lt;code&gt;grep&lt;/code&gt; your whole repo in milliseconds. It can't treat a PDF the same way.&lt;/p&gt;

&lt;p&gt;A PDF is not AI-friendly by default. Even when it contains selectable text, the structure that matters to an agent often gets lost or has to be guessed back: reading order, tables, formulas, captions, and figures. That extraction is lossy, and it is not free.&lt;/p&gt;

&lt;p&gt;Here's what's going on under the hood, and why converting once to clean Markdown is the fix.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclosure:&lt;/strong&gt; I build &lt;a href="https://pdfmarkdown.app" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt;, an in-browser PDF→Markdown converter, so weigh that accordingly. I've kept the claims checkable, so test them yourself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  A PDF is a picture, not text
&lt;/h2&gt;

&lt;p&gt;Most PDFs don't store your sentences. They store where each glyph sits on the page. "Married" might be saved as a run of positioned glyphs with no record that they form a word, that the word belongs to a particular paragraph, or that the left column should be read before the right one. (Tagged PDFs &lt;em&gt;can&lt;/em&gt; carry logical structure and reading order, but in the wild they're rare or unreliable, so tools can't count on them.)&lt;/p&gt;

&lt;p&gt;A human eye reassembles all that instantly. Software has to &lt;em&gt;guess&lt;/em&gt; it back, and that guessing is where things break.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b7ockd3nyi95ukbz8fg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b7ockd3nyi95ukbz8fg.png" alt="A PDF stores letters as scattered x,y coordinates with no order; Markdown stores them as ordered, structured lines." width="799" height="333"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A PDF knows where each glyph sits, not the order it should be read in. Markdown stores the order and the structure, which is exactly what a model needs.&lt;/em&gt;&lt;/p&gt;
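
&lt;p&gt;To make the guessing concrete, here's a minimal Python sketch. The &lt;code&gt;Glyph&lt;/code&gt; class, the &lt;code&gt;glyphs_to_lines&lt;/code&gt; helper and both thresholds are made up for illustration; real extractors use far more careful heuristics, and this one only handles a single column.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float  # horizontal position on the page
    y: float  # vertical position, measured from the top of the page


def glyphs_to_lines(glyphs, line_tolerance=2.0, space_gap=4.0):
    """Guess text lines from positioned glyphs (illustrative heuristic only)."""
    lines = []  # each entry: [y of the line's first glyph, glyphs on that line]
    for glyph in sorted(glyphs, key=lambda g: g.y):
        # Same line if the vertical offset from the line's first glyph is small.
        if lines and line_tolerance >= abs(glyph.y - lines[-1][0]):
            lines[-1][1].append(glyph)
        else:
            lines.append([glyph.y, [glyph]])

    text_lines = []
    for _, members in lines:
        members.sort(key=lambda g: g.x)
        text = members[0].char
        for prev, cur in zip(members, members[1:]):
            # A wide horizontal gap is the only evidence of a word break.
            if cur.x - prev.x > space_gap:
                text += " "
            text += cur.char
        text_lines.append(text)
    return text_lines


# Glyphs arrive in no particular order, the way a PDF may store them.
scrambled = [
    Glyph("D", 3, 22), Glyph("o", 13, 10), Glyph("H", 0, 10),
    Glyph("F", 6, 22), Glyph("t", 10, 10), Glyph("P", 0, 22), Glyph("i", 3, 10),
]
print(glyphs_to_lines(scrambled))  # ['Hi to', 'PDF']
```

&lt;p&gt;Every word break and line break in that output is inferred from coordinates. Nudge either threshold and the same glyphs come out as different text.&lt;/p&gt;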

&lt;h2&gt;
  
  
  The five places it breaks
&lt;/h2&gt;

&lt;p&gt;When a converter (or a model) takes that guess, five things tend to fall apart, and they're the parts that carry the actual meaning:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhngi9sfk5zvjoozewwij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhngi9sfk5zvjoozewwij.png" alt="Five ways PDFs break for AI: scanned pages are pure images, multi-column reading order scrambles, tables collapse into one line, formulas turn to gibberish, and figures get dropped." width="799" height="373"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The five breakpoints: scanned pages, multi-column order, tables, formulas, and images.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scanned pages are just images.&lt;/strong&gt; No text layer at all. Without OCR, the model "sees" a photo and quietly makes things up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-column pages read in the wrong order.&lt;/strong&gt; A two-column paper gets stitched left-half-line then right-half-line, so sentences interleave into nonsense.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tables collapse.&lt;/strong&gt; Rows and columns flatten into one run-on line. The number that was under "2024" ends up floating next to a label from a different row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Formulas turn to gibberish.&lt;/strong&gt; &lt;code&gt;E = mc²&lt;/code&gt; becomes &lt;code&gt;E mc2&lt;/code&gt;, subscripts and superscripts drift, and an equation the paper is &lt;em&gt;about&lt;/em&gt; becomes unreadable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Figures lose their meaning.&lt;/strong&gt; A chart gets dropped, or at best pulled out as a bare image. In a text or Markdown pipeline (RAG, search, an agent grepping over text), that image carries no meaning. A vision model could look at it, but your retrieval index and your &lt;code&gt;grep&lt;/code&gt; can't.&lt;/li&gt;
&lt;/ul&gt;
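
&lt;p&gt;The multi-column interleaving is easy to reproduce. In this rough sketch, words are hypothetical &lt;code&gt;(text, x, y)&lt;/code&gt; tuples and &lt;code&gt;gutter_x&lt;/code&gt; is handed in as a known value. That's an assumption: in a real converter, finding the gutter is the hard part.&lt;/p&gt;

```python
def naive_order(words):
    """Read top-to-bottom, left-to-right across the full page width."""
    ordered = sorted(words, key=lambda item: (item[2], item[1]))
    return [text for text, x, y in ordered]


def column_aware_order(words, gutter_x):
    """Read the whole left column first, then the whole right column."""
    left = [item for item in words if gutter_x > item[1]]
    right = [item for item in words if item[1] >= gutter_x]

    def by_position(item):
        return (item[2], item[1])

    ordered = sorted(left, key=by_position) + sorted(right, key=by_position)
    return [text for text, x, y in ordered]


# Two short columns: (text, x, y), with y growing down the page.
page = [
    ("Tables", 0, 0), ("Figures", 300, 0),
    ("collapse.", 0, 10), ("vanish.", 300, 10),
]
print(" ".join(naive_order(page)))              # Tables Figures collapse. vanish.
print(" ".join(column_aware_order(page, 150)))  # Tables collapse. Figures vanish.
```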

&lt;h2&gt;
  
  
  The fix: clean Markdown is the format AI actually reads well
&lt;/h2&gt;

&lt;p&gt;Markdown is plain text with light, explicit structure: &lt;code&gt;#&lt;/code&gt; for headings, real rows and columns for tables, fenced blocks for code. The plainness is the whole point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The structure is &lt;strong&gt;stated, not guessed.&lt;/strong&gt; The reading order, the table shape and the hierarchy are all written down.&lt;/li&gt;
&lt;li&gt;It's &lt;strong&gt;greppable and token-cheap.&lt;/strong&gt; It's plain text, so an agent can search it line by line, and there's no binary cruft for a model to wade through.&lt;/li&gt;
&lt;li&gt;Models were &lt;strong&gt;trained on mountains of it&lt;/strong&gt; (every README, every wiki, every docs site), so they parse it natively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Convert the PDF &lt;em&gt;once&lt;/em&gt; into clean Markdown and you've done the hard, lossy extraction a single time, deliberately, instead of making every tool redo it (badly) on every query.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where llms.txt fits in
&lt;/h3&gt;

&lt;p&gt;This is the same idea behind &lt;strong&gt;&lt;code&gt;llms.txt&lt;/code&gt;&lt;/strong&gt;, an emerging convention where a site publishes a plain-Markdown map of its important content so AI tools can read it directly, instead of fighting through rendered HTML or PDFs. &lt;em&gt;If you want AI to read something, hand it clean Markdown.&lt;/em&gt; A PDF on your disk and a webpage an AI crawls have the exact same problem, and the exact same fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Turning a PDF into AI-ready Markdown: what to watch
&lt;/h2&gt;

&lt;p&gt;If you convert a PDF, judge the result on the parts that actually break, not on whether the first paragraph looks fine. Check four things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Did the tables survive&lt;/strong&gt; as real rows and columns?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Did the formulas survive&lt;/strong&gt; as readable math?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Were scanned pages&lt;/strong&gt; recognized, or silently handed back as garbage?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Did the figures make it&lt;/strong&gt; into the output at all?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the bar I hold &lt;a href="https://pdfmarkdown.app" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt; to: it runs in your browser, shows you the original PDF and the Markdown side by side so you can check those four things before you trust the output, and when a page is genuinely hard (a scan with no text layer) it says so up front instead of faking it. It's a floor I can show you, not a "perfect conversion" promise, because nobody can honestly make that one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbavtosd36gjxosbos61u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbavtosd36gjxosbos61u.png" alt="pdfmarkdown.app showing a PDF and its converted Markdown side by side, with the figure, caption and equation preserved." width="799" height="516"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Original PDF on the left, generated Markdown on the right. This is the&lt;/em&gt; Attention Is All You Need &lt;em&gt;paper: the figure keeps its caption, and the equation comes through as real math.&lt;/em&gt;&lt;/p&gt;
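
&lt;p&gt;Three of the four checks listed above can be roughly automated on the Markdown side. Here's a sketch, assuming the converter emits pipe tables, dollar-delimited LaTeX and standard image syntax (not every tool does). The &lt;code&gt;audit_markdown&lt;/code&gt; name is mine, and the scanned-page check is left out because it needs the original PDF.&lt;/p&gt;

```python
import re


def is_table_separator(line):
    """True for a pipe-table separator row such as |---|:---:|."""
    stripped = line.strip()
    return (
        "|" in stripped
        and "-" in stripped
        and set(stripped).issubset(set("|-: "))
    )


def audit_markdown(markdown):
    """Count rough signals for tables, formulas and figures."""
    tables = sum(1 for line in markdown.splitlines() if is_table_separator(line))
    # Display math ($$...$$) or inline math ($...$); prices like $5 can fool this.
    formulas = re.findall(r"\$\$.+?\$\$|\$[^$\n]+?\$", markdown, flags=re.DOTALL)
    figures = re.findall(r"!\[[^\]]*\]\([^)]+\)", markdown)
    return {"tables": tables, "formulas": len(formulas), "figures": len(figures)}


sample = """# Results

| Year | Revenue |
|------|---------|
| 2024 | 10 |

Energy is $E = mc^2$.

![Figure 1: loss curve](fig1.png)
"""
print(audit_markdown(sample))  # {'tables': 1, 'formulas': 1, 'figures': 1}
```

&lt;p&gt;A zero where the PDF clearly had a table or an equation is a cue to look at that page by eye. The counts only tell you something survived, not that it's right.&lt;/p&gt;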

&lt;h2&gt;
  
  
  "But models keep getting smarter, won't this just go away?"
&lt;/h2&gt;

&lt;p&gt;Maybe the &lt;em&gt;accuracy&lt;/em&gt; improves. Two things don't go away, and they get &lt;strong&gt;more&lt;/strong&gt; important as agents take over, not less:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tokens.&lt;/strong&gt; A PDF has to be parsed into text before a model can do anything with it. In the naive pattern (attach the PDF to each chat) you re-pay that parse on every turn. Prompt caching and RAG soften it, but they're working around the same root cause: the PDF was never text to begin with. Convert it once to Markdown and the parse is done for good: cheap to embed, cheap to search, cheap to ask about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Agents read on demand.&lt;/strong&gt; Claude Code and Codex don't slurp whole files into context; they &lt;code&gt;grep&lt;/code&gt; and search for the few lines they need, when they need them. A PDF can't be searched that way without first extracting it to text, which is exactly "convert it to Markdown." Do it once and your agent treats it like any other file in the repo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fizk0nq6h01s30o2n2sum.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fizk0nq6h01s30o2n2sum.png" alt="An agent greps a Markdown file and pulls only the three relevant lines; with a PDF it has to extract the whole document to text first." width="800" height="347"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;How an agent actually reads: Markdown lets it pull the three lines it needs. A PDF has to be decoded whole before it can search at all.&lt;/em&gt;&lt;/p&gt;
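
&lt;p&gt;Here's roughly what that on-demand read looks like. It's a sketch in Python, not what any particular agent actually runs; &lt;code&gt;grep_with_section&lt;/code&gt; and the sample report are hypothetical.&lt;/p&gt;

```python
def grep_with_section(markdown, needle):
    """Return (line_number, section_heading, line) for each matching line."""
    hits = []
    section = ""
    wanted = needle.lower()
    for number, line in enumerate(markdown.splitlines(), start=1):
        # Markdown states its structure, so tracking the heading is one check.
        # (A '#' comment inside a fenced code block would fool this sketch.)
        if line.startswith("#"):
            section = line.lstrip("#").strip()
        if wanted in line.lower():
            hits.append((number, section, line))
    return hits


report = """# Example report

## Setup
Install the tool and run it once.

## Results
Latency dropped after caching.
Caching also cut token usage.
"""
for hit in grep_with_section(report, "caching"):
    print(hit)
```

&lt;p&gt;Two lines come back, each tagged with the section it lives under, and nothing else from the file has to enter the context window.&lt;/p&gt;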

&lt;p&gt;So the trend runs opposite to the intuition. As AI shifts from &lt;em&gt;chatting with one document&lt;/em&gt; to &lt;em&gt;agents navigating a whole library of code and docs&lt;/em&gt;, the PDF becomes a bigger bottleneck, not a smaller one. Better models make the agent pattern more common, which makes clean Markdown more necessary, not less.&lt;/p&gt;

&lt;h2&gt;
  
  
  "I just keep my PDFs in Obsidian, do I still need this?"
&lt;/h2&gt;

&lt;p&gt;Especially then. A vault lives or dies on what you can search, link and fold into other notes, and a raw PDF sitting in it is a dead end: you can't &lt;code&gt;[[link]]&lt;/code&gt; to a heading inside it, can't pull one paragraph into a daily note, can't grep it. Convert it to Markdown and the PDF becomes a first-class note like everything else, readable by you and by any AI you point at your vault.&lt;/p&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Most PDFs store &lt;em&gt;where glyphs sit&lt;/em&gt;, not &lt;em&gt;what they say in what order&lt;/em&gt;, so anything reading one has to guess, and guesses worst on tables, formulas, multi-column pages, scans and figures.&lt;/li&gt;
&lt;li&gt;You can't &lt;code&gt;grep&lt;/code&gt; or embed a PDF until it's been extracted to text. Clean Markdown &lt;em&gt;is&lt;/em&gt; that text, with the structure intact: greppable, token-cheap, and what models read natively. &lt;code&gt;llms.txt&lt;/code&gt; is the same idea for the web.&lt;/li&gt;
&lt;li&gt;Smarter models don't retire the problem. Token cost and agent-style on-demand reading make converting-once-to-Markdown &lt;em&gt;more&lt;/em&gt; valuable over time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Convert a PDF to clean Markdown once, glance over it to confirm the tables and formulas came through, and from then on every tool, model and agent you hand it to reads the real thing instead of guessing at the original.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>markdown</category>
    </item>
    <item>
      <title>The Best PDF to Markdown Tools in 2026 (Honestly Compared)</title>
      <dc:creator>Jerome</dc:creator>
      <pubDate>Wed, 10 Jun 2026 14:23:24 +0000</pubDate>
      <link>https://dev.to/jeromebuilds/the-best-pdf-to-markdown-tools-in-2026-honestly-compared-1m0k</link>
      <guid>https://dev.to/jeromebuilds/the-best-pdf-to-markdown-tools-in-2026-honestly-compared-1m0k</guid>
      <description>&lt;p&gt;Turning a PDF into Markdown sounds simple until you try it on a real document. The text comes out fine. Then the tables collapse into mush, the formulas turn to gibberish, the figures vanish, and a two-column research paper reads in the wrong order. Markdown is how documents get fed to AI tools, pasted into notes, and stored in wikis, so "mostly right" usually isn't good enough.&lt;/p&gt;

&lt;p&gt;I compared the tools people actually reach for, judged on the parts that break: &lt;strong&gt;tables, formulas, images, scanned pages, reading order, and how much setup it takes to get there.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Upfront disclosure:&lt;/strong&gt; I'm the maker of &lt;a href="https://pdfmarkdown.app" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt;, one of the tools below, so factor that in. I've tried hard to be fair; every other tool here is genuinely good at something, and I say so. Check the claims yourself; tools change.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Just want clean Markdown without installing anything?&lt;/strong&gt; Use a browser tool like &lt;a href="https://pdfmarkdown.app" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt;: private, no signup, and you can see what you're getting before you trust it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A developer building a RAG or document pipeline?&lt;/strong&gt; Reach for an open-source library: &lt;strong&gt;Marker&lt;/strong&gt;, &lt;strong&gt;Docling&lt;/strong&gt;, or &lt;strong&gt;MarkItDown&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mostly heavy math, scientific papers, or handwriting?&lt;/strong&gt; &lt;strong&gt;Mathpix&lt;/strong&gt; is the specialist.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An occasional, mixed-format conversion?&lt;/strong&gt; A general converter like &lt;strong&gt;CloudConvert&lt;/strong&gt; is fine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's no single winner. The right pick depends on whether you live in a terminal, and what's actually in your PDFs.&lt;/p&gt;

&lt;h2&gt;
  
  
  pdfmarkdown.app: best for non-developers who want it clean and private
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; anyone who wants clean Markdown in seconds, without a command line or an upload.&lt;/p&gt;

&lt;p&gt;This is mine, so weigh it accordingly. The idea is to do the hard parts (tables, formulas rendered with real math typesetting, images, stripping page headers and footers) entirely in your browser, so the file never leaves your device. The part I care most about: you see the original PDF and the Markdown &lt;strong&gt;side by side&lt;/strong&gt;, and when a page is hard to read cleanly, like a scanned page with no real text layer, it tells you up front rather than quietly handing you garbage. So you can check it before you paste it somewhere.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn7fbcsh63nsva1j3891.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsn7fbcsh63nsva1j3891.gif" alt="Side-by-side Preview" width="600" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;▶ &lt;strong&gt;&lt;a href="https://pdfmarkdown.app" rel="noopener noreferrer"&gt;Try it live at pdfmarkdown.app&lt;/a&gt;&lt;/strong&gt;: drop in a PDF and watch it turn into Markdown side by side, with the original on the left and the generated Markdown on the right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; runs in the browser (private, no signup, free), keeps tables and formulas readable, shows you the result side-by-side, honest about scanned / hard pages instead of faking them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; it's a web app, not a scriptable library; if you want to batch thousands of files in a pipeline, an open-source tool fits better. Formulas mostly come through as real math, but the occasional one still trips it up. And very hard scanned documents are hard for everyone, me included.&lt;/p&gt;

&lt;h2&gt;
  
  
  MarkItDown: best free tool for developers prepping files for an LLM
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; developers who want a quick, free way to turn many file types into Markdown for an LLM.&lt;/p&gt;

&lt;p&gt;Microsoft's open-source &lt;a href="https://github.com/microsoft/markitdown" rel="noopener noreferrer"&gt;MarkItDown&lt;/a&gt; is a Python library and CLI that converts PDFs (plus Office files, images, audio and more) into Markdown aimed squarely at language models. It's fast, free, and trivial to drop into a script.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; open-source, handles many formats, made for LLM input, easy to automate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; it's a library, so there's no UI and no preview; you don't see problems until later. Its handling of complex tables, dense math and scanned pages is basic compared with the heavier extractors below.&lt;/p&gt;
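
&lt;p&gt;For a sense of what dropping MarkItDown into a script looks like, here's a minimal sketch. It assumes the &lt;code&gt;markitdown&lt;/code&gt; package is installed with PDF support (check the project's README for the extras your version needs). The &lt;code&gt;pdf_to_markdown&lt;/code&gt; helper and its &lt;code&gt;converter&lt;/code&gt; parameter are my own naming, not part of the library.&lt;/p&gt;

```python
from pathlib import Path


def pdf_to_markdown(pdf_path, converter=None):
    """Convert one PDF to a .md file next to it and return the new path."""
    if converter is None:
        # Imported lazily so the helper can be tested without the package.
        from markitdown import MarkItDown

        converter = MarkItDown()
    result = converter.convert(str(pdf_path))
    out_path = Path(pdf_path).with_suffix(".md")
    out_path.write_text(result.text_content, encoding="utf-8")
    return out_path
```

&lt;p&gt;Calling &lt;code&gt;pdf_to_markdown("paper.pdf")&lt;/code&gt; would write &lt;code&gt;paper.md&lt;/code&gt; alongside the original. Nothing in that flow shows you the result, so open the output and spot-check the tables before you feed it to a model.&lt;/p&gt;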

&lt;h2&gt;
  
  
  Marker: best open-source quality for complex PDFs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; developers who want the highest-fidelity open-source conversion and can run Python.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/datalab-to/marker" rel="noopener noreferrer"&gt;Marker&lt;/a&gt; is one of the strongest open-source PDF→Markdown converters: it handles tables, equations and images well, restores reading order, and can optionally use an LLM to boost accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; excellent extraction quality, good with equations and tables, actively developed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; real setup: Python, and ideally a GPU for speed. It's a developer tool, not something you'd hand a non-technical colleague.&lt;/p&gt;

&lt;h2&gt;
  
  
  Docling: best for RAG and document pipelines
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; teams building retrieval-augmented generation (RAG) or structured document workflows.&lt;/p&gt;

&lt;p&gt;IBM's open-source &lt;a href="https://github.com/docling-project/docling" rel="noopener noreferrer"&gt;Docling&lt;/a&gt; focuses on document &lt;em&gt;understanding&lt;/em&gt;: clean structure, solid tables, and exports designed to feed downstream AI pipelines. If your endpoint is a vector database rather than a human reader, it's a strong fit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; structured output, good tables, pipeline- and RAG-oriented, open-source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; developer-oriented; overkill if you just want to read one PDF as Markdown.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mathpix: best for heavy math and scientific papers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; scientific and technical documents that are mostly equations, or even handwriting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://mathpix.com/" rel="noopener noreferrer"&gt;Mathpix&lt;/a&gt; is the specialist for math. Its OCR for formulas, including handwritten ones, is best in class, which makes it the go-to for STEM papers and problem sets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; outstanding formula and scientific OCR, handles handwriting, polished.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; commercial and paid, with usage limits on the free tier; narrower than a general converter if your documents are mostly prose and tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  CloudConvert &amp;amp; general web converters: best for the occasional job
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; a one-off conversion where you don't need perfect fidelity.&lt;/p&gt;

&lt;p&gt;General converters like &lt;a href="https://cloudconvert.com/" rel="noopener noreferrer"&gt;CloudConvert&lt;/a&gt; handle dozens of formats including PDF→Markdown. They're convenient when you already use them for other conversions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; convenient, many formats, no install.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; it's built for shuffling file formats, not for document fidelity. In my testing, images were dropped entirely and most tables and formulas came out garbled. Files are also uploaded to a server (a privacy consideration for sensitive documents), and volume is gated by credits or limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on Pandoc, Adobe, and heavier tools
&lt;/h2&gt;

&lt;p&gt;A few names that come up a lot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pandoc&lt;/strong&gt; is the universal document converter, but it goes &lt;em&gt;from&lt;/em&gt; Markdown &lt;em&gt;to&lt;/em&gt; other formats far better than the reverse; it isn't really built to read an arbitrary PDF into clean Markdown. For &lt;a href="https://pdfmarkdown.app/markdown-to-pdf" rel="noopener noreferrer"&gt;Markdown → PDF&lt;/a&gt; it's excellent; for PDF → Markdown, look elsewhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adobe&lt;/strong&gt; (Acrobat and the PDF Services API) extracts accurately and is built for enterprises. The API has a free tier, but it's developer- and business-oriented, aimed at production workflows rather than a quick one-off conversion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The developer heavyweights&lt;/strong&gt; (&lt;strong&gt;MinerU&lt;/strong&gt;, &lt;strong&gt;LlamaParse&lt;/strong&gt; and &lt;strong&gt;Mistral OCR&lt;/strong&gt;) are increasingly used in serious RAG and document pipelines. I didn't make them main picks because this guide leans toward simpler, no-setup options, but if you're building a production pipeline they're worth evaluating.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to choose
&lt;/h2&gt;

&lt;p&gt;A quick decision guide:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you are…&lt;/th&gt;
&lt;th&gt;Start with&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A non-developer who wants it clean, private and fast&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://pdfmarkdown.app" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt; or another browser-based tool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A developer prepping files for an LLM, fast&lt;/td&gt;
&lt;td&gt;MarkItDown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A developer who needs the best open-source quality&lt;/td&gt;
&lt;td&gt;Marker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Building a RAG / document pipeline&lt;/td&gt;
&lt;td&gt;Docling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Working mostly with heavy math or handwriting&lt;/td&gt;
&lt;td&gt;Mathpix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Doing a one-off, mixed-format conversion&lt;/td&gt;
&lt;td&gt;CloudConvert&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What's the best free PDF to Markdown tool?&lt;/strong&gt;&lt;br&gt;
For non-developers, a browser-based tool like &lt;a href="https://pdfmarkdown.app" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt; is free and needs no signup. For developers, MarkItDown, Marker and Docling are all free and open-source, though Marker's license carries some commercial-use conditions worth checking before you ship it in a product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which PDF to Markdown tool keeps tables and formulas intact?&lt;/strong&gt;&lt;br&gt;
Tables and formulas are exactly where most tools fail. Among open-source options, Marker handles them best; for browser use, &lt;a href="https://pdfmarkdown.app" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt; renders real math and keeps tables readable; for math-heavy documents specifically, Mathpix leads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is it safe to convert a confidential PDF online?&lt;/strong&gt;&lt;br&gt;
It depends on the tool. Most web converters upload your file to a server. Browser-based tools like &lt;a href="https://pdfmarkdown.app" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt; do the work on your own device, so the file never leaves it. That's the safer choice for sensitive documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the best PDF to Markdown tool for RAG?&lt;/strong&gt;&lt;br&gt;
For retrieval-augmented generation, Docling and Marker are built for structured, pipeline-friendly output. MarkItDown is a lighter, faster option when you just need usable Markdown quickly.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Jerome, the builder of &lt;a href="https://pdfmarkdown.app" rel="noopener noreferrer"&gt;pdfmarkdown.app&lt;/a&gt;, a free, browser-based PDF↔Markdown tool. I included direct competitors and tried to credit each one fairly. If you think I got a call wrong, tell me at &lt;a href="mailto:hey@pdfmarkdown.app"&gt;hey@pdfmarkdown.app&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>markdown</category>
      <category>pdf</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
