<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dmitry Petrakov</title>
    <description>The latest articles on DEV Community by Dmitry Petrakov (@dimlight).</description>
    <link>https://dev.to/dimlight</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4007809%2Fbb725926-329b-4cb6-b72c-be8b6b61b2a3.png</url>
      <title>DEV Community: Dmitry Petrakov</title>
      <link>https://dev.to/dimlight</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dimlight"/>
    <language>en</language>
    <item>
      <title>What actually breaks when you turn PDFs into Markdown</title>
      <dc:creator>Dmitry Petrakov</dc:creator>
      <pubDate>Mon, 29 Jun 2026 12:05:08 +0000</pubDate>
      <link>https://dev.to/dimlight/what-actually-breaks-when-you-turn-pdfs-into-markdown-148h</link>
      <guid>https://dev.to/dimlight/what-actually-breaks-when-you-turn-pdfs-into-markdown-148h</guid>
      <description>&lt;p&gt;"Convert a PDF to Markdown" sounds like a solved problem. Take the text out, turn headings into &lt;code&gt;#&lt;/code&gt;, turn tables into pipes, done.&lt;/p&gt;

&lt;p&gt;After building a converter for it, I have a less satisfying answer: the easy cases are easy, and the hard cases are not edge cases. They are the documents people actually care about – research papers, annual reports, invoices, scanned contracts, specs, and the table-heavy PDFs someone wants to feed into an LLM.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Disclosure: I build &lt;a href="https://pdf2md.dev/" rel="noopener noreferrer"&gt;pdf2md.dev&lt;/a&gt;, so I have skin in this game. This is not a benchmark claiming "we're the best." It is a breakdown of the failure modes I had to handle and the trade-offs I made – written so it's useful even if you never touch my tool and just want to evaluate your own.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;A PDF is not a document&lt;/h2&gt;

&lt;p&gt;The core problem is that a PDF does not usually contain "a document" the way Markdown, HTML, or DOCX does.&lt;/p&gt;

&lt;p&gt;It contains drawing instructions: put these glyphs at these coordinates, draw this line here, place this image there. The structure you see as a reader is reconstructed by your eyes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;that larger bold text is &lt;em&gt;probably&lt;/em&gt; a heading&lt;/li&gt;
&lt;li&gt;those aligned numbers are &lt;em&gt;probably&lt;/em&gt; a table&lt;/li&gt;
&lt;li&gt;that block on the left should be read before the block on the right&lt;/li&gt;
&lt;li&gt;that superscript belongs to a formula&lt;/li&gt;
&lt;li&gt;that scanned page has no text layer at all&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A converter has to rebuild all of that from layout, geometry, OCR, and heuristics. If it only "extracts text," it will work on the demo PDF and fall apart on the first real report.&lt;/p&gt;

&lt;p&gt;Here are the main things that break, roughly ordered by how often they bite.&lt;/p&gt;

&lt;h2&gt;1. Tables are not one problem&lt;/h2&gt;

&lt;p&gt;Simple tables are fine. If the PDF has a clean grid and each cell maps to one row and one column, Markdown is a good target:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;| Quarter | Revenue | Growth |
|---------|---------|--------|
| Q1      | $1.2M   | +8%    |
| Q2      | $1.4M   | +17%   |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Renders to:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Quarter&lt;/th&gt;
&lt;th&gt;Revenue&lt;/th&gt;
&lt;th&gt;Growth&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q1&lt;/td&gt;
&lt;td&gt;$1.2M&lt;/td&gt;
&lt;td&gt;+8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q2&lt;/td&gt;
&lt;td&gt;$1.4M&lt;/td&gt;
&lt;td&gt;+17%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The trouble starts when the table stops being a simple grid.&lt;/p&gt;

&lt;p&gt;Merged cells do not map cleanly to GitHub Flavored Markdown. Nested headers have a hierarchy Markdown tables cannot represent. Rotated tables add a reading-order problem before you even get to cell detection. Borderless tables are the worst of all, because the grid exists only as alignment.&lt;/p&gt;

&lt;p&gt;This is where converters quietly get dishonest. They return a table that &lt;em&gt;looks&lt;/em&gt; tidy but has shifted columns, duplicated headers, or numbers attached to the wrong labels. That is more dangerous than an obvious failure – especially if the output flows into an LLM or a RAG index, where nobody re-reads it.&lt;/p&gt;

&lt;p&gt;My rule for this class of problem: preserve as much structure as the target format can honestly express, and don't pretend Markdown can encode everything a PDF table visually implies. Straight grids come out ready to use. Complex financial or scientific tables may still need a visual check. That is less magical, but it is the difference between saving time and silently corrupting data.&lt;/p&gt;
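
&lt;p&gt;One way to make that rule concrete is a guard in front of the table writer. The sketch below is not the pdf2md.dev implementation: &lt;code&gt;Cell&lt;/code&gt;, &lt;code&gt;is_plain_grid&lt;/code&gt;, and &lt;code&gt;to_markdown&lt;/code&gt; are hypothetical names, and it assumes an earlier stage has already detected the cells and their spans. It emits a pipe table only for a plain grid; anything else comes back empty with a review flag instead of a tidy-looking guess.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    rowspan: int = 1
    colspan: int = 1


def is_plain_grid(rows):
    """True when every row has the same width and no cell spans rows or columns."""
    if not rows:
        return False
    width = len(rows[0])
    return all(
        len(row) == width and all(c.rowspan == 1 and c.colspan == 1 for c in row)
        for row in rows
    )


def to_markdown(rows):
    """Return (markdown, needs_review). Only plain grids become pipe tables."""
    if not is_plain_grid(rows):
        # Hypothetical policy: refuse to flatten spans into a tidy-looking grid.
        return None, True

    def line(cells):
        # Escape literal pipes so cell text cannot create extra columns.
        return "| " + " | ".join(c.text.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    separator = "|" + "|".join(" --- " for _ in header) + "|"
    return "\n".join([line(header), separator, *map(line, body)]), False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;What the caller does with a flagged table (emit inline HTML, attach a page image, or ask for a visual check) is a separate decision. The point of the sketch is only that the flag exists.&lt;/p&gt;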

&lt;h2&gt;2. Reading order is layout analysis, not text extraction&lt;/h2&gt;

&lt;p&gt;Academic papers, magazines, datasheets, and many reports use two or three columns. A naive extractor sorts text by vertical position and reads straight across the page, so lines from adjacent columns get spliced together:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;First line of column one first line of column two
second line of column one second line of column two
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The right behavior is to detect column boundaries, read each column top-to-bottom, then move on. That requires layout analysis – the text stream alone is not enough.&lt;/p&gt;
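
&lt;p&gt;A toy version of the difference, as a sketch under strong assumptions: the page has exactly two columns, the gutter position &lt;code&gt;gutter_x&lt;/code&gt; is already known, and coordinates use a top-left origin. &lt;code&gt;Block&lt;/code&gt;, &lt;code&gt;naive_order&lt;/code&gt;, and &lt;code&gt;column_order&lt;/code&gt; are illustrative names, not part of any library. Finding the gutter is the actual layout-analysis work, and this example skips it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y0: float  # top edge; y grows downward (top-left origin, an assumption)


def naive_order(blocks):
    """Sort by row first, then left to right: this interleaves the columns."""
    return sorted(blocks, key=lambda b: (b.y0, b.x0))


def column_order(blocks, gutter_x):
    """Read the left column top to bottom, then the right column.

    gutter_x is a hypothetical input: real layout analysis has to find it.
    """
    def centre(b):
        return (b.x0 + b.x1) / 2

    left = [b for b in blocks if gutter_x >= centre(b)]
    right = [b for b in blocks if centre(b) > gutter_x]
    return sorted(left, key=lambda b: b.y0) + sorted(right, key=lambda b: b.y0)


page = [
    Block("col1 line1", 50, 250, 100),
    Block("col2 line1", 300, 500, 100),
    Block("col1 line2", 50, 250, 120),
    Block("col2 line2", 300, 500, 120),
]
print([b.text for b in naive_order(page)])
print([b.text for b in column_order(page, gutter_x=275)])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first list alternates between the columns, which is the spliced output shown above. The second keeps each column together.&lt;/p&gt;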

&lt;p&gt;The same problem hits sidebars, captions, footnotes, running headers, and page numbers. A human ignores a repeated header automatically; a converter has to decide whether those fragments are content, metadata, or noise. Get it wrong and you don't just produce ugly Markdown – you change the meaning.&lt;/p&gt;

&lt;h2&gt;3. Formulas are reconstructed, not copied&lt;/h2&gt;

&lt;p&gt;Mathematical notation is a layout problem too. In a PDF, a formula is a set of glyphs placed carefully on the page: &lt;code&gt;∑&lt;/code&gt;, &lt;code&gt;√&lt;/code&gt;, superscripts, subscripts, fraction bars, Greek letters, spacing. Turning that back into something usable means producing LaTeX-like text:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;$$
\sum_{i=1}^{n} x_i
$$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Renders to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fepjzkv4cymxousapwh17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fepjzkv4cymxousapwh17.png" alt="Rendered formula: the sum from i=1 to n of x sub i" width="239" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the converter only sees characters in approximate order, an equation becomes a line of floating symbols – useless for documentation, search, or LLM context. This is why I don't trust regex-only PDF pipelines for technical documents. They're fine for plain text; formulas need the converter to understand visual structure.&lt;/p&gt;

&lt;h2&gt;4. Scanned PDFs change the entire pipeline&lt;/h2&gt;

&lt;p&gt;A scanned PDF may have no embedded text at all – it's just images of pages. Now the problem is OCR, with its own failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scan quality dominates everything&lt;/li&gt;
&lt;li&gt;skewed or low-contrast pages hurt recognition&lt;/li&gt;
&lt;li&gt;tiny text and dense tables are slow&lt;/li&gt;
&lt;li&gt;handwriting is not reliably recognized&lt;/li&gt;
&lt;li&gt;OCR produces &lt;em&gt;plausible-looking&lt;/em&gt; mistakes, which are the worst kind&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For printed or typeset text, good scans convert well. A sharp 300 DPI page with high contrast is a completely different input from a crooked phone photo of a faded fax.&lt;/p&gt;

&lt;p&gt;There's also a product decision every converter has to make: &lt;strong&gt;what happens when a long scan exceeds the processing budget?&lt;/strong&gt; Failing the whole job is simple to implement and a terrible experience. The behavior I chose is to return the Markdown produced within the budget and mark the job as &lt;strong&gt;truncated&lt;/strong&gt; – a partial result with an explicit signal, instead of losing everything. The signal is the important part. A partial result &lt;em&gt;without&lt;/em&gt; a truncation marker is just another form of silent data loss.&lt;/p&gt;
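
&lt;p&gt;The shape of that behavior fits in a few lines. This is a minimal sketch, not the service's code: &lt;code&gt;Result&lt;/code&gt;, &lt;code&gt;convert_with_budget&lt;/code&gt;, and the &lt;code&gt;convert_page&lt;/code&gt; callback are hypothetical names, and it assumes the budget is checked between pages rather than by interrupting a page mid-OCR.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
from dataclasses import dataclass, field


@dataclass
class Result:
    pages: list = field(default_factory=list)
    truncated: bool = False  # the explicit signal that output is partial


def convert_with_budget(pages, convert_page, budget_seconds, clock=time.monotonic):
    """Convert pages in order until the time budget runs out.

    Keeps whatever was produced and sets truncated=True if pages were left over.
    """
    deadline = clock() + budget_seconds
    result = Result()
    for page in pages:
        if clock() >= deadline:
            result.truncated = True
            break
        result.pages.append(convert_page(page))
    return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;clock&lt;/code&gt; parameter exists only so the budget logic can be tested with a fake clock. A caller that ignores &lt;code&gt;truncated&lt;/code&gt; is back to silent data loss, so the flag has to travel with the Markdown all the way to the user.&lt;/p&gt;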

&lt;h2&gt;5. Images are either content or noise&lt;/h2&gt;

&lt;p&gt;Images in PDFs are ambiguous. Sometimes they're essential – diagrams, charts, stamps, signatures. Sometimes they're decorative backgrounds. Sometimes the whole page is an image but the user wants text, not embedded base64.&lt;/p&gt;

&lt;p&gt;So "include images" is not one setting. The practical version is three different intents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;embed&lt;/strong&gt; images when the Markdown should be self-contained&lt;/li&gt;
&lt;li&gt;use &lt;strong&gt;placeholders&lt;/strong&gt; when the user wants clean text output&lt;/li&gt;
&lt;li&gt;run &lt;strong&gt;OCR&lt;/strong&gt; on scanned pages when text needs to be recovered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no universal best choice. Markdown headed for a knowledge base, for an LLM prompt, and for a legal archive calls for different output in each case.&lt;/p&gt;

&lt;h2&gt;6. The converter itself can fail before the Markdown does&lt;/h2&gt;

&lt;p&gt;The visible part of a converter is the Markdown. The part that decides whether you can trust it is job reliability, and those failures are boring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a conversion hangs forever&lt;/li&gt;
&lt;li&gt;a heavy OCR job runs out of memory&lt;/li&gt;
&lt;li&gt;a worker dies halfway through&lt;/li&gt;
&lt;li&gt;a job gets retried too aggressively&lt;/li&gt;
&lt;li&gt;one large file blocks everyone else&lt;/li&gt;
&lt;li&gt;the user closes the tab before the result is ready&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where a weekend script and a service diverge. My implementation ended up with a real job lifecycle:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3crig8g2weq7m96l9hhb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3crig8g2weq7m96l9hhb.png" alt="Conversion job lifecycle: queued to processing, then ready, canceled or error" width="799" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The system tracks each job, retries bounded failures, applies time budgets, deletes input files after processing, and keeps results only for a short retention window. Those limits are not glamorous, but they &lt;em&gt;are&lt;/em&gt; part of trust. A converter that accepts anything, promises instant results, and never explains retention isn't more user-friendly – it's just hiding the operational reality.&lt;/p&gt;
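
&lt;p&gt;As a rough illustration of what "a real job lifecycle" with bounded retries can look like, here is a small state machine. The five state names come from the diagram above; everything else is an assumption made for the sketch, including the &lt;code&gt;Job&lt;/code&gt; class, the &lt;code&gt;ALLOWED&lt;/code&gt; transition table, requeueing a failed attempt, allowing cancellation while queued, and the default of three attempts.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from enum import Enum


class State(Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    READY = "ready"
    CANCELED = "canceled"
    ERROR = "error"


# Assumed transition table; terminal states allow nothing further.
ALLOWED = {
    State.QUEUED: {State.PROCESSING, State.CANCELED},
    State.PROCESSING: {State.READY, State.CANCELED, State.ERROR, State.QUEUED},
    State.READY: set(),
    State.CANCELED: set(),
    State.ERROR: set(),
}


class Job:
    def __init__(self, max_attempts=3):
        self.state = State.QUEUED
        self.attempts = 0
        self.max_attempts = max_attempts

    def move(self, new_state):
        """Apply a transition, rejecting anything the table does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition: {self.state.value} to {new_state.value}")
        self.state = new_state

    def start(self):
        self.move(State.PROCESSING)
        self.attempts += 1

    def fail(self):
        """Requeue while attempts remain; otherwise finish in the error state."""
        if self.max_attempts > self.attempts:
            self.move(State.QUEUED)
        else:
            self.move(State.ERROR)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the transitions in one table makes the boring failures visible: a job that is already finished cannot be restarted by accident, and a retry loop cannot run past &lt;code&gt;max_attempts&lt;/code&gt;.&lt;/p&gt;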

&lt;h2&gt;Why two engines instead of one&lt;/h2&gt;

&lt;p&gt;There is no single engine that wins on every document, so I run two.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MinerU&lt;/strong&gt; is the default. It holds up better on dense documents, heavy OCR, Cyrillic content, and table-heavy scans, and it's the safer choice under memory pressure. &lt;strong&gt;Docling&lt;/strong&gt; is opt-in: faster and cleaner on simple, well-structured text PDFs, but less forgiving on heavy full-OCR workloads.&lt;/p&gt;

&lt;p&gt;So the question isn't "which engine is best?" – it's "which engine is best for &lt;em&gt;this&lt;/em&gt; document?" That's an unsatisfying marketing answer and a useful engineering one.&lt;/p&gt;

&lt;h2&gt;How I'd evaluate any PDF-to-Markdown tool&lt;/h2&gt;

&lt;p&gt;If you're picking a converter, mine or anyone's, don't start with the landing page. Test it with documents that expose different failure modes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;a simple text PDF with headings and lists&lt;/li&gt;
&lt;li&gt;a two-column paper with footnotes&lt;/li&gt;
&lt;li&gt;a table with merged headers&lt;/li&gt;
&lt;li&gt;a scanned invoice or contract&lt;/li&gt;
&lt;li&gt;a technical paper with formulas&lt;/li&gt;
&lt;li&gt;a document with screenshots or diagrams&lt;/li&gt;
&lt;li&gt;a long PDF that might hit a time budget&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As you read the output, the useful questions split in two. First, did the structure survive: does the reading order match the original, are the tables actually correct rather than merely tidy, do the formulas come back usable, and does the OCR admit when it can't read handwriting instead of inventing words? Check for the truncation marker too, because a partial result that isn't labelled as partial is a quiet failure.&lt;/p&gt;

&lt;p&gt;The second group is the one people skip, and it's the one that matters most: can you find the file-size limit, the retention window and the privacy policy without digging, can you delete a job yourself, and does the tool &lt;em&gt;explain&lt;/em&gt; its limits instead of hiding them? PDF conversion usually touches private documents, so a converter has to earn trust before output quality even comes up.&lt;/p&gt;

&lt;h2&gt;Privacy, in plain language&lt;/h2&gt;

&lt;p&gt;Here's the model I wanted, stated the way I think every converter should state it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you can convert without creating an account&lt;/li&gt;
&lt;li&gt;uploaded PDFs are deleted after processing&lt;/li&gt;
&lt;li&gt;results are kept only for a short download window, then removed automatically&lt;/li&gt;
&lt;li&gt;you can delete a job manually&lt;/li&gt;
&lt;li&gt;documents are never used to train models&lt;/li&gt;
&lt;li&gt;documents are not sold or used for advertising&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full &lt;a href="https://pdf2md.dev/privacy/" rel="noopener noreferrer"&gt;privacy notice&lt;/a&gt; spells it out, and the &lt;a href="https://pdf2md.dev/developers/" rel="noopener noreferrer"&gt;developer docs&lt;/a&gt; cover the API and a hosted MCP endpoint for agent workflows. I'm putting this in the article because retention and training policy are product features when you're asking people to upload contracts and reports, not fine print to bury three clicks deep.&lt;/p&gt;

&lt;h2&gt;Try your worst PDF&lt;/h2&gt;

&lt;p&gt;The best test isn't a clean sample document. It's the PDF that already broke your previous workflow – a dense financial table, a two-column paper full of formulas, a long scanned report, something you want to feed into an LLM without losing structure.&lt;/p&gt;

&lt;p&gt;Throw it at &lt;a href="https://pdf2md.dev/app/" rel="noopener noreferrer"&gt;the web app&lt;/a&gt; and see how far the honest 90% gets you. No signup, files auto-deleted.&lt;/p&gt;

&lt;p&gt;If it works, great. If it breaks, I genuinely want to know which document exposed the failure, so &lt;strong&gt;drop the kind of PDF you're fighting in the comments.&lt;/strong&gt; The last 10% of this problem isn't one bug; it's a long list of document-specific edge cases, and real examples are how converters get better.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written from first-hand work on the project; I used an AI assistant to tighten the structure, not to invent the technical claims.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>webdev</category>
      <category>ai</category>
      <category>markdown</category>
    </item>
  </channel>
</rss>
