<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Edward</title>
    <description>The latest articles on DEV Community by Edward (@edward_4bd0f3553fd5ac06c5).</description>
    <link>https://dev.to/edward_4bd0f3553fd5ac06c5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3689082%2F14739ff7-9813-42f5-bf67-516c9de12b66.png</url>
      <title>DEV Community: Edward</title>
      <link>https://dev.to/edward_4bd0f3553fd5ac06c5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/edward_4bd0f3553fd5ac06c5"/>
    <language>en</language>
    <item>
      <title>Why Converting PDFs to Markdown Is Harder Than It Looks</title>
      <dc:creator>Edward</dc:creator>
      <pubDate>Fri, 02 Jan 2026 01:20:15 +0000</pubDate>
      <link>https://dev.to/edward_4bd0f3553fd5ac06c5/why-converting-pdfs-to-markdown-is-harder-than-it-looks-274d</link>
      <guid>https://dev.to/edward_4bd0f3553fd5ac06c5/why-converting-pdfs-to-markdown-is-harder-than-it-looks-274d</guid>
      <description>&lt;p&gt;When people hear “PDF to Markdown,” it often sounds like a simple text conversion task.&lt;/p&gt;

&lt;p&gt;In reality, working with PDFs — especially if you care about structure — is one of the trickiest parsing problems any developer tool can encounter.&lt;/p&gt;

&lt;p&gt;I ran into this repeatedly in documentation and LLM workflows, so I built a tool to tackle it. In this post, I’ll dig into &lt;strong&gt;why this problem is hard&lt;/strong&gt;, what usually goes wrong, and how a structure-aware pipeline can make Markdown outputs much more usable.&lt;/p&gt;




&lt;h2&gt;
  
  
  PDFs Are Not Semantic Documents — They’re Drawing Instructions
&lt;/h2&gt;

&lt;p&gt;A PDF file does &lt;em&gt;not&lt;/em&gt; encode paragraphs, headers, or tables as high-level concepts the way HTML or Markdown does.&lt;/p&gt;

&lt;p&gt;Instead it contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instructions to draw text at specific (x, y) coordinates&lt;/li&gt;
&lt;li&gt;Drawing commands for images, shapes, paths&lt;/li&gt;
&lt;li&gt;Transform matrices&lt;/li&gt;
&lt;li&gt;Optional metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is &lt;strong&gt;no “paragraph” object&lt;/strong&gt; in the page content. Tagged PDFs can carry an optional structure tree, but many files lack one, so in practice structure must be inferred from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Geometric proximity&lt;/li&gt;
&lt;li&gt;Font size and style&lt;/li&gt;
&lt;li&gt;Alignment and grouping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes the transition from PDF → Markdown fundamentally different from “text extraction.”&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Very Different Extraction Paths
&lt;/h2&gt;

&lt;p&gt;Before thinking about Markdown, we must decide which kind of PDF we’re dealing with.&lt;/p&gt;

&lt;h3&gt;
  
  
  Native PDFs (Text Layer Exists)
&lt;/h3&gt;

&lt;p&gt;Many PDFs contain real text objects, which can be read directly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text can be extracted with libraries such as PyMuPDF or pdf.js&lt;/li&gt;
&lt;li&gt;Each span comes with its position (bounding box)&lt;/li&gt;
&lt;li&gt;Font, glyph, and ordering information is preserved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the best case for structural analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scanned PDFs (Image-Only Pages)
&lt;/h3&gt;

&lt;p&gt;Some PDFs are nothing but a stack of raster images (e.g., scans):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No text objects at all&lt;/li&gt;
&lt;li&gt;Everything must come from OCR&lt;/li&gt;
&lt;li&gt;No layout metadata remains&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These fundamentally lack block information, so document structure must be reconstructed from visual cues.&lt;/p&gt;

&lt;p&gt;Detecting which path to take is an essential first step. Treating scanned and native documents identically leads to poor outputs.&lt;/p&gt;
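
&lt;p&gt;A minimal sketch of that routing decision, in Python. It assumes an upstream extractor has already reported, for each page, how many text characters it found and what fraction of the page area is covered by raster images. The &lt;code&gt;PageStats&lt;/code&gt; type, the &lt;code&gt;choose_path&lt;/code&gt; function, and both thresholds are hypothetical names and values chosen for illustration, not part of any particular library.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Per-page numbers reported by a (hypothetical) upstream extractor."""
    text_chars: int        # characters found in the page's text layer
    image_coverage: float  # fraction of page area covered by images, 0.0 to 1.0


def choose_path(page: PageStats, min_chars: int = 20, min_coverage: float = 0.8) -> str:
    """Return "native", "ocr", or "empty" for one page.

    Both thresholds are illustrative guesses and would need tuning.
    """
    if page.text_chars >= min_chars:
        return "native"   # a usable text layer exists, so skip OCR
    if page.image_coverage >= min_coverage:
        return "ocr"      # almost no text, mostly image: likely a scan
    return "empty"        # neither text nor a dominant image


pages = [PageStats(1500, 0.1), PageStats(0, 0.98), PageStats(3, 0.0)]
print([choose_path(p) for p in pages])  # ['native', 'ocr', 'empty']
```

&lt;p&gt;Deciding per page rather than per file is a deliberate choice in this sketch, since a single PDF can mix native and scanned pages.&lt;/p&gt;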




&lt;h2&gt;
  
  
  Why Most Tools Produce Low-Quality Markdown
&lt;/h2&gt;

&lt;p&gt;Here are common failure modes in existing solutions:&lt;/p&gt;

&lt;h3&gt;
  
  
  Flattened Text
&lt;/h3&gt;

&lt;p&gt;Many PDF → Markdown tools simply dump the extracted text as one flat stream. That yields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Line breaks in the wrong places&lt;/li&gt;
&lt;li&gt;Lost paragraph boundaries&lt;/li&gt;
&lt;li&gt;Broken lists&lt;/li&gt;
&lt;li&gt;Missing semantic grouping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This may produce Markdown, but rarely Markdown that’s easy to work with.&lt;/p&gt;




&lt;h3&gt;
  
  
  Over-reliance on OCR
&lt;/h3&gt;

&lt;p&gt;OCR is critical for scans, but applying it to native text PDFs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduces noise&lt;/li&gt;
&lt;li&gt;Loses formatting&lt;/li&gt;
&lt;li&gt;Adds unnecessary preprocessing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The correct pipeline is to &lt;strong&gt;detect first, then decide&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Images With No Context
&lt;/h3&gt;

&lt;p&gt;In Markdown, image placement matters. Extracting raw image files without recording where each one belongs in the text flow strips them of most of their meaning.&lt;/p&gt;

&lt;p&gt;A layout-aware pipeline sorts text and image blocks together to decide the right placement.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Block-Based Approach
&lt;/h2&gt;

&lt;p&gt;The key realization is to treat PDFs as a set of &lt;strong&gt;layout blocks&lt;/strong&gt;, each with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bounding box&lt;/li&gt;
&lt;li&gt;Page number&lt;/li&gt;
&lt;li&gt;Content type (text / image / table / code)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sort all blocks by ascending (page, y, x), with y measured from the top of the page&lt;/li&gt;
&lt;li&gt;Merge spans into paragraphs and paragraphs into higher-level structures&lt;/li&gt;
&lt;li&gt;Reconstruct lists and tables based on geometric heuristics&lt;/li&gt;
&lt;li&gt;Insert images where they best fit relative to text blocks&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach doesn’t magically discover hidden semantics. But it creates Markdown that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is readable&lt;/li&gt;
&lt;li&gt;Doesn’t require hours of cleanup&lt;/li&gt;
&lt;li&gt;Respects structural relationships better than flat extraction&lt;/li&gt;
&lt;/ul&gt;
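
&lt;p&gt;The sort-and-merge steps above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the actual implementation of any tool: the &lt;code&gt;Block&lt;/code&gt; type, the function names, and the &lt;code&gt;max_gap&lt;/code&gt; threshold are hypothetical, pages are assumed to be single-column, and y is assumed to grow downward from the top of the page.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Block:
    page: int
    x0: float   # left edge
    y0: float   # top edge, measured downward from the top of the page
    y1: float   # bottom edge
    kind: str   # "text" or "image"
    content: str  # the text itself, or an image file name


def reading_order(blocks: list[Block]) -> list[Block]:
    """Order blocks by page, then top edge, then left edge."""
    return sorted(blocks, key=lambda b: (b.page, b.y0, b.x0))


def to_markdown(blocks: list[Block], max_gap: float = 6.0) -> str:
    """Join vertically close text lines into paragraphs and emit
    images at their position in the sorted flow."""
    parts: list[str] = []
    prev = None
    for b in reading_order(blocks):
        same_paragraph = (
            prev is not None
            and prev.kind == "text"
            and b.kind == "text"
            and prev.page == b.page
            and max_gap >= b.y0 - prev.y1
        )
        if same_paragraph:
            parts[-1] += " " + b.content
        elif b.kind == "image":
            parts.append(f"![]({b.content})")
        else:
            parts.append(b.content)
        prev = b
    return "\n\n".join(parts)


blocks = [
    Block(1, 72, 130, 142, "text", "continues on a second line."),
    Block(1, 72, 200, 320, "image", "figure-1.png"),
    Block(1, 72, 116, 128, "text", "A paragraph that"),
    Block(1, 72, 340, 352, "text", "Text after the figure."),
]
print(to_markdown(blocks))
```

&lt;p&gt;In this example the two text lines at the top are merged into one paragraph, and the image lands between that paragraph and the text that follows it. Multi-column pages would need a column-detection pass before the sort.&lt;/p&gt;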




&lt;h2&gt;
  
  
  Scanned PDFs Are Image-First
&lt;/h2&gt;

&lt;p&gt;When native text blocks are absent, all blocks must be derived from visual content.&lt;/p&gt;

&lt;p&gt;In a scanned PDF:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layout info is lost&lt;/li&gt;
&lt;li&gt;Text must come from OCR&lt;/li&gt;
&lt;li&gt;Blocks must be built from visual region detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a fundamentally different process from native parsing and must be treated as such.&lt;/p&gt;

&lt;p&gt;In tools like &lt;a href="https://pdftomarkdown.pro" rel="noopener noreferrer"&gt;https://pdftomarkdown.pro&lt;/a&gt;, scanned PDFs are automatically detected and routed to OCR-based extraction. While OCR results are inherently noisier than native text extraction, this still provides usable Markdown where naive parsing would fail.&lt;/p&gt;




&lt;h2&gt;
  
  
  Handling Complex Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tables
&lt;/h3&gt;

&lt;p&gt;PDFs don’t represent tables explicitly. You infer structure from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Column alignment&lt;/li&gt;
&lt;li&gt;Row proximity&lt;/li&gt;
&lt;li&gt;Grid lines if present&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Standard Markdown tables cannot express rowspan/colspan. For complex layouts, an HTML table fallback is often preferable.&lt;/p&gt;
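
&lt;p&gt;For the simple case, column and row inference can be sketched by clustering cell coordinates. The example below is a rough illustration only: it assumes cells arrive as hypothetical &lt;code&gt;(x, y, text)&lt;/code&gt; tuples with y growing downward, treats the first row as the header, and does not handle merged cells, which is where the HTML fallback would come in.&lt;/p&gt;

```python
def cluster(values: list[float], tol: float) -> list[float]:
    """Walk the sorted coordinates and start a new group whenever the
    jump from the previous coordinate exceeds tol; return the first
    coordinate of each group."""
    anchors: list[float] = []
    last = None
    for v in sorted(values):
        if last is None or v - last > tol:
            anchors.append(v)
        last = v
    return anchors


def nearest(anchors: list[float], v: float) -> int:
    """Index of the anchor closest to v."""
    return min(range(len(anchors)), key=lambda i: abs(anchors[i] - v))


def table_to_markdown(cells: list[tuple[float, float, str]], tol: float = 5.0) -> str:
    """Place each cell on an inferred grid and render a pipe table."""
    cols = cluster([x for x, _, _ in cells], tol)
    rows = cluster([y for _, y, _ in cells], tol)
    grid = [["" for _ in cols] for _ in rows]
    for x, y, text in cells:
        grid[nearest(rows, y)][nearest(cols, x)] = text
    lines = ["| " + " | ".join(row) + " |" for row in grid]
    lines.insert(1, "|" + " --- |" * len(cols))  # header separator row
    return "\n".join(lines)


cells = [
    (72.0, 100.0, "Name"), (200.0, 100.0, "Size"),
    (72.5, 115.0, "a.pdf"), (201.0, 115.0, "2 MB"),
]
print(table_to_markdown(cells))
```

&lt;p&gt;The tolerance absorbs small alignment jitter, so cells at x = 72.0 and x = 72.5 end up in the same column.&lt;/p&gt;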




&lt;h3&gt;
  
  
  Nested Lists
&lt;/h3&gt;

&lt;p&gt;Bullets and indentation are visual cues only. Reconstructing nested lists requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bullet pattern detection&lt;/li&gt;
&lt;li&gt;Relative indentation comparison&lt;/li&gt;
&lt;li&gt;Grouping across lines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is heuristic, but works reasonably well when implemented carefully.&lt;/p&gt;
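
&lt;p&gt;One way to combine those three cues is sketched below. It is a simplified illustration, assuming each line arrives as a hypothetical &lt;code&gt;(left_x, text)&lt;/code&gt; pair in reading order and that only a few unordered bullet glyphs need to be recognized. The &lt;code&gt;nest_list&lt;/code&gt; name and the tolerance value are made up for the example.&lt;/p&gt;

```python
import re

# Bullet pattern detection: a leading "-", "*", or "•" followed by whitespace.
BULLET = re.compile(r"^[-*•]\s+(.*)$")


def nest_list(lines: list[tuple[float, str]], tol: float = 4.0) -> str:
    """Rebuild a nested Markdown list from (left_x, text) lines.

    Lines without a bullet glyph are treated as continuations of the
    previous item (grouping across lines).
    """
    levels: list[float] = []   # left_x of each currently open nesting level
    out: list[str] = []
    for x, text in lines:
        m = BULLET.match(text.strip())
        if m is None:
            if out:
                out[-1] += " " + text.strip()
            continue
        # Close levels that are indented further than this bullet.
        while levels and levels[-1] - x > tol:
            levels.pop()
        # Open a new level if this bullet sits right of the current one.
        if not levels or x - levels[-1] > tol:
            levels.append(x)
        out.append("  " * (len(levels) - 1) + "- " + m.group(1))
    return "\n".join(out)


lines = [
    (72.0, "• Fruit"),
    (90.0, "- Apple, which wraps"),
    (100.0, "onto a second line"),
    (90.0, "- Pear"),
    (72.0, "• Vegetables"),
]
print(nest_list(lines))
```

&lt;p&gt;Indentation is compared relatively rather than against fixed offsets, so the sketch does not depend on a particular page margin.&lt;/p&gt;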




&lt;h3&gt;
  
  
  Code Blocks
&lt;/h3&gt;

&lt;p&gt;Code is often recognizable by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monospaced fonts&lt;/li&gt;
&lt;li&gt;Consistent vertical spacing&lt;/li&gt;
&lt;li&gt;Absence of list/table markers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Distinguishing them accurately improves readability of outputs for technical docs.&lt;/p&gt;
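
&lt;p&gt;The font cue alone already goes a long way. The sketch below guesses from the font name whether a line is code and wraps consecutive code lines in one fenced block. The list of name hints and both function names are assumptions for illustration. A real detector would likely also check glyph widths and line spacing.&lt;/p&gt;

```python
MONO_HINTS = ("mono", "courier", "consolas", "menlo")  # illustrative, not exhaustive
FENCE = "`" * 3  # the Markdown code fence marker


def is_monospaced(font_name: str) -> bool:
    """Guess from the font name alone whether a font is monospaced."""
    name = font_name.lower()
    return any(hint in name for hint in MONO_HINTS)


def fence_code(lines: list[tuple[str, str]]) -> str:
    """lines are (font_name, text) pairs in reading order. Consecutive
    monospaced lines are wrapped in a single fenced block."""
    out: list[str] = []
    in_code = False
    for font, text in lines:
        mono = is_monospaced(font)
        if mono != in_code:
            out.append(FENCE)   # opens or closes a block
            in_code = mono
        out.append(text)
    if in_code:
        out.append(FENCE)       # close a block that runs to the end
    return "\n".join(out)


lines = [
    ("Helvetica", "Install it:"),
    ("DejaVuSansMono", "pip install tool"),
    ("DejaVuSansMono", "tool --help"),
    ("Helvetica", "Done."),
]
print(fence_code(lines))
```

&lt;p&gt;Here the two monospaced lines end up inside one fence, with the surrounding prose left untouched.&lt;/p&gt;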




&lt;h2&gt;
  
  
  What “Good Enough” Really Means
&lt;/h2&gt;

&lt;p&gt;A perfect conversion from PDF to Markdown is impossible in the strict sense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most PDFs carry no semantic document model&lt;/li&gt;
&lt;li&gt;OCR has inherent error rates&lt;/li&gt;
&lt;li&gt;Layout inference is heuristic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But a “good enough” solution is one where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Markdown is &lt;strong&gt;readable&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Structural elements aren’t mangled&lt;/li&gt;
&lt;li&gt;Images and tables aren’t orphaned&lt;/li&gt;
&lt;li&gt;Minimal manual cleanup is needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For documentation, note-taking, or LLM workflows, this is far more important than pixel-perfect fidelity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;PDF was designed for printing and visual fidelity, not semantic reuse.&lt;/p&gt;

&lt;p&gt;Converting it to Markdown is inherently a translation problem — from &lt;em&gt;geometry&lt;/em&gt; to &lt;em&gt;structure&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A structure-aware pipeline makes this translation far more reliable than naive extraction, and handling both native and scanned PDFs robustly is essential for real-world use.&lt;/p&gt;

&lt;p&gt;If you’d like to see a practical implementation of these ideas in action, check out &lt;a href="https://pdftomarkdown.pro/" rel="noopener noreferrer"&gt;https://pdftomarkdown.pro/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Feedback and edge-case examples are always welcome.&lt;/p&gt;

</description>
      <category>pdf</category>
      <category>ocr</category>
      <category>markdown</category>
      <category>developer</category>
    </item>
  </channel>
</rss>
