Takeshi Fuchi

Posted on Jun 2

I Built a Service That Actually Converts PDFs to Markdown Correctly

#pdf #markdown #ai

Have you ever copy-pasted from a PDF only to get mangled line breaks, tables collapsed into a single line, formulas turned into gibberish, and figure captions floating somewhere completely wrong?

You want to summarize a PDF with an LLM, organize old papers in Notion, or dump internal docs into a knowledge base — the goal is simple. But the moment you hit "PDF text extraction," everything falls apart before you even start.

So I built pdfmd.net — upload a PDF, get back a properly structured Markdown file with headings, paragraphs, tables, LaTeX formulas, and figure references all intact.

"Why not just attach the PDF to GPT-5.5?"

Fair question. For a 1–2 page document, that works fine. Here's where the approaches differ:

| | ① Text Extraction Tools | ② GPT-5.5 / Claude directly | ③ pdfmd.net |
| --- | --- | --- | --- |
| How it works | Reads character coordinates from PDF internals | Attach PDF, ask "convert to Markdown" | Page PNG → tuned LLM → Markdown + images as ZIP |
| Speed / Cost | ✅ Fast, free | ❌ High-end models get expensive fast | ✅ Cheap model by default, you choose |
| Structure preserved | ❌ Tables collapse, formulas break, 2-column layout mixes | ✅ Mostly, but varies with prompt | ✅ Consistent every time |
| Long documents | ✅ No page limit | ❌ Quality drops past ~50 pages | ✅ 200+ pages handled reliably |
| Images / figures | ❌ Zero information | ⚠️ Possible but you have to engineer it | ✅ Extracted and linked in ZIP automatically |
| Batch processing | ✅ | ❌ One file at a time, manually | ✅ |

The real gap with "just using GPT-5.5" is that you have to re-engineer it every time:

Prompt required every time — "format tables in Markdown", "use LaTeX for formulas", "extract figures with filenames", "bundle as ZIP" — skip any of these and output quality varies unpredictably.
Long documents degrade — 50–200 page papers hit context limits, get cut off, or have visibly worse quality in the second half.
One file at a time — converting 10 papers means 10 upload-ask-copy cycles.
Cost adds up — processing large volumes through GPT-5.5 / Claude Sonnet class models gets expensive quickly.

How pdfmd works

For each page, pdfmd extracts two things:

PNG image (full-page render) → primary input to the Vision LLM
Text (via PyMuPDF) → hints to help the LLM read characters accurately

The Vision LLM uses the image as the ground truth for structure and the text as a supplement. This is why 2-column layouts don't bleed into each other, tables stay intact, and formulas come out as LaTeX — structure is understood visually, not guessed from character positions.
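To make the two per-page inputs concrete, here is a minimal sketch of how they could be gathered with PyMuPDF and packed into one request. It is not pdfmd's actual code: the `PageInput` class, the `build_request` payload shape, the prompt wording, and the 150 dpi default are all assumptions made for illustration.

```python
import base64
from dataclasses import dataclass


@dataclass
class PageInput:
    """The two per-page inputs: a full-page render plus a text hint."""
    number: int      # 1-based page number
    png: bytes       # full-page PNG render, the primary input
    text_hint: str   # text layer from the PDF, a secondary hint


def load_pages(pdf_path: str, dpi: int = 150) -> list[PageInput]:
    """Render every page to PNG and read its text layer with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so build_request works without it

    pages = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            pixmap = page.get_pixmap(dpi=dpi)
            pages.append(
                PageInput(index + 1, pixmap.tobytes("png"), page.get_text())
            )
    return pages


def build_request(page: PageInput) -> dict:
    """Assemble a provider-neutral payload (hypothetical shape and prompt)."""
    return {
        "instruction": (
            "Convert this page to Markdown. Treat the image as ground truth "
            "for structure and use the text only to confirm characters."
        ),
        "image_png_base64": base64.b64encode(page.png).decode("ascii"),
        "text_hint": page.text_hint,
    }
```

The point of keeping the image and the text as separate fields is that the model can be told which one wins when they disagree, which is the priority described above.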

Figures and graphs are cropped from the page, saved as image files, and embedded in the Markdown as ![caption](./images/page3_fig1.png). The final output is a ZIP containing the .md file and all images.
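The output layout is simple enough to sketch with the standard library. The snippet below only mirrors the structure described above (one .md file plus an images/ folder, with figures named like page3_fig1.png); the function names and the default `document.md` filename are assumptions, not pdfmd internals.

```python
import io
import zipfile


def figure_name(page: int, index: int) -> str:
    """Follow the page3_fig1.png naming pattern."""
    return f"page{page}_fig{index}.png"


def figure_reference(caption: str, page: int, index: int) -> str:
    """Build the Markdown image reference for one cropped figure."""
    return f"![{caption}](./images/{figure_name(page, index)})"


def bundle(
    markdown: str,
    figures: dict[tuple[int, int], bytes],
    md_name: str = "document.md",  # assumed name for the Markdown file
) -> bytes:
    """Pack the Markdown file and its images into an in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(md_name, markdown)
        for (page, index), data in figures.items():
            archive.writestr(f"images/{figure_name(page, index)}", data)
    return buffer.getvalue()
```

Because the references are relative (./images/...), the unzipped folder can be dropped into Obsidian or a repo without rewriting any paths.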

On models and cost: pdfmd runs on your own API key. You choose the model; API costs go directly to the provider. The default is a Gemini Flash Lite-class model — heavily tuned prompts and pipeline squeeze high-quality output from cheap models. Switch to a more capable model anytime when precision matters. This — a tuned pipeline running on your key — is the fundamental difference from asking GPT-5.5 directly.

Just upload. A 200-page paper produces the same quality result every time, no prompt crafting required.

Seeing it in action

Complex formulas: raw extraction vs. pdfmd

Using a machine learning paper (Conditional GAN, arXiv:1906.05596) as the test case.

Eq. 1 — GAN minimax objective

Here's what the formula looks like in the PDF:

Text extraction shatters it across lines. pdfmd reconstructs it as complete LaTeX:

\mathbb{E} (expectation), the bold vector \mathbf{x}, and the nested brackets are all recovered exactly. Text extraction can't tell min from a subscript or an equation number from a line break.
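For readers who want to see that notation in context, here is the textbook GAN minimax objective written out in LaTeX. This is the standard form from the GAN literature, included as an assumption about the general shape of such an equation. It is not a copy of the paper's Eq. 1 or of pdfmd's actual output, and a conditional GAN would add a conditioning variable to D and G.

```latex
$$
\min_G \max_D V(D, G) =
  \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}(\mathbf{x})}
    \left[ \log D(\mathbf{x}) \right]
  + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}
    \left[ \log \left( 1 - D(G(\mathbf{z})) \right) \right]
$$
```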

Eq. 9 — Piecewise (cases) function

In the PDF:

Text extraction leaves bare brackets with misaligned text. pdfmd uses \begin{cases}:

Paste this into Obsidian, Notion, or a Jupyter notebook and the formula renders correctly.

Eq. 5 — Fraction + Frobenius norm

Compound expressions with fractions, multi-level subscripts, and norm notation:

\frac{1}{WH}, the subscripted \theta_G, the Frobenius norm \|\cdot\|_F^2, and \Omega_{256} are all correct. Text extraction splits numerator and denominator onto separate lines and drops the norm delimiters entirely.

Tables: subscripts intact

Using an ADC datasheet (LTC2228/2227/2226) as the example. Electrical spec tables are full of subscripted symbols: VCM, VIH, IIN, IOUT.

The original PDF page:

Text extraction strips subscripts: VCM, VIH, IIN stay as plain text. pdfmd renders them all as proper LaTeX:

VCM → $V_{CM}$, VIH → $V_{IH}$, IIN → $I_{IN}$, IOUT = 0 → $I_{OUT} = 0$.

In a hardware spec, VCM and $V_{CM}$ mean the same thing to a human but are different strings to any downstream system — search, LLM context, RAG retrieval. The subscript isn't cosmetic; it's semantic.
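Since the two spellings are different strings, a downstream index may want to know both. The sketch below is one possible way to handle that on the consumer side, assuming symbols follow the simple base_{subscript} pattern seen in the datasheet example. The function name is hypothetical and none of this is part of pdfmd.

```python
import re

# Matches a base symbol followed by a braced subscript, e.g. V_{CM}.
_SUBSCRIPT = re.compile(r"([A-Za-z]+)_\{([A-Za-z0-9]+)\}")


def flat_aliases(markdown: str) -> dict[str, str]:
    """Map each subscripted symbol to its flat spelling (V_{CM} -> VCM)."""
    return {
        match.group(0): match.group(1) + match.group(2)
        for match in _SUBSCRIPT.finditer(markdown)
    }


row = "| $V_{CM}$ | Common mode voltage | $I_{OUT} = 0$ |"
print(flat_aliases(row))
```

Indexing both forms means a search for VCM still lands on a row that stores $V_{CM}$, while the Markdown itself keeps the subscript.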

Page breaks + figure interruptions: text reconnected

Using a 2-column academic paper from SIGMOD'18 (GPU parallel top-k algorithm). A paragraph spans two pages with a figure inserted mid-sentence.

The source PDF (two pages stitched):

Text extraction output:

The page footer cuts in while the paragraph is still mid-sentence:

…This results in two bitonic sequences,
(S[0], ..S[l/2 −1]) and (S[l/2], ...S[l −1])
where all the elements in the first subsequence
are smaller than any element in the second
subsequence.
Research 15: Databases for Emerging Hardware     ← page footer
SIGMOD'18, June 10-15, 2018, Houston, TX, USA   ← session header
1558                                             ← page number
Phase  Step
1      1
2      1  2  3  4                                ← figure data
(a) Algorithm.
Unsorted Input
After Phase 1
…
In the second step, the same procedure           ← paragraph resumes here
is applied to both the subsequences…

pdfmd output:

…This results in two bitonic sequences, $(S[0], \dots S[l/2 - 1])$ and
$(S[l/2], \dots S[l - 1])$ where all the elements in the first subsequence
are smaller than any element in the second subsequence. In the second step,
the same procedure is applied to both the subsequences, resulting in four
bitonic sequences…

![Bitonic Sorting Network](./images/gputopk_sigmod18_page3_fig1.png)
**Figure 3:** Bitonic Sorting Network

Page footer, session header, page number, and raw figure data all stripped. Paragraph text flows continuously. Figure placed at the right position with a proper reference.
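For contrast, here is roughly what a text-only cleanup pass looks like. This is a common heuristic baseline, not what pdfmd does: it drops any line that repeats on most pages, on the assumption that such lines are running headers or footers. The function name and the 0.6 threshold are illustrative choices.

```python
from collections import Counter


def strip_running_lines(
    pages: list[list[str]], min_share: float = 0.6
) -> list[list[str]]:
    """Drop lines that appear on at least min_share of all pages."""
    counts: Counter[str] = Counter()
    for lines in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in lines if line.strip()})
    threshold = min_share * len(pages)
    repeated = {line for line, seen in counts.items() if seen >= threshold}
    return [
        [line for line in lines if line.strip() not in repeated]
        for lines in pages
    ]
```

A pass like this catches the session header, but a page number changes on every page and the raw figure data appears only once, so both survive. That leftover is the gap the image-first approach closes.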

Complex multi-column Japanese layout

Multi-column Japanese documents — municipal newsletters, magazine spreads, textbooks mixing vertical and horizontal text — are the hardest layout class for text extraction. Columns bleed into each other, vertical text comes out one character per line, and decorative borders become garbage characters.

Here's a one-page municipal newsletter put through pdfmd:

PyMuPDF text extraction output:

まちだ市民大学ＨＡＴＳ
まちだ市民大学ＨＡＴＳ
まちだ市民大学ＨＡＴＳ      ← heading repeated 3× (three columns)
受講生募集
襖
鴬                           ← decorative characters garbled
横横横横横横横横横横横横横横  ← border lines as characters

対市内在住、在勤、在学の、原則、全回出席できる方
場陶芸に関する講座＝陶芸スタジオ（下小山田町）、

日９月１０日～１２月１７日の月曜日     ← different section mixed in
人間関係学～人間関係の多様性と向き合   ← reading order scrambled

玉                           ← vertical text: one char per line
川
学
園
子
ど
も
…

pdfmd output:

## 催し ご参加を

### 玉川学園子どもクラブ ころころ児童館

#### 【7月のわくわくWeek「水鉄砲合戦～広場決戦の巻」】
自分の水鉄砲を持ってきて参加できます。びしょぬれになるので、
着替えが必要な方はお持ち下さい。
*   **対** 小学生以上の方
*   **日** 7月23日(月)～8月3日(金)、いずれも午後3時30分～5時(雨天中止)

### まちだ市民大学HATS 2012年度後期講座 受講生募集

*   **対** 市内在住、在勤、在学の、原則、全回出席できる方
*   **申** 7月25日正午～8月24日に電話でイベントダイヤル(📞724・5656)へ。
*   **問** 生涯学習センター 📞728・0071 FAX728・0073

Vertical text reconstructed, 3-column layout correctly ordered, ## → ### → #### heading hierarchy preserved, phone numbers and labels intact.
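If you want to check the heading hierarchy of a converted file yourself, a few lines of Python are enough. This is a small sketch for inspecting the output, with hypothetical function names. It only looks at ATX-style # headings and would misread # lines inside fenced code blocks.

```python
import re

# One to six leading # characters, a space, then the heading title.
_HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def outline(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) for every heading line, in document order."""
    headings = []
    for line in markdown.splitlines():
        match = _HEADING.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


def skipped_levels(headings: list[tuple[int, str]]) -> list[str]:
    """List titles that jump more than one level below the previous heading."""
    problems = []
    previous = 0
    for level, title in headings:
        if previous and level > previous + 1:
            problems.append(title)
        previous = level
    return problems
```

On the newsletter output above, `outline` yields levels 2, 3, 4, 3 and `skipped_levels` returns an empty list, which is what a clean ## → ### → #### hierarchy should look like.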

Multilingual: works the same way

Because the approach is Vision LLM-based, language is not a special case. Here's a Chinese academic paper:

# CUDA 并行计算技术在情报信息研判中的应用

**摘要：** 文章在研究公安情报信息研判技术的基础上…

$$W_{ik} = \frac{tf_{ik} \log(N/n_k + 0.01)}{\sqrt{\sum_{k=1}^n [tf_{ik} \log(N/n_k + 0.01)]^2}} \tag{1}$$

Supported languages: English, Japanese, Simplified Chinese, Traditional Chinese, Spanish, French, Portuguese, Russian, German, Turkish, Korean, Italian, Dutch — and the UI is translated into all of them.

Figures and graphs described, not just extracted

For charts and graphs — content that can't be extracted as text — pdfmd saves the image and writes a caption describing what it shows:

![Boxplots of Dice scores](./images/page7_fig2.png)
**Fig. 5.** Boxplots of Dice scores for various anatomical structures for ANTs,
NiftyReg, and VoxelMorph. Structures are ordered by average ANTs Dice score.

Ask an LLM "what does Fig. 5 show?" later, and it can actually answer — because the description is in the Markdown alongside the image reference.
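On the consuming side, pairing each image with its caption takes one regex and a short loop. The sketch below assumes the layout shown above, where the caption paragraph directly follows the image line and ends at the next blank line. `figure_index` is a hypothetical helper, not something pdfmd ships.

```python
import re

# Markdown image reference: ![alt text](path)
_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def figure_index(markdown: str) -> list[dict[str, str]]:
    """Pair each image reference with the caption paragraph that follows it."""
    lines = markdown.splitlines()
    figures = []
    for position, line in enumerate(lines):
        match = _IMAGE.search(line)
        if not match:
            continue
        caption = []
        for following in lines[position + 1:]:
            if not following.strip():
                break  # a blank line ends the caption paragraph
            caption.append(following.strip())
        figures.append(
            {
                "alt": match.group(1),
                "path": match.group(2),
                "caption": " ".join(caption),
            }
        )
    return figures
```

Each entry can then go into a RAG store as its own chunk, so a question about Fig. 5 retrieves the description even though the pixels never reach the model.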

When to use it

Moving papers, contracts, or internal docs into Notion or Obsidian
Preprocessing PDFs as RAG source documents for LLM pipelines
Quoting from a PDF in a blog post or report — paste the Markdown directly
Cleaning up multilingual material before sending to DeepL or GPT

If you just need text search inside a PDF, a viewer does that fine. pdfmd is for when you need to do something with the Markdown afterward.

Try it

pdfmd.net — sign up and get 50 pages free. 1 point = 1 page. Upload a file, wait a moment, download the ZIP. That's it.

The API key is free too. pdfmd runs on your own API key, but Google AI Studio lets you generate a Gemini API key at no cost. The default model (Gemini Flash Lite class) runs within AI Studio's free tier — so those 50 signup pages are completely free end-to-end, API costs included. No credit card required anywhere.

If you've ever fed a badly-extracted PDF into an LLM and gotten back a confused or hallucinated answer, try running the same PDF through pdfmd first. The difference is usually immediate.
