Jerome

Posted on • Originally published at pdfmarkdown.app

Your AI agent can't grep a PDF, and it's burning your tokens 🔥

Your coding agent can grep your whole repo in milliseconds. It can't treat a PDF the same way.

A PDF is not AI-friendly by default. Even when it contains selectable text, the structure that matters to an agent often gets lost or has to be guessed back: reading order, tables, formulas, captions, and figures. That extraction is lossy, and it is not free.

Here's what's going on under the hood, and why converting once to clean Markdown is the fix.

Disclosure: I build pdfmarkdown.app, an in-browser PDF→Markdown converter, so weigh that accordingly. I've kept the claims checkable, so test them yourself.

A PDF is a picture, not text

Most PDFs don't store your sentences. They store where each glyph sits on the page. "Married" might be saved as a run of positioned glyphs with no record that they form a word, that the word belongs to a particular paragraph, or that the left column should be read before the right one. (Tagged PDFs can carry logical structure and reading order, but in the wild they're rare or unreliable, so tools can't count on them.)

A human eye reassembles all that instantly. Software has to guess it back, and that guessing is where things break.
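To make that guessing concrete, here's a minimal sketch. It is not how any particular PDF library works: the `Glyph` type, the coordinates and the `gutter_x` parameter are all made up for illustration, and each "glyph" is a whole word to keep it short. It shows what happens when software sorts positioned text top-to-bottom and left-to-right on a two-column page, and what changes once it knows where the column gutter is.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    text: str  # a whole word here, to keep the example short
    x: float   # horizontal position on the page
    y: float   # vertical position, measured from the top


def naive_reading_order(glyphs: list[Glyph]) -> str:
    """Sort top-to-bottom, then left-to-right, ignoring columns."""
    ordered = sorted(glyphs, key=lambda g: (g.y, g.x))
    return " ".join(g.text for g in ordered)


def column_aware_order(glyphs: list[Glyph], gutter_x: float) -> str:
    """Read everything left of the gutter first, then everything right of it."""
    left = [g for g in glyphs if g.x < gutter_x]
    right = [g for g in glyphs if g.x >= gutter_x]
    return naive_reading_order(left) + " " + naive_reading_order(right)


# Hypothetical two-column page.
# Left column: "The cat sat down."  Right column: "Dogs bark loudly today."
page = [
    Glyph("The", 10, 0), Glyph("cat", 40, 0),
    Glyph("Dogs", 300, 0), Glyph("bark", 340, 0),
    Glyph("sat", 10, 12), Glyph("down.", 40, 12),
    Glyph("loudly", 300, 12), Glyph("today.", 350, 12),
]

print(naive_reading_order(page))
# The cat Dogs bark sat down. loudly today.
print(column_aware_order(page, gutter_x=200))
# The cat sat down. Dogs bark loudly today.
```

The hard part in real documents is that nobody hands you `gutter_x`. The extractor has to infer it from the positions, and that inference is the guess.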

[Figure: a PDF stores letters as scattered x,y coordinates with no reading order. Markdown stores the order and the structure, which is exactly what a model needs.]

The five places it breaks

When a converter (or a model) takes that guess, five things tend to fall apart, and they're the parts that carry the actual meaning:

[Figure: the five breakpoints (scanned pages, multi-column order, tables, formulas, and images).]

  • Scanned pages are just images. No text layer at all. Without OCR, the model "sees" a photo and quietly makes things up.
  • Multi-column pages read in the wrong order. A two-column paper gets stitched left-half-line then right-half-line, so sentences interleave into nonsense.
  • Tables collapse. Rows and columns flatten into one run-on line. The number that was under "2024" ends up floating next to a label from a different row.
  • Formulas turn to gibberish. E = mc² becomes E mc2, subscripts and superscripts drift, and the very equation the paper is built around becomes unreadable.
  • Figures lose their meaning. A chart gets dropped, or at best pulled out as a bare image. In a text or Markdown pipeline (RAG, search, an agent grepping over text), that image carries no meaning. A vision model could look at it, but your retrieval index and your grep can't.
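The table breakpoint is the easiest one to see in a few lines. This is a sketch with invented numbers: `flatten` imitates what a naive extraction tends to hand back, and `to_markdown_table` writes the same cells out as a pipe table so each value stays under its column.

```python
def flatten(rows: list[list[str]]) -> str:
    """Imitate a naive extraction: every cell in one run, no grid."""
    return " ".join(cell for row in rows for cell in row)


def to_markdown_table(rows: list[list[str]]) -> str:
    """Render the first row as a header and the rest as body rows."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Illustrative data only.
rows = [
    ["Metric", "2023", "2024"],
    ["Revenue", "10", "14"],
    ["Costs", "7", "9"],
]

print(flatten(rows))
# Metric 2023 2024 Revenue 10 14 Costs 7 9
print(to_markdown_table(rows))
# | Metric | 2023 | 2024 |
# | --- | --- | --- |
# | Revenue | 10 | 14 |
# | Costs | 7 | 9 |
```

In the flattened line, nothing says whether "14" belongs to Revenue or to 2024. In the table, it says both.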

The fix: clean Markdown is the format AI actually reads well

Markdown is plain text with light, explicit structure: # for headings, real rows and columns for tables, fenced blocks for code. The plainness is the whole point:

  • The structure is stated, not guessed. The reading order, the table shape and the hierarchy are all written down.
  • It's greppable and token-cheap. It's plain text, so an agent can search it line by line, and there's no binary cruft for a model to wade through.
  • Models were trained on mountains of it (every README, every wiki, every docs site), so they parse it natively.
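"Stated, not guessed" means a dozen lines of code can recover the hierarchy. Here's a rough sketch that pulls a heading outline out of a Markdown string. It only handles `#`-style headings and skips fenced code blocks; the `outline` function and the sample document are mine, for illustration.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # a code fence marker, built this way to keep the example tidy


def outline(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) for each '#' heading outside fenced code blocks."""
    result = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            result.append((len(match.group(1)), match.group(2)))
    return result


doc = "# Report\n\nIntro text.\n\n## Results\n\nSome numbers.\n\n### Table notes\n"
print(outline(doc))
# [(1, 'Report'), (2, 'Results'), (3, 'Table notes')]
```

There is no equivalent ten-line function for a PDF, because the hierarchy usually isn't written down anywhere in the file.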

Convert the PDF once into clean Markdown and you've done the hard, lossy extraction a single time, deliberately, instead of making every tool redo it (badly) on every query.

Where llms.txt fits in

This is the same idea behind llms.txt, an emerging convention where a site publishes a plain-Markdown map of its important content so AI tools can read it directly, instead of fighting through rendered HTML or PDFs. If you want AI to read something, hand it clean Markdown. A PDF on your disk and a webpage an AI crawls have the exact same problem, and the exact same fix.
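For a feel of what such a file looks like, here's a small sketch that builds one. It follows the shape of the llms.txt proposal as I understand it (an H1 title, a blockquote summary, then H2 sections listing links with short notes). The `build_llms_txt` helper, the site name and the example.com URLs are all placeholders.

```python
def build_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Build an llms.txt-style Markdown map: title, summary, linked sections."""
    lines = [f"# {title}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}", ""]
        lines += [f"- [{name}]({url}): {note}" for name, url, note in links]
    return "\n".join(lines) + "\n"


# Placeholder site and URLs, for illustration only.
llms_txt = build_llms_txt(
    title="Example Docs",
    summary="Documentation for a hypothetical product, in plain Markdown.",
    sections={
        "Docs": [
            ("Quickstart", "https://example.com/quickstart.md", "Install and first run"),
            ("API reference", "https://example.com/api.md", "Endpoints and parameters"),
        ],
    },
)
print(llms_txt)
```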

Turning a PDF into AI-ready Markdown: what to watch

If you convert a PDF, judge the result on the parts that actually break, not on whether the first paragraph looks fine. Check four things:

  1. Did the tables survive as real rows and columns?
  2. Did the formulas survive as readable math?
  3. Were scanned pages recognized, or silently handed back as garbage?
  4. Did the figures make it into the output at all?
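Those four checks can be roughed out in code as a first pass before you eyeball the result. This is a heuristic sketch, not what pdfmarkdown.app runs. It assumes the converter gives you one Markdown string per page, writes tables as pipe tables with at least two columns, math between `$` delimiters, and figures as `![alt](src)` images. The `check_pages` name and the `min_chars` threshold are mine.

```python
import re

# A separator row such as | --- | --- | (two or more columns).
TABLE_SEPARATOR = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)+\|?\s*$")
# Heuristic: $$ display math or $...$ inline math (prices like $5 can fool it).
MATH = re.compile(r"\$\$|\$[^$\n]+\$")
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]+\)")


def check_pages(pages: list[str], min_chars: int = 20) -> dict[str, object]:
    """Flag which of the four checks the converted pages appear to pass."""
    text = "\n".join(pages)
    lines = text.splitlines()
    has_table = any(
        "|" in lines[i] and TABLE_SEPARATOR.match(lines[i + 1]) is not None
        for i in range(len(lines) - 1)
    )
    # Pages with almost no text are candidates for unrecognized scans.
    near_empty = [
        number
        for number, page in enumerate(pages, start=1)
        if len("".join(page.split())) < min_chars
    ]
    return {
        "tables": has_table,
        "formulas": MATH.search(text) is not None,
        "figures": IMAGE.search(text) is not None,
        "near_empty_pages": near_empty,
    }


# Made-up converter output for a three-page document.
pages = [
    "# Paper\n\n| Year | Value |\n| --- | --- |\n| 2024 | 14 |\n\nEnergy is $E = mc^2$.",
    "![Figure 1: model architecture](figure1.png)\n\nFigure 1 shows the architecture in detail.",
    "  \n",
]
print(check_pages(pages))
# {'tables': True, 'formulas': True, 'figures': True, 'near_empty_pages': [3]}
```

A passing result only tells you the structures exist in the output. Whether the numbers sit in the right cells is still something you confirm against the original.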

This is the bar I hold pdfmarkdown.app to: it runs in your browser, shows you the original PDF and the Markdown side by side so you can check those four things before you trust the output, and when a page is genuinely hard (a scan with no text layer) it says so up front instead of faking it. It's a floor I can show you, not a "perfect conversion" promise, because nobody can honestly make that one.

[Figure: pdfmarkdown.app with the original PDF on the left and the generated Markdown on the right. This is the Attention Is All You Need paper: the figure keeps its caption, and the equation comes through as real math.]

"But models keep getting smarter, won't this just go away?"

Accuracy may well improve. Two things don't go away, and they matter more, not less, as agents take over:

1. Tokens. A PDF has to be parsed into text before a model can do anything with it. In the naive pattern (attach the PDF to each chat) you re-pay that parse on every turn. Prompt caching and RAG soften it, but they're working around the same root cause: the PDF was never text to begin with. Convert it once to Markdown and the parse is done for good: cheap to embed, cheap to search, cheap to ask about.

2. Agents read on demand. Tools like Claude Code and Codex typically don't slurp whole files into context; they grep and search for the few lines they need, when they need them. A PDF can't be searched that way without first extracting it to text, which is exactly "convert it to Markdown." Do it once and your agent treats it like any other file in the repo.
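To put rough numbers on the token point, here's a back-of-the-envelope sketch. Every figure in it is invented for illustration, and it deliberately ignores prompt caching and growing chat history. It compares re-sending a whole document on every turn with sending only a retrieved snippet from the converted Markdown.

```python
def naive_tokens(doc_tokens: int, question_tokens: int, turns: int) -> int:
    """Attach the whole document to every turn, with no caching."""
    return turns * (doc_tokens + question_tokens)


def convert_once_tokens(snippet_tokens: int, question_tokens: int, turns: int) -> int:
    """Search the converted Markdown and send only the matching snippet."""
    return turns * (snippet_tokens + question_tokens)


# Illustrative inputs, not measurements.
DOC_TOKENS = 40_000      # the whole document as text
SNIPPET_TOKENS = 500     # the few lines a search pulls out
QUESTION_TOKENS = 50
TURNS = 10

print(naive_tokens(DOC_TOKENS, QUESTION_TOKENS, TURNS))
# 400500
print(convert_once_tokens(SNIPPET_TOKENS, QUESTION_TOKENS, TURNS))
# 5500
```

Your own ratio will depend on document size and how many turns you run, so plug in your own numbers.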

[Figure: how an agent actually reads. Markdown lets it pull the three lines it needs; a PDF has to be decoded whole before it can be searched at all.]
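Here's roughly what that on-demand read looks like once the document is Markdown. It's a sketch in plain Python rather than a real agent's tooling, and the `grep_markdown` function and the sample report are made up. It returns each matching line together with its line number and the heading it sits under.

```python
import re


def grep_markdown(markdown: str, pattern: str) -> list[tuple[int, str, str]]:
    """Return (line_number, nearest_heading, line) for each matching line."""
    regex = re.compile(pattern, re.IGNORECASE)
    heading = ""
    hits = []
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.startswith("#"):
            heading = line.lstrip("#").strip()
        if regex.search(line):
            hits.append((number, heading, line))
    return hits


# Made-up converted document.
report = (
    "# Annual report\n\n"
    "## Revenue\n\n"
    "| Year | Revenue |\n"
    "| --- | --- |\n"
    "| 2023 | 10 |\n"
    "| 2024 | 14 |\n\n"
    "## Costs\n\n"
    "Costs rose in 2024.\n"
)

for hit in grep_markdown(report, "2024"):
    print(hit)
# (8, 'Revenue', '| 2024 | 14 |')
# (12, 'Costs', 'Costs rose in 2024.')
```

Two lines come back, each with enough context to be useful, and the rest of the document never enters the model's context.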

So the trend runs opposite to the intuition. As AI shifts from chatting with one document to agents navigating a whole library of code and docs, the PDF becomes a bigger bottleneck, not a smaller one. Better models make the agent pattern more common, which makes clean Markdown more necessary, not less.

"I just keep my PDFs in Obsidian, do I still need this?"

Especially then. A vault lives or dies on what you can search, link and fold into other notes, and a raw PDF sitting in it is a dead end: you can't [[link]] to a heading inside it, can't pull one paragraph into a daily note, can't grep it. Convert it to Markdown and the PDF becomes a first-class note like everything else, readable by you and by any AI you point at your vault.

The short version

  • Most PDFs store where glyphs sit, not what they say in what order, so anything reading one has to guess, and guesses worst on tables, formulas, multi-column pages, scans and figures.
  • You can't grep or embed a PDF until it's been extracted to text. Clean Markdown is that text, with the structure intact: greppable, token-cheap, and what models read natively. llms.txt is the same idea for the web.
  • Smarter models don't retire the problem. Token cost and agent-style on-demand reading make converting-once-to-Markdown more valuable over time.

Convert a PDF to clean Markdown once, glance over it to confirm the tables and formulas came through, and from then on every tool, model and agent you hand it to reads the real thing instead of guessing at the original.
