DEV Community

Tang Weigang

Before You Let an Agent Convert Documents, Narrow the MarkItDown Boundary

MarkItDown looks like a simple utility at first glance: point it at a PDF, Word file, spreadsheet, image, HTML page, archive, audio file, or URL, and get Markdown back.

For an AI agent, that simplicity is exactly where I would slow down.

Once a coding agent or MCP host can call a document converter, the question is no longer "can it convert files?" The question that matters is:

What is the smallest input surface the agent may touch, and what evidence proves the Markdown output is good enough for the next step?

This note is based on the independent Doramagic MarkItDown manual:

https://doramagic.ai/en/projects/markitdown/manual/

It is not an official Microsoft or MarkItDown document.

Use it for LLM ingestion, not layout reconstruction

MarkItDown is useful because Markdown is close to plain text while still preserving headings, lists, links, and some structural cues. That makes it practical for retrieval, summarization, indexing, and agent workflows.

But I would not treat it as a high-fidelity document renderer.

The first boundary check is plain:

  • If the task needs a human-perfect copy of a PDF layout, hold.
  • If the task needs searchable text with inspectable structure, continue.
  • If the task involves untrusted uploads or remote URLs, define the allowed paths and schemes before any agent runs it.

That distinction prevents a bad handoff where the agent reasons over Markdown that was never fit for the downstream claim.
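
The third check can be written down before any agent runs. The following is a minimal sketch, not part of MarkItDown: `ALLOWED_ROOT`, `ALLOWED_URL_SCHEMES`, and `is_allowed_source` are hypothetical names, and the policy values are assumptions chosen for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical policy values (assumptions, not MarkItDown settings).
ALLOWED_ROOT = Path("/workdir/inbox")
ALLOWED_URL_SCHEMES: frozenset = frozenset()  # empty set: no remote or URI inputs


def is_allowed_source(source: str) -> bool:
    """Return True only for paths under ALLOWED_ROOT or explicitly approved schemes."""
    scheme = urlparse(source).scheme.lower()
    # Anything with a multi-letter scheme (https:, file:, data:) is treated as a URI.
    # Single letters are left alone so Windows drive letters are not misread.
    if len(scheme) > 1:
        return scheme in ALLOWED_URL_SCHEMES
    # Resolve ".." segments and symlinks before comparing against the root.
    resolved = (ALLOWED_ROOT / source).resolve()
    return resolved.is_relative_to(ALLOWED_ROOT.resolve())
```

Resolving the path before the comparison is the design choice that matters here: a plain string prefix check would accept `/workdir/inbox/../secret.pdf`.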

Start with one format, not every optional dependency

The convenient install path is:

pip install "markitdown[all]"

That is fine for exploration. It is not always the best first production-style test.

For a first agent workflow, I prefer a narrow install:

python -m venv .venv
source .venv/bin/activate
pip install "markitdown[pdf,docx]"
markitdown sample.pdf -o sample.md

Then verify the artifact:

  • the Markdown file exists and is non-empty;
  • headings and links survived well enough for the task;
  • tables are manually sampled instead of trusted blindly;
  • the command, package version, input path, and output path are recorded;
  • the agent is instructed to read the Markdown artifact, not go back and inspect arbitrary source files.

That small test catches the most common mistake: treating "installation works" as the same thing as "the document pipeline is safe."
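
The artifact checks above can be partly automated. This is a rough sketch using only the standard library; `summarize_markdown` and `run_record` are hypothetical helper names, and the regular expressions are simple heuristics rather than a Markdown parser. The counts are evidence for a human to sample, not a verdict.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path


def summarize_markdown(md_path: Path) -> dict:
    """Collect simple, inspectable evidence about a converted Markdown file."""
    exists = md_path.is_file()
    text = md_path.read_text(encoding="utf-8") if exists else ""
    lines = text.splitlines()
    return {
        "exists": exists,
        "non_empty": bool(text.strip()),
        # ATX headings such as "# Title" or "### Section".
        "headings": sum(1 for ln in lines if re.match(r"#{1,6}\s+\S", ln)),
        # Inline links of the form [label](target).
        "links": len(re.findall(r"\[[^\]]+\]\([^)\s]+\)", text)),
        # Lines that look like pipe-table rows, separator rows included.
        "table_lines": sum(1 for ln in lines if ln.lstrip().startswith("|")),
    }


def run_record(command: str, input_path: Path, output_path: Path) -> dict:
    """Record the command, package version, and paths next to the evidence."""
    try:
        pkg_version = version("markitdown")
    except PackageNotFoundError:
        pkg_version = "unknown (package not installed)"
    return {
        "command": command,
        "markitdown_version": pkg_version,
        "input_path": str(input_path),
        "output_path": str(output_path),
        "evidence": summarize_markdown(output_path),
    }
```

A call such as `run_record("markitdown sample.pdf -o sample.md", Path("sample.pdf"), Path("sample.md"))` returns a dictionary that can be saved with the output. A non-zero `table_lines` count says which files need manual table sampling; it does not say the tables are correct.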

PDF and OCR are where expectations drift

The Doramagic manual is most useful where it keeps the limits visible.

PDF conversion is best effort. Complex tables, headers, footers, multi-column layouts, scanned pages, and image-only documents can lose structure. The output may still be useful for search or rough summarization, but it should not become a source of record without sampling.

The markitdown-ocr plugin extends the system with LLM Vision OCR for embedded images and scanned PDFs. That can be the right tool, but it changes the operating model:

  • OCR may add model cost per page or image.
  • If no llm_client is provided, the plugin still loads, but OCR is skipped and the standard converter is used instead, so a run can appear to succeed without any OCR having happened.
  • OCR output should be sampled before an agent uses it for legal, financial, medical, or compliance-sensitive reasoning.

My first-run decision rule is:

  • GO: controlled DOCX/PDF input, Markdown generated, sample checks pass.
  • HOLD: tables or scanned pages are readable but not structurally reliable.
  • NO-GO: sensitive scanned documents are pushed into an agent workflow before OCR quality and access boundaries are checked.
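
The decision rule above can be encoded so the outcome is recorded rather than remembered. This is one possible sketch: `FirstRunFacts`, `Decision`, and `decide` are hypothetical names, the facts are filled in by a person after sampling, and defaulting every other case to HOLD is my assumption, not something the manual specifies.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    GO = "GO"
    HOLD = "HOLD"
    NO_GO = "NO-GO"


@dataclass
class FirstRunFacts:
    controlled_input: bool           # DOCX/PDF from a known, controlled source
    markdown_generated: bool         # conversion produced a Markdown artifact
    sample_checks_pass: bool         # sampled headings, links, and tables look reliable
    has_scanned_pages: bool          # scanned or image-only content is present
    is_sensitive: bool               # legal, financial, medical, or compliance material
    ocr_quality_checked: bool        # OCR output has been sampled
    access_boundaries_checked: bool  # allowed paths and schemes are defined


def decide(facts: FirstRunFacts) -> Decision:
    """Apply the GO / HOLD / NO-GO rule; anything unclear falls back to HOLD."""
    unchecked = not (facts.ocr_quality_checked and facts.access_boundaries_checked)
    if facts.is_sensitive and facts.has_scanned_pages and unchecked:
        return Decision.NO_GO
    if facts.controlled_input and facts.markdown_generated and facts.sample_checks_pass:
        return Decision.GO
    return Decision.HOLD
```

The NO-GO test runs first on purpose, so a sensitive scanned document cannot reach GO just because a Markdown file was produced.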

Treat the MCP server as a local tool, not a public service

MarkItDown also has an MCP server package. That is useful when an MCP-capable host needs a document-to-Markdown tool.

The safe default is local and narrow: trusted local agents, localhost binding, a small mounted work directory, and a clear rule for remote URLs.

A first instruction I would actually use:

Allow MarkItDown MCP to convert only files under /workdir/inbox/.
Do not fetch external URLs.
Write Markdown outputs under /workdir/out/.
Record input path, output path, file size, and MarkItDown version.
Stop on scanned PDFs, archives, remote URLs, empty output, or unknown extensions.

That turns MarkItDown into a bounded capability. Without that boundary, a document converter can quietly become a broad file and URL access tool.
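
A written instruction is easier to trust when a wrapper enforces the same stop conditions in code. The sketch below is an assumption-heavy illustration, not the MarkItDown MCP server's behavior: `convert_bounded`, `StopWorkflow`, and `ALLOWED_EXTENSIONS` are hypothetical, and the converter is passed in as a plain callable so the gate stays independent of any one library. It does not detect scanned PDFs; empty output is only a partial proxy for that case.

```python
from pathlib import Path
from typing import Callable

# Hypothetical policy values mirroring the instruction above.
INBOX = Path("/workdir/inbox")
OUTBOX = Path("/workdir/out")
ALLOWED_EXTENSIONS = {".pdf", ".docx"}  # assumption: archives and the rest are refused


class StopWorkflow(Exception):
    """Raised when a stop condition is hit; the agent should not continue."""


def convert_bounded(
    source: str,
    convert: Callable[[Path], str],
    inbox: Path = INBOX,
    outbox: Path = OUTBOX,
    markitdown_version: str = "unknown",
) -> dict:
    """Convert one inbox file to Markdown, or stop with a reason."""
    if "://" in source:
        raise StopWorkflow("remote URLs are not allowed")
    src = Path(source).resolve()
    if not src.is_relative_to(inbox.resolve()):
        raise StopWorkflow(f"input is outside {inbox}")
    if src.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise StopWorkflow(f"extension not allowed: {src.suffix or '(none)'}")
    markdown = convert(src)
    if not markdown.strip():
        raise StopWorkflow("empty output")
    outbox.mkdir(parents=True, exist_ok=True)
    out = outbox / f"{src.stem}.md"
    out.write_text(markdown, encoding="utf-8")
    # The audit record requested in the instruction.
    return {
        "input_path": str(src),
        "output_path": str(out),
        "input_bytes": src.stat().st_size,
        "markitdown_version": markitdown_version,
    }
```

With the library installed, the callable could be something like `lambda p: MarkItDown().convert(str(p)).text_content`, assuming the Python interface shown in the project README. Raising an exception instead of returning an empty result keeps a failed conversion from looking like a quiet success to the agent.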

A compact acceptance checklist

Before I let an agent rely on MarkItDown output, I want these checks visible:

| Check | Acceptance rule |
| --- | --- |
| Install scope | Use only needed extras, or explain why [all] is required. |
| Input boundary | Restrict to known directories or approved URLs. |
| Output evidence | Save Markdown and inspect headings, links, lists, and table samples. |
| PDF caveat | Mark complex layouts and scanned pages as lower confidence. |
| OCR path | Record whether markitdown-ocr, model client, and sampling are enabled. |
| MCP exposure | Keep local by default; do not expose a converter server without auth and network controls. |
| Failure handling | Empty output or unsupported format stops the agent workflow. |

The practical value of MarkItDown is not that it magically understands every document. It gives an agent a narrow path from messy files to inspectable text. The narrower that path is on day one, the easier it is to trust the next step.

Disclosure: this is a practitioner note based on an independent Doramagic capability manual and public MarkItDown repository material. It is not affiliated with or endorsed by Microsoft unless explicitly stated.
