DEV Community

Vigoss Luke

Beyond pip install: MarkItDown in Production

MarkItDown is Microsoft's open-source library for converting documents to Markdown. Run pip install 'markitdown[all]' (recent releases ship per-format support as optional extras) and you're converting DOCX, PDF, PPTX, and XLSX files in seconds.

But between "it works on my machine" and "it works in production," there's a gap the official docs don't cover.

What Breaks in Production

  • Silent failures: Encrypted PDFs return empty strings — no error, no warning
  • No timeout: Large PDFs can hang your pipeline with no way to cancel
  • Table scrambling: Merged cells and complex layouts lose structure
  • PDF noise: CID markers, duplicate sentences, zero heading hierarchy
  • Dependency fragility: Unpinned versions can silently change output or break conversion after an upgrade
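
The first two items (silent empty output and no timeout) can be guarded against from the outside. Below is a minimal sketch, not the library's own mechanism: it assumes the markitdown command-line tool is on PATH and prints Markdown to stdout, and it runs each conversion in a child process so a hung file can actually be killed. The function name convert_guarded and the three exception classes are hypothetical names chosen for this example.

```python
import subprocess


class ConversionTimeout(RuntimeError):
    """The converter did not finish within the allowed time."""


class ConversionFailed(RuntimeError):
    """The converter exited with a non-zero status."""


class EmptyConversion(RuntimeError):
    """The converter succeeded but produced no text (e.g. an encrypted PDF)."""


def convert_guarded(path, timeout_s=60.0, cmd=("markitdown",)):
    """Convert one file in a child process and return its Markdown.

    Assumption: `cmd` is a command that takes a file path as its last
    argument and writes Markdown to stdout, as the markitdown CLI does.
    """
    try:
        proc = subprocess.run(
            [*cmd, str(path)],
            capture_output=True,
            encoding="utf-8",
            errors="replace",
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired as exc:
        raise ConversionTimeout(f"{path}: no result after {timeout_s}s") from exc

    if proc.returncode != 0:
        detail = proc.stderr.strip() or f"exit code {proc.returncode}"
        raise ConversionFailed(f"{path}: {detail}")

    # Treat whitespace-only output as a failure instead of passing it on.
    if not proc.stdout.strip():
        raise EmptyConversion(f"{path}: converter returned no text")

    return proc.stdout
```

A child process costs some startup time per file, but it is the simplest way to get a hard cancel: a conversion running in a thread cannot be stopped from the outside.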

What Helps

  • Batch processing: Reuse a single MarkItDown() instance across hundreds of files
  • Docker + FastAPI: Production-ready API with file size limits and timeout handling
  • PDF cleanup pipeline: Python script that strips noise, deduplicates, and restores structure
  • Right LLM for images: Claude Sonnet 4 wins on detail, GPT-4o wins on chart accuracy
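
The batch and cleanup items can be sketched together. This is a rough illustration rather than the guide's actual script: convert_batch and clean_pdf_markdown are hypothetical names, the converter argument is assumed to behave like a MarkItDown() instance (a convert(path) method whose result has a text_content attribute), and the cleanup covers only CID markers and consecutive duplicate lines. Restoring heading structure is left out because it depends heavily on the documents involved.

```python
import re
from pathlib import Path

# Matches PDF extraction debris such as "(cid:123)".
CID_MARKER = re.compile(r"\(cid:\d+\)")


def clean_pdf_markdown(text):
    """Strip CID markers and drop lines that repeat the previous non-blank line.

    Heuristic only: legitimately repeated lines (e.g. identical table rows)
    are dropped as well.
    """
    text = CID_MARKER.sub("", text)
    kept = []
    last_nonblank = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line and line == last_nonblank:
            continue  # duplicate of the previous non-blank line
        if line:
            last_nonblank = line
        kept.append(line)
    # Collapse the blank-line runs that removed duplicates leave behind.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip() + "\n"


def convert_batch(paths, converter, clean=clean_pdf_markdown):
    """Convert many files with one shared converter instance.

    Returns (results, failures): Markdown keyed by path, and an error
    message keyed by path for every file that raised or came back empty.
    """
    results, failures = {}, {}
    for path in map(Path, paths):
        try:
            text = converter.convert(str(path)).text_content
        except Exception as exc:  # one bad file should not stop the batch
            failures[path] = f"{type(exc).__name__}: {exc}"
            continue
        if not text or not text.strip():
            failures[path] = "empty output"
            continue
        # Only PDFs get the noise cleanup; other formats pass through as-is.
        results[path] = clean(text) if path.suffix.lower() == ".pdf" else text
    return results, failures
```

With the real library, the call would look like convert_batch(Path("docs").iterdir(), MarkItDown()), creating the instance once and passing it in. Recording empty output as a failure keeps silently skipped files visible in the batch report.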

MCP Server Integration

Turn MarkItDown into a Claude Desktop tool — paste a file path in chat and get clean Markdown instantly. No terminal, no scripts.
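
Claude Desktop discovers MCP servers through a JSON config file with a top-level mcpServers object. The sketch below builds that entry in Python so it can be merged into an existing config without clobbering other servers. It assumes the markitdown-mcp package is installed and exposes a markitdown-mcp command; the helper name add_markitdown_server and the "markitdown" server label are hypothetical, and the config file's location depends on your OS and is not shown here.

```python
import json


def add_markitdown_server(config, command="markitdown-mcp", args=()):
    """Return a copy of a Claude Desktop config with a MarkItDown MCP entry.

    Assumption: `command` launches an MCP server over stdio, which is how
    the markitdown-mcp package is expected to run by default.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # keep existing servers
    servers["markitdown"] = {"command": command, "args": list(args)}
    merged["mcpServers"] = servers
    return merged


def render_config(config):
    """Serialize the config the way it would be written to disk."""
    return json.dumps(config, indent=2)
```

Load the existing config with json.load, pass it through add_markitdown_server, write the result back, and restart Claude Desktop so it picks up the new server.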

The full production guide has the code, Docker setup, and LLM comparison. See also the MarkItDown vs Unstructured vs LlamaParse comparison.
