Beyond pip install: MarkItDown in Production
MarkItDown is Microsoft's open-source library for converting documents to Markdown. A single pip install markitdown and you're converting DOCX, PDF, PPTX, and XLSX files in seconds.
But between "it works on my machine" and "it works in production," there's a gap the official docs don't cover.
What Breaks in Production
- Silent failures: Encrypted PDFs return empty strings — no error, no warning
- No timeout: Large PDFs can hang your pipeline with no way to cancel
- Table scrambling: Merged cells and complex layouts lose structure
- PDF noise: CID markers, duplicate sentences, zero heading hierarchy
- Dependency fragility: Unpinned dependency versions can silently break your pipeline after an upgrade
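The first two failure modes above can be guarded against with a thin wrapper. Below is a minimal sketch, not the guide's own implementation: it runs the converter in a child process so a hung conversion can be killed after a deadline, and it treats empty output as an error instead of passing it downstream. The names `ConversionError`, `convert_with_guard`, and `markitdown_cmd` are hypothetical, and the sketch assumes the `markitdown` command-line tool is installed and on PATH.

```python
import subprocess


class ConversionError(RuntimeError):
    """Raised when a conversion fails, times out, or yields no text."""


def convert_with_guard(cmd, timeout_s=60.0):
    """Run a converter command in a child process and return its stdout.

    cmd is an argument list such as ["markitdown", "report.pdf"].
    The child is killed if it runs longer than timeout_s seconds.
    """
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            encoding="utf-8",
            errors="replace",
            timeout=timeout_s,  # subprocess.run kills the child on expiry
        )
    except subprocess.TimeoutExpired as exc:
        raise ConversionError(f"conversion timed out after {timeout_s}s") from exc

    if proc.returncode != 0:
        raise ConversionError(
            f"converter exited with code {proc.returncode}: {proc.stderr.strip()}"
        )

    # An empty result is treated as a failure: the post reports that
    # encrypted PDFs come back as empty strings with no error raised.
    if not proc.stdout.strip():
        raise ConversionError("converter returned empty output")

    return proc.stdout


def markitdown_cmd(path):
    # Assumption: the `markitdown` CLI is on PATH and writes Markdown to stdout.
    return ["markitdown", path]
```

A call would look like `convert_with_guard(markitdown_cmd("report.pdf"), timeout_s=120)`. A child process is used here because a Python thread cannot be cancelled from outside once it is stuck inside a conversion; the trade-off is the startup cost of one process per file.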
What Helps
- Batch processing: Reuse a single MarkItDown() instance across hundreds of files
- Docker + FastAPI: Production-ready API with file size limits and timeout handling
- PDF cleanup pipeline: Python script that strips noise, deduplicates, and restores structure
- Right LLM for images: Claude 4 Sonnet wins on detail, GPT-4o wins on chart accuracy
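As a rough illustration of the batch and cleanup items above, here is a small sketch under stated assumptions; it is not the script from the full guide. `clean_pdf_markdown` strips `(cid:N)` markers and drops lines that repeat the previous non-blank line, a simplified stand-in for sentence-level deduplication. Restoring heading hierarchy is left out because it depends on the document. `convert_batch` takes any callable that maps a path to Markdown text, so one converter instance can be shared across all files. Both function names are hypothetical.

```python
import re

# Matches PDF extraction debris such as "(cid:12)".
CID_RE = re.compile(r"\(cid:\d+\)")


def clean_pdf_markdown(text):
    """Strip CID markers, drop repeated lines, and collapse blank runs."""
    text = CID_RE.sub("", text)
    kept = []
    prev = None  # last non-blank line that was kept
    for line in text.splitlines():
        # Collapse space runs left behind by removed markers,
        # but keep leading indentation (nested lists rely on it).
        line = re.sub(r"(?<=\S) {2,}", " ", line).rstrip()
        if line and line == prev:
            continue  # same as the previous non-blank line: treat as duplicate
        kept.append(line)
        if line:
            prev = line
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(kept))
    return cleaned.strip() + "\n"


def convert_batch(paths, convert):
    """Convert many files with one shared converter callable.

    Returns (results, errors): cleaned Markdown keyed by path, and the
    exception raised for each path that failed.
    """
    results, errors = {}, {}
    for path in paths:
        try:
            results[path] = clean_pdf_markdown(convert(path))
        except Exception as exc:  # one bad file should not stop the batch
            errors[path] = exc
    return results, errors
```

With MarkItDown, the callable can wrap a single instance created once before the loop: `md = MarkItDown()` followed by `convert_batch(paths, lambda p: md.convert(p).text_content)`. Note that the duplicate filter will also drop legitimately repeated lines, such as identical table rows, so it is worth checking against real documents before relying on it.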
MCP Server Integration
Turn MarkItDown into a Claude Desktop tool — paste a file path in chat and get clean Markdown instantly. No terminal, no scripts.
The full production guide covers the code, the Docker setup, and the LLM comparison. Also see the MarkItDown vs Unstructured vs LlamaParse comparison.