Vigoss Luke

Posted on Jun 9 • Originally published at markitdown-pro.com

How I Batch-Convert 100+ Documents to Markdown for LLM Ingestion — 3 Practical Scripts

#ai #llm #python #datascience

I had 300 PDFs, 50 DOCX files, and a pile of PPTX decks sitting in a directory — all the internal docs from three years of client projects. I needed clean Markdown for my LLM pipeline, and "open one by one and copy-paste" wasn't going to cut it.

Here's how I got it done in an afternoon with MarkItDown and three scripts.

Why Markdown Matters for LLMs

Before the code, a word on why this matters. LLM APIs bill by the token, and markup overhead counts against that budget. Here's what that looks like in practice (exact counts depend on the tokenizer):

# HTML: 23 tokens, most of them markup
'<h1 class="title" id="intro">Introduction</h1>'

# Markdown: 3 tokens for the same heading
'# Introduction'

That's roughly a 7.7x gap (23 / 3) on this heading, and similar overhead shows up on every paragraph wrapper and table cell. Across hundreds of documents, the difference between raw HTML and clean Markdown can mean 3–8x fewer tokens, which translates directly into lower API costs, faster inference, and more documents fitting into a single context window.
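To make the arithmetic concrete, here is a minimal sketch that turns token counts into a ratio and a projected bill. The function names, the 2,000,000-token corpus, the 3x overhead, and the $3-per-million price are all made-up illustration values, not measurements or real pricing.

```python
def token_savings(html_tokens: int, md_tokens: int) -> float:
    """Return how many times more tokens the HTML form uses."""
    if md_tokens <= 0:
        raise ValueError("md_tokens must be positive")
    return html_tokens / md_tokens


def projected_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD for `tokens` input tokens at a given price per million."""
    return tokens / 1_000_000 * usd_per_million


# Heading example from the article: 23 tokens as HTML, 3 as Markdown.
ratio = token_savings(23, 3)
print(f"HTML uses {ratio:.1f}x the tokens")  # HTML uses 7.7x the tokens

# Hypothetical corpus: 2,000,000 Markdown tokens versus the same content
# at a 3x markup overhead, priced at an assumed $3 per million tokens.
md_cost = projected_cost(2_000_000, 3.0)
html_cost = projected_cost(2_000_000 * 3, 3.0)
print(f"Markdown ${md_cost:.2f} vs HTML ${html_cost:.2f}")
```

For real numbers, swap the hard-coded counts for counts from the tokenizer your model actually uses.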

MarkItDown is Microsoft's open-source document-to-Markdown converter — 140K+ GitHub stars, MIT licensed, and gaining ~200 stars a day. It handles PDF, DOCX, PPTX, Excel, and 10+ other formats, all converting to clean, consistent Markdown.

Script 1: batch_convert.py — Recursive Directory Converter

This is the workhorse. Point it at a directory, and it recursively finds every supported file, converts it, and drops the .md next to the original.

#!/usr/bin/env python3
"""Batch convert documents to Markdown using MarkItDown."""
import argparse
import sys
from pathlib import Path
from markitdown import MarkItDown

SUPPORTED_EXTENSIONS = {
    '.pdf', '.docx', '.pptx', '.xlsx', '.html', '.htm',
    '.csv', '.json', '.xml', '.zip', '.msg', '.eml',
    '.rtf', '.odt', '.ods', '.odp', '.epub',
}


def collect_files(root: Path) -> list[Path]:
    """Recursively collect all supported files."""
    files = []
    for path in root.rglob('*'):
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS:
            files.append(path)
    return files


def convert_file(md: MarkItDown, input_path: Path, output_path: Path) -> bool:
    """Convert a single file to Markdown. Returns True on success."""
    try:
        result = md.convert(str(input_path))
        output_path.write_text(result.text_content, encoding='utf-8')
        return True
    except Exception as e:
        print(f"  ✗ {input_path.name}: {e}", file=sys.stderr)
        return False


def main():
    parser = argparse.ArgumentParser(
        description="Batch convert documents to Markdown"
    )
    parser.add_argument('directory', type=Path,
                        help='Root directory to scan recursively')
    parser.add_argument('--output-dir', type=Path, default=None,
                        help='Output directory (mirrors source structure)')
    parser.add_argument('--dry-run', action='store_true',
                        help='List files without converting')
    args = parser.parse_args()

    if not args.directory.is_dir():
        print(f"Error: {args.directory} is not a directory", file=sys.stderr)
        sys.exit(1)

    files = collect_files(args.directory)
    if not files:
        print(f"No supported files found in {args.directory}")
        return

    print(f"Found {len(files)} file(s) to convert:")
    for f in files:
        print(f"  {f.relative_to(args.directory)}")

    if args.dry_run:
        return

    md = MarkItDown()
    success = 0
    for input_path in files:
        rel = input_path.relative_to(args.directory)
        if args.output_dir:
            output_path = args.output_dir / rel.with_suffix('.md')
            output_path.parent.mkdir(parents=True, exist_ok=True)
        else:
            output_path = input_path.with_suffix('.md')

        print(f"Converting {rel}...", end=' ', flush=True)
        if convert_file(md, input_path, output_path):
            print("✓")
            success += 1
        else:
            print()  # end the progress line; the error itself went to stderr

    print(f"\nDone: {success}/{len(files)} converted successfully.")


if __name__ == '__main__':
    main()

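Usage is dead simple:

# Recent MarkItDown releases ship format support as optional extras;
# [all] pulls in the PDF, DOCX, PPTX, and Excel converters.
pip install 'markitdown[all]'
python batch_convert.py ./client-docs --output-dir ./markdown-output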

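With --output-dir, the script recursively finds every PDF, DOCX, PPTX, Excel, HTML, CSV, JSON, XML, email, RTF, ODF, and EPUB file under the source directory, converts each one, and mirrors the directory structure in the output folder. Pass --dry-run first to preview the file list without converting anything.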
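One edge case worth guarding against: with_suffix('.md') maps both report.pdf and report.docx to report.md, so whichever converts last overwrites the other. A minimal sketch of one way around that, keeping the original extension in the output name. The helper name output_path_for and the example paths are hypothetical, not part of the script above.

```python
from pathlib import Path
from typing import Optional


def output_path_for(input_path: Path, root: Path,
                    output_dir: Optional[Path] = None) -> Path:
    """Map a source file to its .md path, keeping the original extension."""
    rel = input_path.relative_to(root)
    # "report.pdf" -> "report.pdf.md", so "report.docx" gets its own file
    target = rel.with_name(rel.name + '.md')
    base = output_dir if output_dir is not None else root
    return base / target


root = Path('client-docs')
for name in ('acme/report.pdf', 'acme/report.docx'):
    # Each source file now maps to a distinct output path
    print(output_path_for(root / name, root, Path('markdown-output')))
```

The trade-off is slightly uglier file names in exchange for never silently losing a conversion.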

Script 2: server.py — When You Need an API Instead of CLI

Sometimes you're building a pipeline and need to trigger conversions via HTTP — from an n8n workflow, a Zapier webhook, or your own frontend. Here's a FastAPI wrapper that exposes MarkItDown as a REST API:

#!/usr/bin/env python3
"""FastAPI server wrapping MarkItDown for API-based conversion."""
import tempfile
import shutil
from pathlib import Path
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.responses import PlainTextResponse
from markitdown import MarkItDown

app = FastAPI(title="MarkItDown API", version="1.0.0")
converter = MarkItDown()

SUPPORTED_CONTENT_TYPES = {
    'application/pdf',
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    'application/vnd.openxmlformats-officedocument.presentationml.presentation',
    'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    'text/html',
    'text/csv',
    'application/json',
    'text/xml',
    'application/zip',
    'message/rfc822',
    'text/rtf',
    'application/epub+zip',
    'application/vnd.oasis.opendocument.text',
}


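@app.post("/convert", response_class=PlainTextResponse)
def convert_file(file: UploadFile = File(...)):
    """Upload a document, get back Markdown.

    Declared with plain `def` so FastAPI runs the blocking conversion
    in its threadpool instead of stalling the event loop.
    """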
    if file.content_type and file.content_type not in SUPPORTED_CONTENT_TYPES:
        raise HTTPException(
            status_code=415,
            detail=f"Unsupported type: {file.content_type}"
        )

    suffix = Path(file.filename).suffix if file.filename else '.tmp'
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = Path(tmp.name)

    try:
        result = converter.convert(str(tmp_path))
        return result.text_content
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    finally:
        tmp_path.unlink(missing_ok=True)


@app.get("/health")
async def health():
    return {"status": "ok", "formats": sorted(SUPPORTED_CONTENT_TYPES)}


if __name__ == '__main__':
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

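Now any service in your stack can POST a file and get Markdown back. Set the part's content type explicitly: curl labels extensions it doesn't recognize (including .docx) as application/octet-stream, which the allow-list above rejects with a 415:

curl -X POST http://localhost:8000/convert \
  -F "file=@report.docx;type=application/vnd.openxmlformats-officedocument.wordprocessingml.document"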
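Not every client labels uploads precisely; some send application/octet-stream, or nothing, for file types they don't recognize. A minimal sketch of a fallback that trusts a specific declared type but otherwise infers one from the file extension. EXTENSION_TYPES, GENERIC_TYPES, and resolve_content_type are hypothetical names for illustration, and the table covers only a few of the server's allowed types.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension-to-type table; extend it to match the allow-list.
EXTENSION_TYPES = {
    '.pdf': 'application/pdf',
    '.docx': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    '.pptx': 'application/vnd.openxmlformats-officedocument.presentationml.presentation',
    '.xlsx': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    '.html': 'text/html',
    '.csv': 'text/csv',
}

# Declared values that carry no real information about the file
GENERIC_TYPES = {None, '', 'application/octet-stream'}


def resolve_content_type(filename: Optional[str],
                         declared: Optional[str]) -> Optional[str]:
    """Prefer the declared type; fall back to the extension when it is generic."""
    if declared not in GENERIC_TYPES:
        return declared
    if not filename:
        return None
    return EXTENSION_TYPES.get(Path(filename).suffix.lower())


print(resolve_content_type('report.docx', 'application/octet-stream'))
print(resolve_content_type('notes.bin', None))  # None: unknown extension
```

In the endpoint, the resolved value would be checked against the allow-list in place of file.content_type. An extension is only a hint, so the conversion itself still needs its try/except.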

Script 3: pdf_cleanup.py — Post-Processing for Messy PDFs

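PDF conversion is where things get ugly. Scanned documents come out as empty strings, multi-column layouts produce jumbled text, and watermarks get scattered through the output. No script can recover text that was never extracted (scanned pages need OCR), but this one cleans up the most common artifacts in the text that does come through: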

#!/usr/bin/env python3
"""Post-process MarkItDown output for cleaner LLM-ready text."""
import re
import sys
from pathlib import Path


def clean_markdown(text: str) -> str:
    """Apply a pipeline of cleanup operations."""

    # Strip trailing whitespace on each line
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)

    # Remove lines made up only of punctuation/symbols (watermarks, rules).
    # Letters and digits in any script are kept, and lines containing "|"
    # or a backtick are skipped so table separator rows and code fences
    # survive.
    text = re.sub(
        r'^(?:[^\w\n|`]|_){3,}$',
        '',
        text,
        flags=re.MULTILINE
    )

    # Collapse runs of blank lines into a single blank line. This runs
    # after the removal above, which leaves empty lines behind.
    text = re.sub(r'\n{3,}', '\n\n', text)

    # Fix broken hyphenation: "con-\nvert" → "convert"
    # (a heuristic: it also joins real compounds such as "well-\nknown")
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)

    # Merge single-line-break paragraphs that aren't actually separate.
    # Blank-line separators are kept; headings, table rows, and fences are
    # never merged into the line that follows them.
    text = re.sub(
        r'^(?![#|`])(.+)\n(?=[^\s#\-*+>|`\d])',
        r'\1 ',
        text,
        flags=re.MULTILINE
    )

    # Normalize Unicode quotes and dashes
    text = text.replace('\u2018', "'").replace('\u2019', "'")
    text = text.replace('\u201c', '"').replace('\u201d', '"')
    text = text.replace('\u2013', '--').replace('\u2014', '---')

    return text.strip()


def main():
    import argparse
    parser = argparse.ArgumentParser(
        description='Clean up MarkItDown output for LLM use'
    )
    parser.add_argument('files', type=Path, nargs='+',
                        help='Markdown files to clean')
    parser.add_argument('--in-place', action='store_true',
                        help='Modify files in place')
    parser.add_argument('--output-dir', type=Path, default=None,
                        help='Write cleaned files to a separate directory')
    args = parser.parse_args()

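    if args.output_dir:
        args.output_dir.mkdir(parents=True, exist_ok=True)

    for filepath in args.files:
        if filepath.suffix.lower() != '.md':
            print(f"Skipping non-markdown file: {filepath}", file=sys.stderr)
            continue

        raw = filepath.read_text(encoding='utf-8')
        cleaned = clean_markdown(raw)

        if args.output_dir:
            # Files are written flat, so same-named files from different
            # subdirectories would overwrite each other here.
            out = args.output_dir / filepath.name
            out.write_text(cleaned, encoding='utf-8')
        elif args.in_place:
            filepath.write_text(cleaned, encoding='utf-8')
        else:
            print(f"=== {filepath.name} ===")
            print(cleaned)
            print()

        # Report savings on stderr so they don't mix with cleaned output
        savings = len(raw) - len(cleaned)
        if savings > 0:
            print(f"  {filepath.name}: removed {savings} chars ({savings * 100 // len(raw)}%)",
                  file=sys.stderr)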


if __name__ == '__main__':
    main()

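Run it after batch_convert to clean up all your output. Because batch_convert mirrors nested directories, use find rather than a top-level *.md glob so files in subfolders are included:

find ./markdown-output -name '*.md' -exec python pdf_cleanup.py --in-place {} +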
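Regex cleanup cannot help with scanned PDFs, which come out empty or nearly so. A minimal sketch for flagging those files so they can be routed to OCR instead. The name looks_scanned and the 50-character threshold are assumptions for illustration; tune the threshold against your own documents.

```python
import re


def looks_scanned(markdown: str, min_chars: int = 50) -> bool:
    """Return True when the text has too few letters/digits to be a real conversion."""
    # Count word characters only, so whitespace and stray symbols don't count
    meaningful = len(re.findall(r'\w', markdown))
    return meaningful < min_chars


# Made-up sample outputs: one empty scan, one normal report
samples = {
    'scan.md': '\n\n \n',
    'report.md': '# Q3 Report\n\n' + 'Revenue grew in every region. ' * 5,
}
needs_ocr = [name for name, text in samples.items() if looks_scanned(text)]
print(needs_ocr)  # ['scan.md']
```

Running a check like this right after conversion gives you a list of files to handle separately, rather than discovering empty documents once they are already in the LLM pipeline.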

Docker Option: One Command, Zero Setup

If you don't want to deal with Python environments (looking at you, PDF dependencies), everything's containerized:

git clone https://github.com/Jakeshadow/markitdown-batch-examples.git
cd markitdown-batch-examples

# Mount your documents and run
docker compose run --rm -v /path/to/docs:/input converter \
  python batch_convert.py /input --output-dir /input/markdown

The Docker image includes all the heavy dependencies (pdfminer, python-docx, openpyxl) so you don't fight with system libraries.

MarkItDown vs Unstructured.io: When to Use Which

I evaluated both before committing to MarkItDown. Here's the quick breakdown:

MarkItDown: MIT license, Python-native, dead simple API (md.convert("file.pdf")), produces clean semantic Markdown. Best for batch conversion pipelines where you want LLM-ready output with zero configuration.
Unstructured.io: Apache 2.0, supports more esoteric formats (JPG OCR, EML parsing with metadata), but heavier dependencies and more complex setup. Better if you need structured chunking metadata alongside the text.

For my use case (batch converting internal docs for LLM pipelines), MarkItDown won on simplicity and output quality. No configuration files, no partitioning strategies to tune. Just point it at a file and get Markdown.

Full comparison with code examples on both sides: MarkItDown vs Unstructured.io.

The Takeaway

Three scripts, one afternoon, and 300+ documents went from a mess of proprietary formats to clean, token-efficient Markdown. The whole pipeline is:

batch_convert.py — mass conversion of everything in a directory
pdf_cleanup.py — post-process to fix PDF artifacts
server.py — REST API when you need programmatic access

The full guide with installation walkthroughs, Docker setup, and performance benchmarks is at markitdown-pro.com. All three scripts plus Docker Compose files are in the GitHub companion repo.

Stop opening files one by one. Let the scripts do it.
