DEV Community

Robinhill85

Stop Wasting Time Cleaning Up PDFs. Automate Your Document-to-Markdown Workflow.

If you've ever tried feeding a PDF into a RAG pipeline or importing research into Obsidian, you know the drill. The text comes out broken. Headers are mangled, tables are flattened, formatting is gone. You end up spending more time cleaning the output than you would have spent retyping the thing manually.

Here's how I stopped doing that.

The actual problem with PDF extraction

Most extraction tools treat a PDF as a flat stream of text. They don't understand structure. Headings, lists, code blocks, and tables all get flattened into a wall of words in roughly the right order, with none of the hierarchy intact.

For RAG this matters a lot. Poor structure means poor chunks, poor chunks mean poor retrieval, and your LLM ends up working with garbage context no matter how good your embeddings are. The problem starts much earlier in the pipeline than most tutorials acknowledge.
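To make the failure mode concrete, here is a minimal sketch of what a naive fixed-size chunker does to flattened text. The sample string and the 40-character window are made up for illustration; real pipelines usually count tokens rather than characters, but the effect is the same.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical flattened extraction: two sections whose heading markers are gone.
flat = (
    "Installation Run the installer and accept the defaults. "
    "Troubleshooting If the installer fails, check the log file."
)

chunks = fixed_size_chunks(flat, 40)
for chunk in chunks:
    print(repr(chunk))
```

With no heading markers left to split on, one of the chunks ends up holding the tail of the installation section and the start of the troubleshooting section, which is exactly the kind of mixed context that hurts retrieval.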

What clean Markdown actually buys you

Structured Markdown keeps the document hierarchy alive. H1s stay H1s. Lists stay lists. Tables stay tables. When you chunk a well-structured Markdown file, your chunks respect logical boundaries instead of slicing mid-thought through a section.
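As a rough illustration of chunking on logical boundaries, the sketch below splits a Markdown string at its headings and records the heading path for each chunk. The function name `chunk_by_heading` and the sample document are my own for this example, and it only handles `#`-style headings and triple-backtick fences, so treat it as a starting point rather than a full Markdown parser.

```python
import re

# Matches "#"-style headings such as "## Install" (levels 1 to 6).
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading, keeping the heading path."""
    chunks = []
    path = []        # stack of (level, title) for the current position
    body = []        # lines collected under the current heading
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        # Track fenced code so a "# comment" inside code is not read as a heading.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level before pushing the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\n\nIntro text.\n\n## Install\n\n- step one\n- step two\n\n## Usage\n\nRun it.\n"
chunks = chunk_by_heading(doc)
for chunk in chunks:
    print(" > ".join(chunk["path"]), "|", repr(chunk["text"]))
```

Storing the heading path alongside the text means each chunk can carry its own context (for example "Guide > Install") into the embedding or the prompt.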

For Obsidian users it also just means your imported notes are actually usable. Navigable, linkable, readable. Not a wall of text with the formatting stripped out.

The workflow

I built file2markdown.ai (https://file2markdown.ai) to solve this for myself. It handles PDFs, Word docs, and images and returns clean structured Markdown.

1. Upload your file

Drag and drop via the web UI, or hit the API if you're automating a pipeline.

2. Get clean Markdown back

Heading hierarchy, lists, tables, and code blocks all come through intact. In most cases no cleanup is required.

3. Drop it into your pipeline

Paste straight into Obsidian, feed it into your chunker, use it as context in your LLM workflow. Whatever you need.
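If you'd rather script the Obsidian step than paste by hand, a vault is just a folder of `.md` files, so writing a note is a plain file write. The sketch below is one way to do it; `save_note` is a hypothetical helper, the character list it strips is my assumption about what is unsafe in note names and links, and the temporary directory stands in for a real vault path.

```python
import re
import tempfile
from pathlib import Path

# Assumption: characters that are unsafe in file names or in Obsidian links.
UNSAFE = re.compile(r'[\\/:*?"<>|#^\[\]]')


def save_note(vault: Path, title: str, markdown: str) -> Path:
    """Write converted Markdown into the vault folder as '<title>.md'."""
    name = UNSAFE.sub("-", title).strip() or "Untitled"
    target = vault / f"{name}.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(markdown, encoding="utf-8")
    return target


# Demo against a temporary folder standing in for a real vault.
with tempfile.TemporaryDirectory() as tmp:
    note = save_note(Path(tmp), "Q3 Report: Draft #2", "# Q3 Report\n\nBody text.\n")
    saved_name = note.name
    saved_text = note.read_text(encoding="utf-8")

print(saved_name)
```

Point `vault` at a subfolder such as an inbox directory if you want imported documents kept apart from hand-written notes.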

Free to try

There's a free tier: 20 conversions a day, up to 25 MB per file. No credit card, no setup, just try it. Paid plans are there if you're processing at higher volume.

If you're building RAG pipelines or just tired of wrestling with PDF text, give it a go and drop a comment if you run into anything interesting.

file2markdown.ai (https://file2markdown.ai)
