If you've ever tried feeding a PDF into a RAG pipeline or importing research into Obsidian, you know the drill. The text comes out broken. Headers are mangled, tables are flattened, formatting is gone. You end up spending more time cleaning the output than you would have spent retyping the thing manually.
Here's how I stopped doing that.
The actual problem with PDF extraction
Most extraction tools treat a PDF as a flat stream of text, with no model of its structure. Headings, lists, code blocks, and tables all get flattened into a wall of words in roughly the right order, with none of the hierarchy intact.
For RAG this matters a lot. Poor structure means poor chunks, poor chunks mean poor retrieval, and your LLM ends up working with garbage context no matter how good your embeddings are. The problem starts way earlier in the pipeline than most tutorials acknowledge.
What clean Markdown actually buys you
Structured Markdown keeps the document hierarchy alive. H1s stay H1s. Lists stay lists. Tables stay tables. When you chunk a well-structured Markdown file, your chunks respect logical boundaries instead of slicing mid-thought through a section.
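To make "chunks respect logical boundaries" concrete, here is a minimal sketch of a heading-aware chunker. It isn't what any particular RAG framework does internally; the function name `chunk_by_headings` and the output shape (a list of dicts with `path` and `text`) are my own assumptions for illustration. It only handles `#`-style headings and triple-backtick fences.

```python
import re

# ATX headings only ("# Title" to "###### Title"); setext headings are ignored.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # triple backtick, built this way to keep the example readable


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per section, tagged with its heading path."""
    chunks = []
    path = []   # stack of (level, title) for the headings currently in scope
    body = []   # lines collected for the current section
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": [title for _, title in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        # Track fenced code so a "# comment" inside code is not read as a heading.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or deeper level before entering the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its heading path (for example `["Guide", "Setup"]`), which you can prepend to the text before embedding so the retriever knows where the passage came from. None of this works on flattened PDF text, because there are no headings left to split on.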
For Obsidian users it also just means your imported notes are actually usable. Navigable, linkable, readable. Not a wall of text with the formatting stripped out.
The workflow
I built file2markdown.ai (https://file2markdown.ai) to solve this for myself. It handles PDFs, Word docs, and images and returns clean structured Markdown.
1. Upload your file
Drag and drop via the web UI, or hit the API if you're automating a pipeline.
2. Get clean Markdown back
Heading hierarchy, lists, tables, and code blocks all come through intact. In most cases no cleanup is required.
3. Drop it into your pipeline
Paste straight into Obsidian, feed it into your chunker, use it as context in your LLM workflow. Whatever you need.
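If you'd rather script the Obsidian step than paste by hand, a rough sketch follows. An Obsidian vault is a folder of Markdown files, so "importing" amounts to writing a `.md` file into it. The helper name `save_note`, the `Imports` subfolder, and the exact set of characters replaced in the title are assumptions of mine, not anything Obsidian or file2markdown.ai prescribes.

```python
import re
from pathlib import Path


def save_note(vault_dir, title, markdown, folder="Imports"):
    """Write converted Markdown into a vault folder and return the note's path."""
    # Replace characters that tend to cause trouble in file names or wiki links.
    # This character set is a conservative guess, not an official list.
    safe_title = re.sub(r'[\\/:*?"<>|#^\[\]]', "-", title).strip() or "Untitled"
    target_dir = Path(vault_dir) / folder
    target_dir.mkdir(parents=True, exist_ok=True)
    note_path = target_dir / f"{safe_title}.md"
    note_path.write_text(markdown, encoding="utf-8")
    return note_path
```

Call it as `save_note("/path/to/vault", "Q3 Report", converted_markdown)`, where the vault path and the `converted_markdown` variable are placeholders for your own values. An existing note with the same title is overwritten, so add a check if that matters to you.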
Free to try
There's a free tier: 20 conversions a day, up to 25 MB per file. No credit card, no setup, just try it. Paid plans are there if you're processing at higher volume.
If you're building RAG pipelines or just tired of wrestling with PDF text, give it a go and drop a comment if you run into anything interesting.
file2markdown.ai (https://file2markdown.ai)
