## The Problem
If you are building AI applications with LLMs, you know the pain: raw HTML is useless for training data. You need clean, structured Markdown.
Most solutions like Firecrawl or Crawl4AI require setup, dependencies, and often paid plans.
## The Manual Way
You could write your own parser:
```python
import re
import urllib.request

def html_to_markdown(url):
    html = urllib.request.urlopen(url).read().decode()
    # Remove scripts and styles
    html = re.sub(r"<script.*?</script>", "", html, flags=re.DOTALL)
    html = re.sub(r"<style.*?</style>", "", html, flags=re.DOTALL)
    # Convert <h1>..<h6> to Markdown headings
    for i in range(6, 0, -1):
        html = re.sub(r"<h%d[^>]*>(.*?)</h%d>" % (i, i),
                      "#" * i + r" \1" + "\n", html)
    # Strip remaining tags
    return re.sub(r"<[^>]+>", "", html).strip()
```
But this breaks on complex pages, misses metadata, and requires constant maintenance.
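The "breaks on complex pages" claim is easy to demonstrate: the final tag-stripping regex stops at the first `>` it sees, so any attribute value containing one gets mangled. A minimal sketch:

```python
import re

def strip_tags(html):
    # Same naive tag-stripper used in the parser above
    return re.sub(r"<[^>]+>", "", html)

# An attribute that contains ">" ends the match early,
# so attribute debris leaks into the "text" output:
print(strip_tags('<a href="x" data-note="a > b">link</a>'))
# → ' b">link', not 'link'
```

Real-world HTML is full of cases like this (inline SVG, comments, CDATA, malformed tags), which is why regex-based parsing is a maintenance treadmill.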
## The Automated Way
I built a free tool that handles all of this automatically. It extracts clean Markdown from any URL and preserves the document's structure: headings, links, and metadata.
Perfect for:
- RAG pipelines - feed clean text into vector databases
- LLM fine-tuning - structured training data from any website
- Content analysis - extract and analyze web content at scale
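For the RAG case specifically, clean Markdown makes chunking straightforward: headings give you natural split points before embedding. A minimal sketch (the function and splitting rule are my own illustration, not part of the tool):

```python
def chunk_markdown(markdown):
    """Split Markdown into heading-delimited chunks for a vector DB."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # Start a new chunk at each heading line
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

chunks = chunk_markdown("# Intro\nSome text.\n# Details\nMore text.")
# Each chunk is one heading plus its body, ready to embed.
```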
The tool runs on Apify, so no infrastructure needed. Just pass a URL and get Markdown back.
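One way to call it from Python is Apify's generic `run-sync-get-dataset-items` endpoint, which runs an actor and returns its dataset items in a single request. The actor ID (`username~actor-name`) and the `url` input field below are placeholders, not the tool's documented interface — check the Store page for the real values:

```python
import json
import os
import urllib.parse
import urllib.request

def build_run_url(actor_id, token):
    """Build the URL for Apify's run-sync-get-dataset-items endpoint,
    which runs an actor and returns its output dataset in one call."""
    return ("https://api.apify.com/v2/acts/%s/run-sync-get-dataset-items"
            "?token=%s" % (urllib.parse.quote(actor_id),
                           urllib.parse.quote(token)))

# Placeholder actor ID -- use the real one from the Apify Store page.
if os.environ.get("APIFY_TOKEN"):
    url = build_run_url("username~actor-name", os.environ["APIFY_TOKEN"])
    payload = json.dumps({"url": "https://example.com"}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    print(urllib.request.urlopen(req).read().decode())
```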
Check it out: Website to Markdown for LLM and RAG on Apify Store
## What It Extracts
- Page title and meta description
- Open Graph data
- Clean Markdown with headings, links, lists
- Word count
- Structured JSON output ready for your pipeline
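For downstream code, it helps to picture what one structured record could look like. The field names here are illustrative assumptions inferred from the list above, not the tool's actual schema:

```python
import json

# Field names are assumptions based on the feature list,
# not the tool's documented output schema.
record = json.loads("""
{
  "title": "Example Page",
  "metaDescription": "A sample page about widgets",
  "openGraph": {"og:title": "Example Page"},
  "markdown": "# Example Page\\n\\nSome body text.",
  "wordCount": 2
}
""")
text_for_embedding = record["markdown"]
```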
Free to use. Built with Python. No dependencies.
What tools do you use to prepare web data for your AI models? Let me know in the comments.