Juan Triviño

Free Website to Markdown Converter for LLM and RAG Pipelines

The Problem

If you are building AI applications with LLMs, you know the pain: raw HTML is full of markup noise and useless as training data. You need clean, structured Markdown.

Most solutions, like Firecrawl or Crawl4AI, require setup, dependencies, and often a paid plan.

The Manual Way

You could write your own parser:

import re
import urllib.request

def html_to_markdown(url):
    html = urllib.request.urlopen(url).read().decode()
    # Remove scripts and styles
    html = re.sub(r"<script.*?</script>", "", html, flags=re.DOTALL)
    html = re.sub(r"<style.*?</style>", "", html, flags=re.DOTALL)
    # Convert headings: <hN>text</hN> becomes N '#' characters plus the text
    for i in range(6, 0, -1):
        html = re.sub(
            r"<h%d.*?>(.*?)</h%d>" % (i, i),
            "#" * i + r" \1\n",
            html,
            flags=re.DOTALL,
        )
    # Strip all remaining tags
    return re.sub(r"<[^>]+>", "", html).strip()

But this regex-based approach breaks on complex pages, misses metadata, and requires constant maintenance.
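A sturdier manual approach is to use a real parser instead of regexes. Here is a minimal sketch built on Python's standard-library html.parser; it handles only headings, paragraphs, and list items, and everything beyond that (the class name, the tag subset) is illustrative, not exhaustive:

```python
from html.parser import HTMLParser

class MarkdownExtractor(HTMLParser):
    """Convert a small subset of HTML to Markdown using a real parser."""

    SKIP = {"script", "style"}
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []      # output fragments, joined at the end
        self.skip_depth = 0  # >0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            # "h3" -> 3 '#' characters
            self.parts.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "p":
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.HEADINGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def to_markdown(html):
    parser = MarkdownExtractor()
    parser.feed(html)
    return "".join(parser.parts).strip()
```

Because the parser tracks tag nesting for you, malformed attributes and inline scripts no longer leak into the output the way they do with regex substitution.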

The Automated Way

I built a free tool that handles all of this automatically. It extracts clean Markdown from any URL, preserving document structure: headings, links, and metadata.

Perfect for:

  • RAG pipelines - feed clean text into vector databases
  • LLM fine-tuning - structured training data from any website
  • Content analysis - extract and analyze web content at scale

The tool runs on Apify, so no infrastructure needed. Just pass a URL and get Markdown back.
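Calling an Actor on Apify can be done with a single HTTP request via the platform's run-sync-get-dataset-items endpoint, which starts the run and returns the dataset items in one call. A hedged sketch with the standard library only; the actor ID and the "url" input field name here are placeholders, so check the Actor's input schema on its Store page:

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def run_sync_url(actor_id, token):
    # Build the run-sync-get-dataset-items endpoint URL.
    # In actor IDs like "user/actor", the "/" is written as "~" in the path.
    return "%s/acts/%s/run-sync-get-dataset-items?token=%s" % (
        API_BASE, actor_id.replace("/", "~"), token)

def url_to_markdown(actor_id, token, page_url):
    # The input field name "url" is an assumption about the Actor's schema.
    payload = json.dumps({"url": page_url}).encode()
    req = urllib.request.Request(
        run_sync_url(actor_id, token),
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # list of dataset items (dicts)
```

The same call works from any language that can POST JSON, which is the point: no SDK, no scraper infrastructure on your side.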

Check it out: Website to Markdown for LLM and RAG on Apify Store

What It Extracts

  • Page title and meta description
  • Open Graph data
  • Clean Markdown with headings, links, lists
  • Word count
  • Structured JSON output ready for your pipeline
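For the RAG use case, the returned Markdown usually gets split into chunks before embedding, and headings are natural chunk boundaries. A minimal sketch of heading-based chunking (the function is illustrative, not part of the tool):

```python
import re

def chunk_by_headings(markdown):
    """Split Markdown into (heading, body) pairs for embedding."""
    chunks = []
    heading = None
    body = []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line):
            # A new heading closes the previous chunk.
            if heading is not None or body:
                chunks.append((heading, "\n".join(body).strip()))
            heading = line.lstrip("#").strip()
            body = []
        else:
            body.append(line)
    if heading is not None or body:
        chunks.append((heading, "\n".join(body).strip()))
    return chunks
```

Each (heading, body) pair can then be embedded separately, with the page title and meta description from the JSON output attached as metadata to every chunk.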

Free to use. Built with Python. No dependencies.


What tools do you use to prepare web data for your AI models? Let me know in the comments.
