## The Problem
If you are building AI applications with LLMs, you know the pain: raw HTML is useless for training data. You need clean, structured Markdown.
Most solutions like Firecrawl or Crawl4AI require setup, dependencies, and often paid plans.
## The Manual Way
You could write your own parser:
```python
import re
import urllib.request

def html_to_markdown(url):
    html = urllib.request.urlopen(url).read().decode()
    # Remove scripts and styles
    html = re.sub(r"<script.*?</script>", "", html, flags=re.DOTALL)
    html = re.sub(r"<style.*?</style>", "", html, flags=re.DOTALL)
    # Convert <h1>..<h6> to Markdown headings
    for i in range(6, 0, -1):
        html = re.sub(r"<h%d[^>]*>(.*?)</h%d>" % (i, i),
                      "#" * i + r" \1" + "\n", html)
    # Strip remaining tags
    return re.sub(r"<[^>]+>", "", html).strip()
```
But this breaks on complex pages, misses metadata, and requires constant maintenance.
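The "breaks on complex pages" claim is easy to demonstrate: the final tag-stripping regex stops at the first `>` it sees, so any attribute value containing one gets mangled. A minimal sketch:

```python
import re

def strip_tags(html):
    # Same naive tag-stripper used in the parser above
    return re.sub(r"<[^>]+>", "", html)

# An attribute that contains ">" ends the match early,
# so attribute debris leaks into the "text" output:
print(strip_tags('<a href="x" data-note="a > b">link</a>'))
# → ' b">link', not 'link'
```

Real-world HTML is full of cases like this (inline SVG, comments, CDATA, malformed tags), which is why regex-based parsing is a maintenance treadmill.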
## The Automated Way
I built a free tool that handles all of this automatically. It extracts clean Markdown from any URL and preserves the document's structure: headings, links, and metadata.
Perfect for:
- RAG pipelines - feed clean text into vector databases
- LLM fine-tuning - structured training data from any website
- Content analysis - extract and analyze web content at scale
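For the RAG case specifically, clean Markdown makes chunking straightforward: headings give you natural split points before embedding. A minimal sketch (the function and splitting rule are my own illustration, not part of the tool):

```python
def chunk_markdown(markdown):
    """Split Markdown into heading-delimited chunks for a vector DB."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # Start a new chunk at each heading line
        if line.startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

chunks = chunk_markdown("# Intro\nSome text.\n# Details\nMore text.")
# Each chunk is one heading plus its body, ready to embed.
```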
The tool runs on Apify, so no infrastructure needed. Just pass a URL and get Markdown back.
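One way to call it from Python is Apify's generic `run-sync-get-dataset-items` endpoint, which runs an actor and returns its dataset items in a single request. The actor ID (`username~actor-name`) and the `url` input field below are placeholders, not the tool's documented interface — check the Store page for the real values:

```python
import json
import os
import urllib.parse
import urllib.request

def build_run_url(actor_id, token):
    """Build the URL for Apify's run-sync-get-dataset-items endpoint,
    which runs an actor and returns its output dataset in one call."""
    return ("https://api.apify.com/v2/acts/%s/run-sync-get-dataset-items"
            "?token=%s" % (urllib.parse.quote(actor_id),
                           urllib.parse.quote(token)))

# Placeholder actor ID -- use the real one from the Apify Store page.
if os.environ.get("APIFY_TOKEN"):
    url = build_run_url("username~actor-name", os.environ["APIFY_TOKEN"])
    payload = json.dumps({"url": "https://example.com"}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    print(urllib.request.urlopen(req).read().decode())
```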
Check it out: Website to Markdown for LLM and RAG on Apify Store
## What It Extracts
- Page title and meta description
- Open Graph data
- Clean Markdown with headings, links, lists
- Word count
- Structured JSON output ready for your pipeline
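For downstream code, it helps to picture what one structured record could look like. The field names here are illustrative assumptions inferred from the list above, not the tool's actual schema:

```python
import json

# Field names are assumptions based on the feature list,
# not the tool's documented output schema.
record = json.loads("""
{
  "title": "Example Page",
  "metaDescription": "A sample page about widgets",
  "openGraph": {"og:title": "Example Page"},
  "markdown": "# Example Page\\n\\nSome body text.",
  "wordCount": 2
}
""")
text_for_embedding = record["markdown"]
```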
Free to use. Built with Python. No dependencies.
What tools do you use to prepare web data for your AI models? Let me know in the comments.