DEV Community

zephyrooo

How to Convert Any Webpage to Clean Markdown for AI Workflows

If you have ever pasted a webpage into ChatGPT or Claude, you have probably noticed that the results are hit-or-miss. A big part of the problem is that raw HTML wastes 80-90% of your context window on nav bars, ads, scripts, and layout noise.

The Problem

A typical 1,500-word blog post lives inside 50-80KB of HTML. The actual content? Maybe 6-8KB. You are paying for tokens that add zero value.

I tested three real pages:

  • News article: 14,800 tokens raw HTML vs 2,100 clean Markdown (86% waste)
  • React docs: 22,400 vs 5,800 tokens (74% waste)
  • Reddit thread: 38,600 vs 6,200 tokens (84% waste)
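Those waste percentages follow directly from the token counts. Here is a quick sketch that reproduces them; the `estimate_tokens` heuristic (~4 characters per token for English text) is a rough rule of thumb, not any model's actual tokenizer:

```python
# Rough token estimate: ~4 characters per token for English text.
# Real counts vary by tokenizer; use your model's tokenizer for exact numbers.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def waste_pct(raw_tokens: int, clean_tokens: int) -> int:
    """Share of the raw token budget that carries no content, as a percentage."""
    return round(100 * (1 - clean_tokens / raw_tokens))

print(waste_pct(14_800, 2_100))  # news article -> 86
print(waste_pct(22_400, 5_800))  # React docs   -> 74
print(waste_pct(38_600, 6_200))  # Reddit thread -> 84
```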

Why Markdown?

Markdown wins because:

  • Structure without noise — headings, lists, code blocks survive
  • LLMs are trained on it — every GitHub repo uses Markdown
  • Token efficient — the same content fits in a fraction of the tokens

My Workflow

I built Web2MD to solve this. It is a Chrome extension that converts any webpage to clean Markdown with one click. The conversion engine uses 130+ CSS selectors to strip boilerplate and has dedicated extractors for 14 platforms (YouTube subtitles, Reddit threads, GitHub READMEs, arXiv papers, etc.).

All processing happens locally in your browser — nothing is uploaded.
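To make the idea concrete, here is a minimal, standard-library-only sketch of selector-based stripping — this is an illustration of the technique, not Web2MD's actual engine, and the tag list is a tiny stand-in for its 130+ selectors:

```python
from html.parser import HTMLParser

# Tags whose entire subtree is boilerplate (a tiny stand-in for a real selector list).
NOISE_TAGS = {"nav", "script", "style", "aside", "footer", "header"}

class MarkdownExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside a noise subtree
        self.out = []     # collected Markdown blocks
        self.prefix = ""  # Markdown prefix for the current block

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth += 1
        elif self.depth == 0 and tag in {"h1", "h2", "h3"}:
            self.prefix = "#" * int(tag[1]) + " "
        elif self.depth == 0 and tag == "li":
            self.prefix = "- "

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.depth -= 1
        elif tag in {"h1", "h2", "h3", "p", "li"}:
            self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth == 0:
            self.out.append(self.prefix + text)
            self.prefix = ""

def to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    return "\n\n".join(parser.out)

print(to_markdown("<nav>Home | About</nav><h1>Title</h1><p>Body text.</p>"))
# -> "# Title" and "Body text." -- the nav bar is gone
```

A production converter also has to handle links, code blocks, tables, and per-site quirks, which is where the dedicated platform extractors come in.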

The Math

At GPT-4o pricing ($2.50/1M input tokens), processing 30 pages/day:

  • Raw HTML: $1.50/day
  • Clean Markdown: $0.30/day
  • Savings: $36/month
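The numbers above check out if you assume the averages implied by the test pages — roughly 20k tokens per raw HTML page and 4k per Markdown page (my assumption, not stated in the article):

```python
# Back-of-envelope cost math at GPT-4o input pricing.
PRICE_PER_TOKEN = 2.50 / 1_000_000  # $2.50 per 1M input tokens
PAGES_PER_DAY = 30

def daily_cost(tokens_per_page: int) -> float:
    return PAGES_PER_DAY * tokens_per_page * PRICE_PER_TOKEN

raw = daily_cost(20_000)   # assumed average for raw HTML
clean = daily_cost(4_000)  # assumed average for clean Markdown
print(f"${raw:.2f}/day raw, ${clean:.2f}/day clean")   # $1.50 vs $0.30
print(f"${(raw - clean) * 30:.2f}/month saved")        # $36.00
```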

Web2MD is free (3 conversions/day). Pro is $9/month for unlimited.

What is your current workflow for feeding web content to LLMs?
