If you have ever pasted a webpage into ChatGPT or Claude, you have probably noticed the output quality is inconsistent. That is because raw HTML wastes 75-90% of your context window on nav bars, ads, scripts, and layout noise.
## The Problem
A typical 1,500-word blog post lives inside 50-80KB of HTML. The actual content? Maybe 6-8KB. You are paying for tokens that add zero value.
I tested 3 real pages:
- News article: 14,800 tokens raw HTML vs 2,100 clean Markdown (86% waste)
- React docs: 22,400 vs 5,800 tokens (74% waste)
- Reddit thread: 38,600 vs 6,200 tokens (84% waste)
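The waste percentages follow directly from the raw and clean token counts. A quick sanity check on the three measurements:

```python
# Waste = 1 - (clean tokens / raw tokens), using the numbers measured above.
pages = {
    "News article": (14_800, 2_100),
    "React docs": (22_400, 5_800),
    "Reddit thread": (38_600, 6_200),
}
for name, (raw_tokens, clean_tokens) in pages.items():
    waste = 1 - clean_tokens / raw_tokens
    print(f"{name}: {waste:.0%} waste")
# News article: 86% waste
# React docs: 74% waste
# Reddit thread: 84% waste
```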
## Why Markdown?
Markdown wins because:
- Structure without noise — headings, lists, code blocks survive
- LLMs are trained on it — every GitHub repo uses Markdown
- Token efficient — the same content fits in a fraction of the tokens, as the numbers above show
## My Workflow
I built Web2MD to solve this. It is a Chrome extension that converts any webpage to clean Markdown with one click. The conversion engine uses 130+ CSS selectors to strip boilerplate and has dedicated extractors for 14 platforms (YouTube subtitles, Reddit threads, GitHub READMEs, arXiv papers, etc.).
All processing happens locally in your browser — nothing is uploaded.
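Web2MD's conversion engine isn't open source, but the core idea — strip noise elements, keep content structure — can be sketched with nothing but the standard library. This toy extractor (tag names and the Markdown mapping are my own simplifications, not Web2MD's actual selector list) skips anything inside common boilerplate tags and emits headings and paragraphs as Markdown:

```python
from html.parser import HTMLParser

# Tags whose contents are almost always boilerplate, not article content.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class ContentExtractor(HTMLParser):
    """Skip text inside noise tags; convert h1-h3 and p to Markdown."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # >0 while inside a noise tag

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ("h1", "h2", "h3"):
            self.out.append("#" * int(tag[1]) + " ")

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif self.skip_depth == 0 and tag in ("h1", "h2", "h3", "p"):
            self.out.append("\n\n")

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.out.append(data.strip())

def html_to_md(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return "".join(parser.out).strip()

page = "<nav>Home | About</nav><h1>Title</h1><p>Body text.</p><script>track()</script>"
print(html_to_md(page))
# → "# Title\n\nBody text."
```

A production converter needs far more than this (CSS-selector matching, tables, code blocks, per-site extractors), but the skip-and-emit structure is the same.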
## The Math
At GPT-4o pricing ($2.50/1M input tokens), processing 30 pages/day:
- Raw HTML: $1.50/day
- Clean Markdown: $0.30/day
- Savings: $36/month
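The daily totals imply roughly 20k raw tokens and 4k clean tokens per page (my inference from the figures above, not measured per-page averages). The arithmetic checks out:

```python
# GPT-4o input pricing: $2.50 per 1M tokens.
PRICE_PER_TOKEN = 2.50 / 1_000_000
PAGES_PER_DAY = 30

raw_cost = 20_000 * PAGES_PER_DAY * PRICE_PER_TOKEN    # tokens/page assumed
clean_cost = 4_000 * PAGES_PER_DAY * PRICE_PER_TOKEN
monthly_savings = (raw_cost - clean_cost) * 30

print(f"${raw_cost:.2f}/day raw, ${clean_cost:.2f}/day clean, "
      f"${monthly_savings:.0f}/month saved")
# → "$1.50/day raw, $0.30/day clean, $36/month saved"
```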
Web2MD is free (3 conversions/day). Pro is $9/month for unlimited.
What is your current workflow for feeding web content to LLMs?