Ibrahim Abdulmajid

How to Convert Webpages into Clean Markdown for LLMs (in 5ms)

If you've ever tried feeding raw web pages into an LLM (like GPT-4 or Claude) to build a chatbot, search assistant, or RAG (Retrieval-Augmented Generation) pipeline, you quickly run into a major problem: HTML is incredibly noisy.

A typical web page is packed with layout clutter:

  • Navigation bars and footers
  • Advertisement containers
  • Tracking scripts and stylesheets
  • Cookie consent banners and SVG icons

Feeding all that raw markup into an LLM can waste up to 80% of your context window on junk, which drives up your API costs and can confuse the model.
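
To get a rough sense of the overhead on a given page, you can compare its visible text to the full HTML. The sketch below is illustrative and not part of the service: `VisibleTextCounter` and `markup_overhead` are hypothetical names, and it uses character counts as a stand-in for tokens, so treat the result as a ballpark figure only.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Collects text that is not inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # >0 while inside a skipped element
        self.chunks = []     # visible text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def markup_overhead(html: str) -> float:
    """Fraction of characters in `html` that are not visible text.

    Assumption: characters are used as a crude proxy for tokens.
    """
    if not html:
        return 0.0
    parser = VisibleTextCounter()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so indentation does not count as content
    visible = " ".join(" ".join(parser.chunks).split())
    return 1 - len(visible) / len(html)
```

Calling `markup_overhead(page_html)` on a page you have already downloaded returns a value between 0 and 1; the closer it is to 1, the more of the page is markup rather than text.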

In this guide, we'll look at how to strip HTML noise and extract clean, semantic markdown ready for LLM processing in under 5 milliseconds.


The Clean Extraction Pipeline

To get clean text, a request needs to go through a multi-stage sanitizer:

  1. Custom UA Fetching: Retrieve the target page with a custom User-Agent (UA) header so the request is less likely to be blocked.
  2. Boilerplate Stripping: Remove non-content tags like <script>, <style>, <nav>, <header>, <footer>, <form>, and social widgets.
  3. Core Element Focus: Isolate the primary text container (prioritizing <article> or <main> over <body>).
  4. Markdown Translation: Convert the clean HTML elements into semantic Markdown syntax (headers, lists, tables).
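
To make steps 2 to 4 concrete, here is a minimal local sketch using only Python's standard library. It is not the service's implementation: `MarkdownExtractor` and `html_to_markdown` are hypothetical names, the tag list is the one from step 2, and only headings, paragraphs, and flat list items are translated (tables, links, images, and nested lists are left out to keep it short).

```python
from html.parser import HTMLParser

BOILERPLATE = {"script", "style", "nav", "header", "footer", "form"}
CONTAINERS = {"article", "main"}
HEADINGS = {f"h{n}": "#" * n + " " for n in range(1, 7)}
BLOCKS = {"p", "li", *HEADINGS}


class MarkdownExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0     # >0 while inside a boilerplate element
        self.focus_depth = 0    # >0 while inside <article> or <main>
        self.prefix = None      # Markdown prefix of the open block, if any
        self.text = []          # text fragments of the open block
        self.all_blocks = []    # every block found in the document
        self.focus_blocks = []  # blocks found inside <article>/<main>

    def _flush(self):
        """Emit the open block (if it has text) and reset the state."""
        line = " ".join("".join(self.text).split())
        if self.prefix is not None and line:
            block = self.prefix + line
            self.all_blocks.append(block)
            if self.focus_depth:
                self.focus_blocks.append(block)
        self.prefix, self.text = None, []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self.skip_depth += 1
        elif tag in CONTAINERS:
            self.focus_depth += 1
        elif tag in BLOCKS and not self.skip_depth:
            self._flush()  # tolerate unclosed <p> or <li>
            self.prefix = HEADINGS.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag in BOILERPLATE:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in CONTAINERS:
            self.focus_depth = max(0, self.focus_depth - 1)
        elif tag in BLOCKS:
            self._flush()

    def handle_data(self, data):
        if self.prefix is not None and not self.skip_depth:
            self.text.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    # Step 3: prefer content from <article>/<main>, else fall back to everything
    blocks = parser.focus_blocks or parser.all_blocks
    out = []
    for block in blocks:
        # Keep consecutive list items together; separate other blocks with a blank line
        if out and not (block.startswith("- ") and out[-1].startswith("- ")):
            out.append("")
        out.append(block)
    return "\n".join(out)
```

One consequence of the tag list: a `<header>` nested inside `<article>` is dropped along with any title it contains, so a production sanitizer would likely need more nuanced rules than this sketch.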

Implementation

We can do this easily using a lightweight edge microservice. Here is how to call it from JavaScript or Python:

Node.js (Fetch)

async function scrapeCleanMarkdown(targetUrl) {
  const apiKey = 'YOUR_RAPIDAPI_KEY'; // Get your key from RapidAPI
  const host = 'ai-web-scraper-markdown-extractor.p.rapidapi.com';

  // We set mode=standard to keep image links, or text_only to drop them
  const url = `https://${host}/scraper?url=${encodeURIComponent(targetUrl)}&mode=standard`;

  const response = await fetch(url, {
    method: 'GET',
    headers: {
      'x-rapidapi-key': apiKey,
      'x-rapidapi-host': host
    }
  });

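  // Fail fast on HTTP-level errors (e.g. an invalid key or rate limiting)
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }

  const result = await response.json();

  if (result.success) {
    // Optional chaining guards against a response without a stats object
    console.log("Estimated LLM Tokens saved:", result.stats?.estimated_llm_tokens);
    return result.markdown;
  } else {
    throw new Error(result.error || 'Failed to scrape');
  }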
}

// Usage example:
scrapeCleanMarkdown('https://example.com')
  .then(markdown => console.log(markdown))
  .catch(err => console.error(err));

Python (Requests)

import requests

def scrape_clean_markdown(target_url):
    url = "https://ai-web-scraper-markdown-extractor.p.rapidapi.com/scraper"
    querystring = {"url": target_url, "mode": "standard"}

    headers = {
        "x-rapidapi-key": "YOUR_RAPIDAPI_KEY",
        "x-rapidapi-host": "ai-web-scraper-markdown-extractor.p.rapidapi.com"
    }

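    # A timeout keeps the call from hanging if the service is slow to respond
    response = requests.get(url, headers=headers, params=querystring, timeout=10)
    response.raise_for_status()  # Surface HTTP errors (bad key, rate limit) early
    data = response.json()

    if data.get("success"):
        return data.get("markdown")
    else:
        raise RuntimeError(data.get("error", "Failed to scrape"))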

# Usage:
# print(scrape_clean_markdown("https://example.com"))
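
The Python version above returns only the Markdown, while the Node.js version also reads `stats.estimated_llm_tokens`. If you want both in Python, a small parsing helper is one option. This is a sketch: `ScrapeResult` and `parse_scrape_response` are hypothetical names, the field names are taken from the snippets above, and the token count being an integer is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrapeResult:
    markdown: str
    estimated_llm_tokens: Optional[int]  # None if the response has no stats


def parse_scrape_response(data: dict) -> ScrapeResult:
    """Turn the decoded JSON payload into a typed result or raise on failure."""
    if not data.get("success"):
        raise RuntimeError(data.get("error", "Failed to scrape"))
    stats = data.get("stats") or {}
    return ScrapeResult(
        markdown=data.get("markdown", ""),
        estimated_llm_tokens=stats.get("estimated_llm_tokens"),
    )
```

You would pass it the decoded JSON, for example `parse_scrape_response(response.json())`, and log `estimated_llm_tokens` alongside each page to track how much context each document consumes.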

Live Demo & Testing

I built a complete playground dashboard where you can run this scraper (and 14 other developer microservices) live directly in your browser with zero key setup:

👉 Try the Live Interactive Playground

If you want to grab your API keys and integrate it into your production applications, you can find the listing page on RapidAPI:

👉 Subscribe on RapidAPI

How are you cleaning web content for your AI integration flows? Let me know in the comments below!
