DEV Community

Алексей Спинов

Extract Clean Text from Any Webpage for RAG Pipelines

Building RAG (Retrieval-Augmented Generation) systems? You need clean text, not raw HTML.

Here's a simple approach using CheerioCrawler, where $ is the Cheerio handle passed to the request handler:

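// Remove noise
$("script, style, nav, footer, header, aside, .ad, noscript").remove();

// Get main content: try each selector in priority order.
// (A single comma-separated selector would return the first match in
// document order, not in priority order.)
let text = "";
for (const sel of ["article", "[role=main]", "main", ".content", ".post-content"]) {
  text = $(sel).first().text();
  if (text.length >= 100) break;
}
// Fall back to the whole body if nothing substantial was found
if (text.length < 100) text = $("body").text();

// Clean whitespace
text = text.replace(/\s+/g, " ").trim();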

Why Not Just Use $("body").text()?

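Raw body text includes navigation menus, footer links, cookie banners, and ad text. For RAG, you want ONLY the main content: everything else gets chunked and embedded too, and those boilerplate chunks can surface in retrieval in place of the real answer.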

The Priority Order

  1. <article> tag — most semantic, usually contains the main content
  2. [role="main"] — ARIA landmark
  3. <main> — HTML5 semantic element
  4. .content, .post-content — common CSS classes
  5. <body> — fallback
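
The priority order above can be sketched as a small standalone function. This is only an illustration, not the code the tool itself uses: pickMainText, PRIORITY, and the getText callback are hypothetical names, and the 100-character threshold is borrowed from the snippet earlier in the post.

```javascript
// Hypothetical helper: walk the selectors in priority order and return the
// first one whose text is long enough to look like real content.
// `getText` is an assumed callback that returns the text for a selector
// (or "" when nothing matches), so the logic does not depend on the parser.
const PRIORITY = ["article", "[role=main]", "main", ".content", ".post-content"];

function pickMainText(getText, selectors = PRIORITY, minLength = 100) {
  for (const selector of selectors) {
    const text = (getText(selector) || "").trim();
    if (text.length >= minLength) {
      return { selector, text };
    }
  }
  // Nothing matched with enough text: fall back to the whole body.
  return { selector: "body", text: (getText("body") || "").trim() };
}
```

With Cheerio, the callback could be something like (sel) => $(sel).first().text(). Returning the winning selector alongside the text makes it easy to log which rule fired for each page.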

Output

{
  "url": "https://example.com/blog/post",
  "title": "The Blog Post Title",
  "text": "Clean extracted text...",
  "wordCount": 1450,
  "characterCount": 8700
}
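
A record like the one above could be assembled roughly as follows. This is a minimal sketch under assumptions: buildRecord and its maxLength parameter are hypothetical, and counting words by splitting on spaces is a guess at how wordCount might be computed, not a description of the tool.

```javascript
// Hypothetical helper: build the output record from raw extracted text.
// `maxLength` is an assumed optional cap in characters (0 means no cap).
function buildRecord(url, title, rawText, maxLength = 0) {
  // Collapse every run of whitespace into a single space.
  let text = rawText.replace(/\s+/g, " ").trim();
  if (maxLength > 0 && text.length > maxLength) {
    text = text.slice(0, maxLength).trim();
  }
  return {
    url,
    title,
    text,
    // Whitespace-separated tokens; empty text counts as zero words.
    wordCount: text === "" ? 0 : text.split(" ").length,
    // String.length counts UTF-16 code units, not grapheme clusters.
    characterCount: text.length,
  };
}

const record = buildRecord(
  "https://example.com/blog/post",
  "The Blog Post Title",
  "  Clean   extracted\n\ttext  "
);
console.log(JSON.stringify(record, null, 2));
```

Computing the counts after whitespace cleanup keeps them consistent with the text field that actually gets stored.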

I built a Text Extractor tool on Apify that does this at scale: it batch-processes multiple URLs, removes navigation automatically, and supports a configurable max length.

It's free on the Apify Store: search for knotless_cadence text-extractor.
