Building RAG (Retrieval-Augmented Generation) systems? You need clean text, not raw HTML.
Here's a simple approach using CheerioCrawler, where $ is the Cheerio handle the crawler passes to your request handler:
// Remove noise
$("script, style, nav, footer, header, aside, .ad, noscript").remove();
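// Get main content: try selectors in priority order
const selectors = ["article", "[role=main]", "main", ".content, .post-content"];
let text = "";
for (const selector of selectors) {
  text = $(selector).first().text().trim();
  if (text.length >= 100) break;
}
// Fall back to the whole body when no candidate has enough text
if (text.length < 100) text = $("body").text();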
// Clean whitespace
text = text.replace(/\s+/g, " ").trim();
Why Not Just Use body.text()?
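Raw body text includes navigation menus, footer links, cookie banners, and ad text. For RAG, you want only the main content, because boilerplate that repeats on every page gets chunked and embedded along with the article and can surface as irrelevant retrieval results.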
The Priority Order
- <article> tag: most semantic, usually contains the main content
- [role="main"]: ARIA landmark
- <main>: HTML5 semantic element
- .content, .post-content: common CSS classes
- <body>: fallback
Output
{
  "url": "https://example.com/blog/post",
  "title": "The Blog Post Title",
  "text": "Clean extracted text...",
  "wordCount": 1450,
  "characterCount": 8700
}
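To make the output shape concrete, here is a minimal sketch of how such a record could be assembled from already-extracted text. The helper name buildRecord and its signature are assumptions made for illustration, not part of the original snippet or of any library, and the word count is a simple whitespace split rather than a linguistic tokenizer.

```javascript
// Hypothetical helper: the name and signature are illustrative only.
function buildRecord(url, title, rawText) {
  // Same whitespace normalization as the extraction step
  const text = rawText.replace(/\s+/g, " ").trim();
  // An empty string would split to [""], so guard to report 0 words
  const wordCount = text === "" ? 0 : text.split(" ").length;
  return {
    url,
    title: title.trim(),
    text,
    wordCount,
    // String length counts UTF-16 code units, not user-perceived characters
    characterCount: text.length,
  };
}

const record = buildRecord(
  "https://example.com/blog/post",
  "  The Blog Post Title ",
  "Clean   extracted\n text...",
);
console.log(JSON.stringify(record, null, 2));
```

Inside a crawler, a record like this could then be stored with Crawlee's dataset API (for example Dataset.pushData); that wiring is left out here to keep the sketch dependency-free.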
I built a Text Extractor tool on Apify that does this at scale: it batch-processes multiple URLs, removes navigation automatically, and supports a configurable maximum length.
It is free on the Apify Store: search for knotless_cadence text-extractor.