Building RAG (Retrieval-Augmented Generation) systems? You need clean text, not raw HTML.
Here's a simple approach using CheerioCrawler, where `$` is the Cheerio handle passed to the request handler:
// Remove noise
$("script, style, nav, footer, header, aside, .ad, noscript").remove();
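// Get main content (the first matching element in document order)
let text = $("article, [role=main], main, .content").first().text();
// Fall back to the whole body when nothing matched or the match is very short
if (!text || text.length < 100) text = $("body").text();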
// Clean whitespace
text = text.replace(/\s+/g, " ").trim();
Why Not Just Use body.text()?
Raw body text includes navigation menus, footer links, cookie banners, and ad text. For RAG, you want ONLY the main content.
The Priority Order
- <article>: most semantic, usually contains the main content
- [role="main"]: ARIA landmark
- <main>: HTML5 semantic element
- .content, .post-content: common CSS classes
- <body>: fallback
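One caveat: a comma-separated selector followed by .first() returns whichever matching element comes first in the document, which is not always the highest-priority one (a <main> that wraps an <article> would win). Below is a minimal sketch of strict priority ordering. The helper name pickMainText and the PRIORITY_SELECTORS constant are hypothetical, and the 100-character threshold is simply borrowed from the snippet above; `$` can be any Cheerio-style function.

```javascript
// Hypothetical helper: try each selector in priority order and return the
// first one whose text is long enough to look like real content.
const PRIORITY_SELECTORS = ["article", "[role=main]", "main", ".content, .post-content"];

function pickMainText($, minLength = 100) {
  for (const selector of PRIORITY_SELECTORS) {
    // Collapse whitespace before measuring, so padding does not count as content.
    const text = $(selector).first().text().replace(/\s+/g, " ").trim();
    if (text.length >= minLength) return { selector, text };
  }
  // Nothing matched, or every match was too short: fall back to <body>.
  const text = $("body").text().replace(/\s+/g, " ").trim();
  return { selector: "body", text };
}
```

Returning the selector alongside the text makes it easy to log which rule fired for each page, which helps when tuning the list for a particular site.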
Output
{
  "url": "https://example.com/blog/post",
  "title": "The Blog Post Title",
  "text": "Clean extracted text...",
  "wordCount": 1450,
  "characterCount": 8700
}
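A record with this shape could be assembled along these lines. This is only a sketch: buildRecord is a hypothetical helper, and counting words by splitting on single spaces is an assumption that holds only because whitespace has already been collapsed.

```javascript
// Hypothetical helper that builds a record with the fields shown above.
function buildRecord(url, title, rawText) {
  // Same whitespace cleanup as in the extraction snippet.
  const text = rawText.replace(/\s+/g, " ").trim();
  // After collapsing whitespace, words are separated by exactly one space.
  const wordCount = text === "" ? 0 : text.split(" ").length;
  return {
    url,
    title: title.trim(),
    text,
    wordCount,
    characterCount: text.length,
  };
}

const record = buildRecord(
  "https://example.com/blog/post",
  "The Blog Post Title",
  "  Clean   extracted\n text  "
);
console.log(JSON.stringify(record, null, 2));
```

Storing wordCount and characterCount with each record makes it cheap to filter out near-empty pages before they reach the embedding step.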
I built a Text Extractor tool on Apify that does this at scale — batch process multiple URLs, auto-remove navigation, configurable max length.
Free on Apify Store: search for knotless_cadence text-extractor.