Extract Clean Text from Any Webpage for RAG Pipelines

#ai #rag #javascript #webdev

Building RAG (Retrieval-Augmented Generation) systems? You need clean text, not raw HTML.

Here's a simple approach using CheerioCrawler, where `$` is the Cheerio handle for the fetched page:

// Remove noise
$("script, style, nav, footer, header, aside, .ad, noscript").remove();

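// Get main content: try each selector in priority order.
// (One comma-separated selector would return whichever match comes first in the document.)
const candidates = ["article", "[role=main]", "main", ".content, .post-content"];
let text = "";
for (const selector of candidates) {
  text = $(selector).first().text();
  if (text.length >= 100) break;
}
if (text.length < 100) text = $("body").text();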

// Clean whitespace
text = text.replace(/\s+/g, " ").trim();

Why Not Just Use `body.text()`?

Raw body text includes navigation menus, footer links, cookie banners, and ad text. For RAG, you want only the main content: every chunk of boilerplate you embed competes with real content at retrieval time.

The Priority Order

<article> tag — most semantic, usually contains the main content
[role="main"] — ARIA landmark
<main> — HTML5 semantic element
.content, .post-content — common CSS classes
<body> — fallback

Output

{
  "url": "https://example.com/blog/post",
  "title": "The Blog Post Title",
  "text": "Clean extracted text...",
  "wordCount": 1450,
  "characterCount": 8700
}
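A record shaped like the one above could be assembled roughly as follows. This is a sketch under assumptions: `buildRecord` and its `maxLength` parameter are hypothetical names, not the Text Extractor's real API, and words are counted by splitting the cleaned text on spaces.

```javascript
// Hypothetical helper: builds an output record from already-extracted text.
// `buildRecord` and `maxLength` are illustrative names, not part of any library.
function buildRecord(url, title, rawText, maxLength = Infinity) {
  // Collapse whitespace runs to single spaces, as in the cleaning step.
  let text = rawText.replace(/\s+/g, " ").trim();

  // Optional cap on length. Note this may cut the last word in half.
  if (text.length > maxLength) text = text.slice(0, maxLength).trim();

  return {
    url,
    title: title.trim(),
    text,
    // Whitespace is already normalized, so splitting on " " counts words.
    wordCount: text === "" ? 0 : text.split(" ").length,
    characterCount: text.length,
  };
}

const record = buildRecord(
  "https://example.com/blog/post",
  "The Blog Post Title",
  "  Clean   extracted\n text...  "
);
console.log(JSON.stringify(record, null, 2));
```

Computing `wordCount` and `characterCount` after cleaning keeps the numbers consistent with the `text` field you actually store and embed.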

I built a Text Extractor tool on Apify that does this at scale — batch process multiple URLs, auto-remove navigation, configurable max length.

Free on Apify Store: search knotless_cadence text-extractor.

---

*More tools:* 60+ free scrapers | Reports | MCP Servers

Need data scraped or market research done? I offer web scraping ($20), market research reports ($20), and custom automation ($50+), backed by 77 production scrapers. Email Spinov001@gmail.com
