DEV Community

Alex Spinov


Extract Clean Text from Any Webpage for RAG Pipelines

Building RAG (Retrieval-Augmented Generation) systems? You need clean text, not raw HTML.

Here's a simple approach using CheerioCrawler:

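// `$` is the Cheerio handle that CheerioCrawler passes to requestHandler.
// Remove noise
$("script, style, nav, footer, header, aside, .ad, noscript").remove();

// Get main content. A comma-separated selector matches in document order,
// so .first() returns whichever candidate appears earliest in the page.
let text = $("article, [role=main], main, .content, .post-content").first().text();
if (!text || text.length < 100) text = $("body").text();

// Clean whitespace
text = text.replace(/\s+/g, " ").trim();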

Why Not Just Use body.text()?

Raw body text includes navigation menus, footer links, cookie banners, and ad text. For RAG, you want ONLY the main content.

The Priority Order

  1. <article> tag — most semantic, usually contains the main content
  2. [role="main"] — ARIA landmark
  3. <main> — HTML5 semantic element
  4. .content, .post-content — common CSS classes
  5. <body> — fallback

Output

{
  "url": "https://example.com/blog/post",
  "title": "The Blog Post Title",
  "text": "Clean extracted text...",
  "wordCount": 1450,
  "characterCount": 8700
}
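
The post does not say how `wordCount` and `characterCount` are computed. A minimal sketch, assuming a word is any run of non-whitespace characters and that both counts are taken after whitespace cleanup, could look like this (`buildRecord` is a hypothetical helper, not part of any library):

```javascript
// Hypothetical helper: builds a record shaped like the sample output.
function buildRecord(url, title, rawText) {
  // Same whitespace cleanup as in the extraction snippet.
  const text = rawText.replace(/\s+/g, " ").trim();
  return {
    url,
    title: title.trim(),
    text,
    // Assumption: words are separated by single spaces after cleanup.
    wordCount: text === "" ? 0 : text.split(" ").length,
    characterCount: text.length,
  };
}

const record = buildRecord(
  "https://example.com/blog/post",
  "The Blog Post Title",
  "  Clean   extracted\n text...  "
);
console.log(JSON.stringify(record, null, 2));
```

In a crawler, `url` would typically come from the request and `title` from something like `$("title").text()`; both are assumptions here, since the post only shows the output shape.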

I built a Text Extractor tool on Apify that does this at scale — batch process multiple URLs, auto-remove navigation, configurable max length.

Free on Apify Store: search knotless_cadence text-extractor.


