Oaida Adrian

Originally published at apify.com

I Built an RSS Aggregator That Extracts Full Article Content (Not Just Summaries)

Most RSS feed readers give you a 200-character summary and force you to click through to read the full article. That's useless if you're building news monitoring pipelines, AI training datasets, or content curation tools.

So I built a proper RSS Feed Aggregator that follows each article link and extracts the complete full-text content — clean, structured, and ready to use.

What It Does

  • Multi-feed ingestion — Point it at multiple RSS/Atom feeds simultaneously
  • Full-text extraction — Uses trafilatura to extract the actual article content, stripping boilerplate, ads, and navigation
  • Deduplication — Automatically detects and removes duplicate articles across feeds
  • Rich metadata — Word counts, authorship, publish dates, images, source tracking
  • Keyword filtering — Include/exclude articles by keywords
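
The ingestion and deduplication steps above can be sketched with only the Python standard library. This is a minimal illustration under stated assumptions, not the actor's actual code: the helper names (`parse_rss_items`, `normalize_url`, `deduplicate`) are hypothetical, it reads RSS 2.0 `<item>` elements only (no Atom), and it assumes two articles are duplicates when their links match after dropping the query string, fragment, and trailing slash.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit


def parse_rss_items(xml_text, source_feed):
    """Return one dict per <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "sourceUrl": (item.findtext("link") or "").strip(),
            "sourceFeed": source_feed,
        })
    return items


def normalize_url(url):
    """Lowercase scheme and host; drop query, fragment, trailing slash (assumed rule)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def deduplicate(articles):
    """Keep the first article seen for each normalized URL."""
    seen = set()
    unique = []
    for article in articles:
        key = normalize_url(article["sourceUrl"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


# Two small in-memory feeds that share one article under different link forms.
FEED_A = """<rss version="2.0"><channel>
<item><title>Post one</title><link>https://example.com/post-one/</link></item>
<item><title>Post two</title><link>https://example.com/post-two</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel>
<item><title>Post one (syndicated)</title><link>https://EXAMPLE.com/post-one?utm_source=rss</link></item>
</channel></rss>"""

merged = parse_rss_items(FEED_A, "feed-a") + parse_rss_items(FEED_B, "feed-b")
unique_articles = deduplicate(merged)
```

URL normalization only catches the same link syndicated with different tracking parameters. Catching the same story republished under a different URL would need a content-based key such as a hash of the normalized title or body.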

Example Output

Each article comes back as structured JSON:

{
  "title": "The only AI glossary you'll need this year",
  "fullContent": "...3,727 words of clean extracted text...",
  "author": "Kyle Wiggers",
  "publishedDate": "2026-07-04T10:00:00Z",
  "wordCount": 3727,
  "imageUrl": "https://...",
  "sourceFeed": "https://techcrunch.com/feed/",
  "sourceUrl": "https://techcrunch.com/2026/07/04/..."
}
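
For readers consuming this output downstream, a record like the one above can be loaded into a typed object. This is a minimal sketch that assumes the field names shown in the example. The `Article` class and `parse_article` helper are hypothetical names for illustration, and the sample values are shortened placeholders rather than real output.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Article:
    title: str
    full_content: str
    author: str
    published: datetime
    word_count: int
    image_url: str
    source_feed: str
    source_url: str


def parse_article(raw_json):
    """Convert one JSON record (field names as in the example) into an Article."""
    data = json.loads(raw_json)
    # Swap the trailing "Z" for an explicit offset so this also works before Python 3.11.
    published = datetime.fromisoformat(data["publishedDate"].replace("Z", "+00:00"))
    return Article(
        title=data["title"],
        full_content=data["fullContent"],
        author=data["author"],
        published=published,
        word_count=data["wordCount"],
        image_url=data["imageUrl"],
        source_feed=data["sourceFeed"],
        source_url=data["sourceUrl"],
    )


# Placeholder record with the same keys as the example output.
sample = json.dumps({
    "title": "Example title",
    "fullContent": "one two three",
    "author": "Jane Doe",
    "publishedDate": "2026-07-04T10:00:00Z",
    "wordCount": 3,
    "imageUrl": "https://example.com/image.png",
    "sourceFeed": "https://example.com/feed/",
    "sourceUrl": "https://example.com/post",
})
article = parse_article(sample)
```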

Real-World Use Cases

  1. AI/LLM Training Data — Need clean text without HTML boilerplate? This outputs publication-ready content.
  2. News Monitoring — Aggregate dozens of feeds and get full articles, not snippets.
  3. Content Curation — Pull from multiple sources, deduplicate, filter by keywords.
  4. Research Pipelines — Collect articles on specific topics for analysis.

Try It

The tool is live on the Apify Store: RSS Feed Aggregator & Article Extractor

It uses pay-per-event pricing at $0.01 per article extracted. If you're on Apify's free tier ($5/mo credits), that covers ~500 articles — enough for a solid test run.

Input Parameters

Parameter | Default | Description
feedUrls | required | RSS/Atom feed URLs
maxResults | 50 | Maximum articles to extract
extractContent | true | Follow links and extract full text
deduplicate | true | Remove duplicate articles
keywordFilter | [] | Include/exclude keywords
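
The parameter list above does not spell out the shape of `keywordFilter`, so the sketch below assumes separate include and exclude lists with case-insensitive substring matching. `matches_keywords` is a hypothetical helper meant to illustrate the idea, not the actor's implementation.

```python
def matches_keywords(text, include=(), exclude=()):
    """Return True if text passes the include/exclude keyword rules.

    Assumed semantics: any exclude keyword rejects the article; if include
    keywords are given, at least one must appear. Matching is a
    case-insensitive substring test.
    """
    lowered = text.lower()
    if any(word.lower() in lowered for word in exclude):
        return False
    if include and not any(word.lower() in lowered for word in include):
        return False
    return True


articles = [
    {"title": "New AI model released"},
    {"title": "AI startup sponsored post"},
    {"title": "Weekend gardening tips"},
]
kept = [
    a for a in articles
    if matches_keywords(a["title"], include=["ai"], exclude=["sponsored"])
]
```

Substring matching is simple but loose: the keyword "ai" also matches "email". A word-boundary regular expression would be stricter if that matters for your feeds.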

How Full-Text Extraction Works

The actor uses trafilatura, a Python library specifically designed for web text extraction. Unlike basic regex or BeautifulSoup approaches, trafilatura:

  • Strips navigation, sidebars, footers, and ads
  • Preserves article structure (paragraphs, headings)
  • Works on the HTML it is given; it does not execute JavaScript, so client-rendered pages need a headless browser to render them first
  • Works across 20+ languages

This means you get the actual article text — not the RSS description, not a truncated summary, but the full content as the author wrote it.
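
To make the stripping and structure-preserving idea concrete, here is a toy extractor built on Python's standard `html.parser`. It is an illustration only and not trafilatura's algorithm, which is considerably more sophisticated. The `ToyArticleExtractor` class and its tag lists are assumptions chosen for this example.

```python
from html.parser import HTMLParser

# Containers treated as boilerplate, and the block tags whose text is kept.
SKIP_TAGS = {"nav", "aside", "footer", "header", "script", "style"}
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}


class ToyArticleExtractor(HTMLParser):
    """Collect text from paragraphs and headings outside boilerplate containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0    # > 0 while inside a SKIP_TAGS element
        self.in_block = False  # True while inside a paragraph or heading
        self.current = []      # text fragments of the block being read
        self.blocks = []       # finished paragraphs and headings

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.in_block = True
            self.current = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.in_block:
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self.current).split())
            if text:
                self.blocks.append(text)
            self.in_block = False

    def handle_data(self, data):
        if self.in_block and self.skip_depth == 0:
            self.current.append(data)


def extract_text(html):
    """Return kept blocks separated by blank lines."""
    parser = ToyArticleExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


PAGE = """<html><body>
<nav><p>Home | About</p></nav>
<article><h1>Title</h1><p>First <b>bold</b> paragraph.</p></article>
<footer><p>Copyright</p></footer>
</body></html>"""

article_text = extract_text(PAGE)
```

In practice the same step is a single call to `trafilatura.extract(html)`, which returns the main text of a page, or `None` when it finds nothing usable.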


If you're working with RSS feeds or news data, give it a try. Happy to add features based on feedback — what would make this useful for your use case?
