I Built an RSS Aggregator That Extracts Full Article Content (Not Just Summaries)

#webdev #python #rss #automation

Many RSS feeds carry only a short summary of each article, so you have to click through to read the rest. That's not much use if you're building news monitoring pipelines, AI training datasets, or content curation tools.

So I built a proper RSS Feed Aggregator that follows each article link and extracts the complete full-text content — clean, structured, and ready to use.

What It Does

Multi-feed ingestion — Point it at multiple RSS/Atom feeds simultaneously
Full-text extraction — Uses trafilatura to extract the actual article content, stripping boilerplate, ads, and navigation
Deduplication — Automatically detects and removes duplicate articles across feeds
Rich metadata — Word counts, authorship, publish dates, images, source tracking
Keyword filtering — Include/exclude articles by keywords
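
To give a rough idea of the deduplication and keyword filtering steps, here is a minimal sketch. It is not the actor's actual code: it assumes duplicates are detected by normalized URL and that keywords are matched as case-insensitive substrings. The function names `normalize_url`, `deduplicate`, and `matches_keywords` are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop utm_* params, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query) if not k.lower().startswith("utm_")]
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"), query, "")
    )


def deduplicate(articles):
    """Keep the first article seen for each normalized sourceUrl (assumed dedup key)."""
    seen = set()
    unique = []
    for article in articles:
        key = normalize_url(article["sourceUrl"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(article)
    return unique


def matches_keywords(article, include=(), exclude=()):
    """Case-insensitive substring match against title and full text.

    Substring matching is crude ("ai" also matches "said"); a real filter
    would probably match on word boundaries.
    """
    text = f'{article.get("title", "")} {article.get("fullContent", "")}'.lower()
    if any(word.lower() in text for word in exclude):
        return False
    return not include or any(word.lower() in text for word in include)
```

Keying on the URL is the cheapest option. Catching the same story syndicated under different URLs would need something like a title or content hash on top.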

Example Output

Each article comes back as structured JSON:

```json
{
  "title": "The only AI glossary you'll need this year",
  "fullContent": "...3,727 words of clean extracted text...",
  "author": "Kyle Wiggers",
  "publishedDate": "2026-07-04T10:00:00Z",
  "wordCount": 3727,
  "imageUrl": "https://...",
  "sourceFeed": "https://techcrunch.com/feed/",
  "sourceUrl": "https://techcrunch.com/2026/07/04/..."
}
```
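
If you consume these records in Python, something like the following could turn them into typed objects. It is only a sketch: `Article` and `from_record` are illustrative names, and it assumes `title`, `publishedDate`, and `sourceUrl` are always present while the other fields may be missing.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Article:
    title: str
    full_content: str
    author: str
    published: datetime
    word_count: int
    image_url: str
    source_feed: str
    source_url: str

    @classmethod
    def from_record(cls, record):
        """Build an Article from one JSON record shaped like the example above."""
        # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
        # so rewrite it as an explicit UTC offset first.
        published = datetime.fromisoformat(
            record["publishedDate"].replace("Z", "+00:00")
        )
        return cls(
            title=record["title"],
            full_content=record.get("fullContent", ""),
            author=record.get("author", ""),
            published=published,
            word_count=record.get("wordCount", 0),
            image_url=record.get("imageUrl", ""),
            source_feed=record.get("sourceFeed", ""),
            source_url=record["sourceUrl"],
        )
```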

Real-World Use Cases

AI/LLM Training Data — Need clean text without HTML boilerplate? This outputs publication-ready content.
News Monitoring — Aggregate dozens of feeds and get full articles, not snippets.
Content Curation — Pull from multiple sources, deduplicate, filter by keywords.
Research Pipelines — Collect articles on specific topics for analysis.

Try It

The tool is live on the Apify Store: RSS Feed Aggregator & Article Extractor

It uses pay-per-event pricing at $0.01 per article extracted. If you're on Apify's free tier ($5/mo credits), that covers ~500 articles ($5 ÷ $0.01), enough for a solid test run.

Input Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `feedUrls` | required | RSS/Atom feed URLs |
| `maxResults` | 50 | Maximum articles to extract |
| `extractContent` | true | Follow links and extract full text |
| `deduplicate` | true | Remove duplicate articles |
| `keywordFilter` | [] | Include/exclude keywords |

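As a rough illustration, the input for a run could be assembled like this. The helper `build_run_input` is a hypothetical name; the keys and defaults come from the table above, and treating `keywordFilter` as a plain list of strings is an assumption, since the exact shape isn't spelled out here.

```python
import json


def build_run_input(feed_urls, max_results=50, extract_content=True,
                    deduplicate=True, keyword_filter=None):
    """Return an input dict using the parameter names and defaults listed above."""
    if not feed_urls:
        raise ValueError("feedUrls is required")
    return {
        "feedUrls": list(feed_urls),
        "maxResults": max_results,
        "extractContent": extract_content,
        "deduplicate": deduplicate,
        # Assumed shape: a flat list of keyword strings.
        "keywordFilter": list(keyword_filter or []),
    }


run_input = build_run_input(["https://techcrunch.com/feed/"], max_results=10)
print(json.dumps(run_input, indent=2))
```

The resulting JSON can be pasted into the actor's input editor. If you use Apify's Python client, a dict like this would typically be passed as the run input.
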
How Full-Text Extraction Works

The actor uses trafilatura, a Python library specifically designed for web text extraction. Unlike basic regex or BeautifulSoup approaches, trafilatura:

Strips navigation, sidebars, footers, and ads
Preserves article structure (paragraphs, headings)
Works on the HTML it is given (it does not execute JavaScript itself, so client-rendered pages need to be rendered before extraction)
Works across 20+ languages

This means you get the actual article text — not the RSS description, not a truncated summary, but the full content as the author wrote it.
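
For reference, the core trafilatura calls look roughly like this. It's a minimal sketch rather than the actor's code: `extract_full_text` and `word_count` are illustrative names, the URL is a placeholder, and counting whitespace-separated tokens is only an assumed stand-in for how `wordCount` is computed.

```python
from typing import Optional


def extract_full_text(url: str) -> Optional[str]:
    """Download a page and return its main text, or None if nothing was extracted."""
    # Third-party dependency (pip install trafilatura), imported here so the
    # word_count helper below stays usable without it.
    import trafilatura

    downloaded = trafilatura.fetch_url(url)  # None if the download failed
    if downloaded is None:
        return None
    # Leave out reader comments so only the article body comes back.
    return trafilatura.extract(downloaded, include_comments=False)


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (assumed approximation of wordCount)."""
    return len(text.split())


# Example usage (needs network access and trafilatura installed):
# text = extract_full_text("https://example.com/some-article")
# if text:
#     print(word_count(text), "words")
```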

If you're working with RSS feeds or news data, give it a try. Happy to add features based on feedback — what would make this useful for your use case?