Most RSS feed readers give you a 200-character summary and force you to click through to read the full article. That's useless if you're building news monitoring pipelines, AI training datasets, or content curation tools.
So I built a proper RSS Feed Aggregator that follows each article link and extracts the complete full-text content — clean, structured, and ready to use.
What It Does
- Multi-feed ingestion — Point it at multiple RSS/Atom feeds simultaneously
- Full-text extraction — Uses trafilatura to extract the actual article content, stripping boilerplate, ads, and navigation
- Deduplication — Automatically detects and removes duplicate articles across feeds
- Rich metadata — Word counts, authorship, publish dates, images, source tracking
- Keyword filtering — Include/exclude articles by keywords
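The post doesn't say how deduplication or keyword filtering are implemented, so the following is only a minimal sketch of one common approach: treat two articles as duplicates when their normalized URLs match, and filter on case-insensitive substring matches. The helper names (`normalize_url`, `deduplicate`, `matches_keywords`) and the choice to drop query strings are assumptions for illustration, not the actor's actual logic.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host; drop query string, fragment, and trailing slash.

    Assumption: tracking parameters live in the query string. Sites that put
    the article ID in the query (e.g. ?p=123) would need a gentler rule.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def deduplicate(articles: list) -> list:
    """Keep the first article seen for each normalized sourceUrl."""
    seen = set()
    unique = []
    for article in articles:
        key = normalize_url(article["sourceUrl"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


def matches_keywords(article: dict, include=(), exclude=()) -> bool:
    """True if no exclude keyword appears and (if given) some include keyword does."""
    text = f"{article.get('title', '')} {article.get('fullContent', '')}".lower()
    if any(word.lower() in text for word in exclude):
        return False
    return not include or any(word.lower() in text for word in include)
```

Substring matching is deliberately crude here: a keyword like "ai" will also match "email", so a real filter would probably want word-boundary matching.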
Example Output
Each article comes back as structured JSON:
{
  "title": "The only AI glossary you'll need this year",
  "fullContent": "...3,727 words of clean extracted text...",
  "author": "Kyle Wiggers",
  "publishedDate": "2026-07-04T10:00:00Z",
  "wordCount": 3727,
  "imageUrl": "https://...",
  "sourceFeed": "https://techcrunch.com/feed/",
  "sourceUrl": "https://techcrunch.com/2026/07/04/..."
}
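To show how a record shaped like this might be consumed downstream, here is a small sketch that turns one article into a JSONL line for a text dataset. The field names come from the sample above; the `to_training_line` helper, the `min_words` threshold, and the output layout are hypothetical choices, not something the tool prescribes.

```python
import json
from typing import Optional


def to_training_line(article: dict, min_words: int = 200) -> Optional[str]:
    """Return one JSONL line with text plus provenance, or None for thin articles.

    min_words is an arbitrary illustrative cutoff for skipping stubs.
    """
    if article.get("wordCount", 0) < min_words or not article.get("fullContent"):
        return None
    record = {
        "text": article["fullContent"],
        "meta": {
            "title": article.get("title"),
            "author": article.get("author"),
            "publishedDate": article.get("publishedDate"),
            "sourceUrl": article.get("sourceUrl"),
            "sourceFeed": article.get("sourceFeed"),
        },
    }
    # ensure_ascii=False keeps non-English text readable in the output file.
    return json.dumps(record, ensure_ascii=False)
```

Keeping `sourceUrl` and `sourceFeed` next to the text makes it possible to trace any sample back to where it came from later.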
Real-World Use Cases
- AI/LLM Training Data — Need clean text without HTML boilerplate? This outputs publication-ready content.
- News Monitoring — Aggregate dozens of feeds and get full articles, not snippets.
- Content Curation — Pull from multiple sources, deduplicate, filter by keywords.
- Research Pipelines — Collect articles on specific topics for analysis.
Try It
The tool is live on the Apify Store: RSS Feed Aggregator & Article Extractor
It uses pay-per-event pricing at $0.01 per article extracted. If you're on Apify's free tier ($5/mo credits), that covers ~500 articles — enough for a solid test run.
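The arithmetic behind that estimate, as a quick sketch. It uses the two figures quoted above and assumes nothing else is billed against the credits; the function names are just for illustration.

```python
PRICE_PER_ARTICLE_CENTS = 1    # $0.01 per article extracted
FREE_TIER_CREDITS_CENTS = 500  # $5 of monthly credits


def articles_covered(credit_cents: int, price_cents: int = PRICE_PER_ARTICLE_CENTS) -> int:
    """How many extracted articles a given credit balance pays for."""
    return credit_cents // price_cents


def run_cost_dollars(article_count: int, price_cents: int = PRICE_PER_ARTICLE_CENTS) -> float:
    """Cost in dollars of extracting article_count articles."""
    return article_count * price_cents / 100


print(articles_covered(FREE_TIER_CREDITS_CENTS))  # 500
print(run_cost_dollars(50))                       # 0.5
```

Working in whole cents avoids floating-point rounding on values like 0.01.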
Input Parameters
| Parameter | Default | Description |
|---|---|---|
| feedUrls | required | RSS/Atom feed URLs |
| maxResults | 50 | Maximum articles to extract |
| extractContent | true | Follow links and extract full text |
| deduplicate | true | Remove duplicate articles |
| keywordFilter | [] | Include/exclude keywords |
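As a rough sketch of how these parameters might be passed from Python using the `apify-client` package: the actor ID below is a placeholder (the post doesn't give it), and the exact shape of `feedUrls` and `keywordFilter` is an assumption (plain lists of strings), so check the actor's input schema before relying on this.

```python
import os

# Hypothetical placeholder: replace with the real ID from the Apify Store page.
ACTOR_ID = "<username>/<actor-name>"


def build_run_input(feed_urls, max_results=50, keywords=None) -> dict:
    """Assemble the actor input using the parameter names from the table."""
    if not feed_urls:
        raise ValueError("feedUrls is required")
    return {
        "feedUrls": list(feed_urls),              # assumed: list of URL strings
        "maxResults": max_results,
        "extractContent": True,
        "deduplicate": True,
        "keywordFilter": list(keywords or []),    # assumed: list of strings
    }


def run_actor(run_input: dict) -> list:
    """Start the actor, wait for it to finish, and return the dataset items."""
    from apify_client import ApifyClient  # third-party: pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not complete")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Usage would look like `run_actor(build_run_input(["https://techcrunch.com/feed/"], max_results=10))`, with `APIFY_TOKEN` set in the environment.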
How Full-Text Extraction Works
The actor uses trafilatura, a Python library specifically designed for web text extraction. Unlike basic regex or BeautifulSoup approaches, trafilatura:
- Strips navigation, sidebars, footers, and ads
- Preserves article structure (paragraphs, headings)
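- Works on the fetched HTML (trafilatura itself does not execute JavaScript, so pages rendered entirely client-side need a rendering step first)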
- Works across 20+ languages
This means you get the actual article text — not the RSS description, not a truncated summary, but the full content as the author wrote it.
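For a sense of what the extraction step looks like in code, here is a minimal sketch using trafilatura's `fetch_url` and `extract` functions. It is not the actor's actual implementation; the `extract_full_text` and `count_words` helpers and the returned dictionary shape are assumptions modeled on the example output above.

```python
from typing import Optional


def count_words(text: str) -> int:
    """Whitespace-delimited word count (a simple approximation)."""
    return len(text.split())


def extract_full_text(url: str) -> Optional[dict]:
    """Download a page and return its main text, or None if nothing usable was found."""
    import trafilatura  # third-party: pip install trafilatura

    downloaded = trafilatura.fetch_url(url)  # returns None on download failure
    if downloaded is None:
        return None
    # extract() returns the main article text, or None if extraction fails.
    text = trafilatura.extract(downloaded, include_comments=False)
    if not text:
        return None
    return {
        "sourceUrl": url,
        "fullContent": text,
        "wordCount": count_words(text),
    }
```

Returning `None` instead of raising lets a pipeline skip paywalled or broken pages and carry on with the rest of the feed.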
If you're working with RSS feeds or news data, give it a try. Happy to add features based on feedback — what would make this useful for your use case?