NexGenData

Posted on Jul 2 • Originally published at thenextgennexus.com

Building a News and Content Aggregator with AI: From Scraping to Smart Curation

#ai #webscraping #api #python

Why Build a Custom News Aggregator?

Google Alerts and Feedly are fine for casual monitoring, but they fail when you need comprehensive, real-time coverage of niche topics. Commercial solutions like Meltwater or Cision charge $500-2000+/month for what you can build yourself with Python and a few APIs.

This guide walks you through building a production-grade content aggregation system that scrapes news sites, blogs, forums, and social platforms — then uses AI to deduplicate, summarize, and surface the most relevant content automatically.

Architecture Overview

The system has four stages: collection, processing, intelligence, and delivery. Each stage runs independently, connected through a message queue for reliability.
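
As a rough illustration of how the stages can stay decoupled, here is a minimal sketch that uses Python's in-process `queue.Queue` as a stand-in for a real message broker. The `run_stage`, `process`, and `score` names are hypothetical and only show the hand-off pattern, not a production setup.

```python
import queue

def run_stage(inbox, outbox, handler):
    """Drain inbox, apply handler to each message, forward non-None results."""
    while True:
        try:
            message = inbox.get_nowait()
        except queue.Empty:
            return
        result = handler(message)
        if result is not None:
            outbox.put(result)

# One queue sits between each pair of stages (assumption: a real deployment
# would replace these with a durable broker).
collected, processed, scored = queue.Queue(), queue.Queue(), queue.Queue()
delivered = []

# Stage 1 (collection) would push scraped items; here they are hard-coded.
collected.put({"title": "  Example story  ", "text": "Body text"})
collected.put({"title": "", "text": "No title, dropped during processing"})

def process(item):
    """Stage 2 placeholder: tidy the title, drop items without one."""
    title = item["title"].strip()
    return {**item, "title": title} if title else None

def score(item):
    """Stage 3 placeholder: attach a constant score."""
    return {**item, "score": 1.0}

run_stage(collected, processed, process)
run_stage(processed, scored, score)

# Stage 4 (delivery) simply collects whatever survived.
while not scored.empty():
    delivered.append(scored.get())
```

Because each stage only reads from one queue and writes to the next, a failing stage can be restarted without touching the others.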

Stage 1: Multi-Source Content Collection

The foundation is a fleet of specialized scrapers, each optimized for a different source type. RSS feeds are the easiest starting point — most news sites still publish them, and they're structured data you don't need to parse from HTML.
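
To make the RSS starting point concrete, here is a minimal sketch that parses an RSS 2.0 document with the standard library only. The feed content, the `parse_rss` helper, and the output field names are illustrative assumptions; fetching the feed over HTTP is left out.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical feed used only for illustration.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 09:30:00 GMT</pubDate>
      <description>Short teaser text.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

def parse_rss(xml_text, source):
    """Turn RSS 2.0 XML into a list of plain article dicts."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        articles.append({
            "source": source,
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            # RSS dates use the RFC 822 format; a missing date stays None.
            "published": parsedate_to_datetime(pub_date) if pub_date else None,
            "text": (item.findtext("description") or "").strip(),
        })
    return articles

articles = parse_rss(SAMPLE_RSS, source="example")
```

Atom feeds use different element names and XML namespaces, so they would need their own parser or a dedicated feed library.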

For sites without RSS feeds, you need web scrapers. The key is building source-specific extractors that handle each site's DOM structure, JavaScript rendering requirements, and pagination patterns. Using Apify actors simplifies this significantly — pre-built scrapers handle the infrastructure while you focus on extraction logic.

Critical sources to include: mainstream news (Reuters, AP, Bloomberg), tech news (TechCrunch, The Verge, Ars Technica), community platforms (Hacker News, Reddit, Dev.to), and industry-specific blogs. Our Hacker News Scraper and Web Content Crawler handle two of the most valuable sources out of the box.

Stage 2: Content Processing Pipeline

Raw scraped content is messy. The processing pipeline handles: HTML stripping and text extraction, language detection, duplicate and near-duplicate detection (using MinHash/LSH for fuzzy matching), entity extraction (people, companies, products mentioned), and metadata normalization (dates, authors, sources).
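
Two of the steps above, HTML stripping and date normalization, can be sketched with the standard library. The `TextExtractor`, `strip_html`, and `normalize_date` names are hypothetical, and the date helper assumes ISO 8601 input; real sources will need more formats.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text while skipping script and style contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(html):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps words from adjacent blocks apart, at the
    # cost of splitting words that contain inline tags.
    return " ".join(" ".join(parser.parts).split())

def normalize_date(value):
    """Parse an ISO 8601 timestamp and convert it to UTC.

    Assumption: timestamps without an offset are already in UTC.
    """
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)

clean = strip_html("<p>Hello <b>world</b></p><script>var x = 1;</script>")
```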

Deduplication is the hardest problem. The same story gets published across dozens of outlets with slight rewrites. Simple URL matching only catches the same page collected twice. To catch syndicated copies and paraphrased versions published under different URLs, you need text similarity: cosine similarity on TF-IDF vectors for lexical overlap, or on sentence embeddings for heavier paraphrasing.


    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def find_duplicates(articles, threshold=0.85):
        """Find near-duplicate articles using TF-IDF cosine similarity.

        Returns (i, j, score) tuples with i < j for each pair above threshold.
        """
        if len(articles) < 2:
            return []  # nothing to compare

        # Title plus the first 500 characters of the body represents each article.
        texts = [a['title'] + ' ' + a['text'][:500] for a in articles]
        tfidf = TfidfVectorizer(max_features=10000, stop_words='english')
        matrix = tfidf.fit_transform(texts)
        similarities = cosine_similarity(matrix)

        # Look only above the diagonal so each pair is reported once and
        # no article is compared with itself.
        rows, cols = np.triu_indices(len(articles), k=1)
        mask = similarities[rows, cols] > threshold
        return [(int(i), int(j), float(similarities[i, j]))
                for i, j in zip(rows[mask], cols[mask])]
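
Pairwise matches still need to be merged before you can keep one article per story. One way to do that, sketched here with a small union-find, is to group indices that are linked directly or through a chain of matches. The `cluster_duplicates` name and the sample pairs are hypothetical; the input is assumed to be a list of `(i, j, score)` tuples.

```python
def cluster_duplicates(n_articles, duplicate_pairs):
    """Group article indices that are connected by duplicate pairs."""
    parent = list(range(n_articles))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, _score in duplicate_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # The lowest index becomes the representative of the cluster.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    clusters = {}
    for idx in range(n_articles):
        clusters.setdefault(find(idx), []).append(idx)
    return list(clusters.values())

# Articles 0, 2 and 3 cover the same story; 1 and 4 are unique.
pairs = [(0, 2, 0.91), (2, 3, 0.88)]
clusters = cluster_duplicates(5, pairs)
```

Keeping the first index of each cluster (or the one from the most authoritative source) then gives one article per story.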

Stage 3: AI-Powered Intelligence Layer

This is where AI transforms raw content into actionable intelligence. The intelligence layer handles three key tasks: summarization, topic classification, and relevance scoring.

For summarization, LLMs (GPT-4, Claude, or open-source models via Ollama) generate concise summaries of each article. The trick is writing prompts that ask for the key facts and restrict the model to what the article actually says, which reduces the risk of hallucinated details.
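
A prompt along these lines is one possible starting point. The wording, the `build_summary_prompt` helper, and the character limit are assumptions for illustration; the call to the LLM provider is deliberately left out because it depends on the API you choose.

```python
# Hypothetical prompt template; adjust the wording to your domain.
SUMMARY_PROMPT = (
    "Summarize the article below in at most {max_sentences} sentences.\n"
    "Use only facts stated in the article. If a detail is not in the text, "
    "leave it out rather than guessing.\n\n"
    "Title: {title}\n"
    "Article:\n{text}\n"
)

def build_summary_prompt(article, max_sentences=3, max_chars=6000):
    """Fill the template, truncating long articles to bound token usage."""
    return SUMMARY_PROMPT.format(
        max_sentences=max_sentences,
        title=article["title"],
        text=article["text"][:max_chars],
    )

prompt = build_summary_prompt(
    {"title": "Example story", "text": "The company announced a new product."}
)
```

Truncating by characters is crude; a token-aware limit would track the model's context window more closely.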

Topic classification uses a combination of keyword matching and LLM-based categorization. Start with a taxonomy of topics relevant to your domain, then train a classifier (or use zero-shot classification with sentence-transformers) to bucket each article.
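
The keyword-matching half can be sketched in a few lines. The taxonomy below and the `classify` helper are made-up examples, not a recommended topic list; an LLM or zero-shot model would handle the articles this misses.

```python
import re

# Hypothetical taxonomy: topic name mapped to trigger keywords or phrases.
TAXONOMY = {
    "ai": {"llm", "gpt", "machine learning", "neural"},
    "security": {"breach", "vulnerability", "ransomware"},
    "funding": {"series a", "raises", "acquisition"},
}

def classify(text, taxonomy=TAXONOMY, min_hits=1):
    """Return matching topics, ordered by keyword hits (most hits first)."""
    lowered = text.lower()
    scores = {}
    for topic, keywords in taxonomy.items():
        # Word boundaries stop 'gpt' from matching inside longer words.
        hits = sum(
            1 for kw in keywords
            if re.search(r"\b" + re.escape(kw) + r"\b", lowered)
        )
        if hits >= min_hits:
            scores[topic] = hits
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda topic: (-scores[topic], topic))

topics = classify("Startup raises Series A to build LLM tooling")
```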

Relevance scoring is the most valuable piece. A scoring model that considers recency, source authority, topic match, entity relevance, and engagement signals (social shares, comment count) produces a single relevance score for each article. This determines what surfaces in your daily digest.
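
One simple way to combine these signals is a weighted sum of values scaled to the range 0 to 1. The weights, the 24-hour half-life, and the field names below are assumptions to tune for your own data, not values from a tested model.

```python
import math
from datetime import datetime, timedelta, timezone

# Hypothetical weights; they sum to 1 so the final score stays in [0, 1].
WEIGHTS = {
    "recency": 0.30,
    "authority": 0.20,
    "topic": 0.25,
    "entity": 0.15,
    "engagement": 0.10,
}

def recency_score(published, now, half_life_hours=24.0):
    """Exponential decay: the score halves every half_life_hours."""
    age_hours = max((now - published).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / half_life_hours)

def engagement_score(shares, comments, saturation=500):
    """Log-scaled engagement, capped at 1.0 once it reaches saturation."""
    return min(math.log1p(shares + comments) / math.log1p(saturation), 1.0)

def relevance(article, now):
    """Combine the five signals into a single score between 0 and 1."""
    signals = {
        "recency": recency_score(article["published"], now),
        "authority": article["source_authority"],  # assumed 0..1
        "topic": article["topic_match"],           # assumed 0..1
        "entity": article["entity_match"],         # assumed 0..1
        "engagement": engagement_score(article["shares"], article["comments"]),
    }
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
example = {
    "published": now - timedelta(hours=24),
    "source_authority": 1.0,
    "topic_match": 1.0,
    "entity_match": 1.0,
    "shares": 500,
    "comments": 0,
}
score = relevance(example, now)
```

The log scaling keeps a single viral post from dominating the digest, and the half-life controls how quickly yesterday's stories fade.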

Stage 4: Delivery and Alerting

The final stage delivers curated content through multiple channels: daily email digests (top 10 articles, AI-summarized), Slack/Discord alerts for breaking news (relevance score above threshold), RSS feed output (for consumption by other tools), and a web dashboard for browsing the full archive.

The daily digest is the most popular output format. Structure it with sections: "Top Stories" (highest relevance score), "Trending Topics" (emerging themes), "Competitor Watch" (mentions of tracked companies), and "Deep Dives" (long-form content worth reading in full).
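
The four sections can be assembled from scored articles roughly as follows. The field names (`score`, `topics`, `entities`, `word_count`), the thresholds, and the use of topic counts as a proxy for trending themes are all assumptions made for this sketch.

```python
from collections import Counter

def build_digest(articles, tracked_companies, top_n=10, long_form_words=1500):
    """Group scored articles into the four digest sections."""
    ranked = sorted(articles, key=lambda a: a["score"], reverse=True)
    tracked = set(tracked_companies)
    # Crude proxy for emerging themes: the most frequent topics today.
    topic_counts = Counter(t for a in ranked for t in a.get("topics", []))
    return {
        "Top Stories": ranked[:top_n],
        "Trending Topics": [topic for topic, _ in topic_counts.most_common(3)],
        "Competitor Watch": [
            a for a in ranked if tracked & set(a.get("entities", []))
        ],
        "Deep Dives": [
            a for a in ranked if a.get("word_count", 0) >= long_form_words
        ],
    }

def render_digest(digest):
    """Render the digest as plain text suitable for an email body."""
    lines = []
    for section, items in digest.items():
        lines.append(section)
        for item in items:
            # Topics are plain strings; articles are dicts with title and URL.
            if isinstance(item, str):
                lines.append(f"  - {item}")
            else:
                lines.append(f"  - {item['title']} ({item['url']})")
        lines.append("")
    return "\n".join(lines).rstrip()

sample = [
    {"title": "Acme ships new model", "url": "https://example.com/1",
     "score": 0.9, "topics": ["ai"], "entities": ["Acme"], "word_count": 400},
    {"title": "A long history of search", "url": "https://example.com/2",
     "score": 0.6, "topics": ["ai", "search"], "entities": [],
     "word_count": 3200},
]
digest = build_digest(sample, tracked_companies=["Acme"])
body = render_digest(digest)
```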

Production Considerations

Running this in production requires attention to several details. Rate limiting is essential — don't hammer news sites with requests. Respect robots.txt and use reasonable delays between requests. Storage grows fast: plan for 10-50 GB/month of text data depending on how many sources you monitor. Use PostgreSQL with full-text search for efficient querying.
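
The politeness rules can be sketched with the standard library: `urllib.robotparser` for robots.txt checks and a per-host delay gate. The `PoliteFetchGate` class, the two-second default delay, and the `example-aggregator` user agent are hypothetical; the robots.txt text is passed in directly so the example runs without network access.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class PoliteFetchGate:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_hit = {}  # host -> time of the previous request

    def wait(self, url):
        """Sleep if the host was hit too recently; return seconds waited."""
        host = urlparse(url).netloc
        waited = 0.0
        last = self._last_hit.get(host)
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_hit[host] = self._clock()
        return waited

def allowed_by_robots(robots_txt, url, user_agent="example-aggregator"):
    """Check a URL against robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

ROBOTS = "User-agent: *\nDisallow: /private/\n"
can_fetch_news = allowed_by_robots(ROBOTS, "https://example.com/news/story")
can_fetch_private = allowed_by_robots(ROBOTS, "https://example.com/private/page")
```

In a scraper loop you would call `allowed_by_robots` first, then `gate.wait(url)` right before each request.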

Error handling matters more than you think. Sources go down, change their HTML structure, add CAPTCHAs, or block your IP. Build monitoring that alerts you when a source stops producing content, and design your scrapers to fail gracefully.
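
Both ideas fit in a few lines: wrap each scraper so one failure does not stop the run, and flag sources that have gone quiet. The function names and the 12-hour silence window are assumptions for this sketch; how the alert is sent is left open.

```python
from datetime import datetime, timedelta, timezone

def safe_scrape(source, scrape_fn, errors):
    """Run one scraper; on failure, record the error and return no items."""
    try:
        return scrape_fn()
    except Exception as exc:  # broad on purpose: one bad source must not halt the run
        errors.append((source, repr(exc)))
        return []

def find_stale_sources(last_seen, now, max_silence=timedelta(hours=12)):
    """Return sources whose newest item is older than max_silence."""
    return sorted(
        source for source, seen in last_seen.items()
        if now - seen > max_silence
    )

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "feed-a": now - timedelta(hours=1),
    "feed-b": now - timedelta(hours=30),  # silent for more than a day
}
stale = find_stale_sources(last_seen, now)
```

A source that normally publishes once a week needs a longer window than a wire service, so the threshold is best set per source.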

Cost Breakdown

Running a production news aggregator monitoring 100+ sources costs approximately: $30-50/month for scraping infrastructure (Apify, proxies), $20-40/month for LLM API calls (summarization, classification), $10-20/month for hosting (PostgreSQL, web server), and $5-10/month for email delivery. Total: $65-120/month, versus $500-2000+/month for commercial solutions.

Getting Started

Start small: pick 10 sources in your niche, set up RSS scraping, and build the deduplication pipeline. Add AI summarization once you have reliable data flowing. Scale sources gradually — each new source type (social media, forums, newsletters) requires its own extraction logic.

The nexgendata actor library on Apify provides pre-built scrapers for many common sources. Combine these with a Python processing pipeline and an LLM API for the intelligence layer, and you'll have a production aggregator running within a weekend.

Tools Referenced

Web Content Crawler — General-purpose article extraction
Hacker News Scraper — Tech community monitoring
AI Sentiment Analyzer — Content sentiment scoring
Full nexgendata toolkit — 50+ data collection actors

Want pre-built data packs instead of building scrapers? Check our data products on Gumroad for ready-to-use datasets updated weekly.

About the Author

The Next Gen Nexus covers AI agents, automation, and web data — practical guides for developers, analysts, and businesses working with data at scale.
