Hey Dev community,
Sharing a project: Best AI News Today — a real-time AI news aggregator that pulls from 30+ sources, clusters duplicate stories, and scores everything by quality. It's live at https://best-ai.news.
The Problem
If you're a developer working with AI (and who isn't these days), you probably check multiple sources daily — Hacker News, Reddit, tech blogs, lab announcements.
The same story appears across several platforms, and you often read two or three versions before realizing it's the same announcement.
The Solution
The aggregator fetches from RSS feeds, Reddit JSON API, and Medium RSS every 15 minutes.
Articles go through a pipeline:
– Fetch from 30+ sources (RSS, Reddit OAuth, Medium via rss2json proxy)
– Score for quality (0–100, based on source tier, recency, content depth, and engagement)
– Cluster using Union-Find on title similarity (Jaccard coefficient ≥ 0.5)
– Detect breaking news (3+ sources within 4 hours)
– Match entities using compiled regex patterns (65+ brands, 40+ models, 50+ people)
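The clustering step above can be sketched in a few lines. This is a minimal, framework-free illustration of Union-Find over pairwise Jaccard similarity of title tokens — the function names, the whitespace tokenizer, and the path-halving variant are my assumptions, not the actual codebase:

```typescript
// Tokenize a title into a set of lowercase words (illustrative tokenizer).
function tokens(title: string): Set<string> {
  return new Set(title.toLowerCase().split(/\W+/).filter(Boolean));
}

// Jaccard coefficient: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  let inter = 0;
  for (const t of a) if (b.has(t)) inter++;
  const union = a.size + b.size - inter;
  return union === 0 ? 0 : inter / union;
}

class UnionFind {
  private parent: number[];
  constructor(n: number) {
    this.parent = Array.from({ length: n }, (_, i) => i);
  }
  find(x: number): number {
    while (this.parent[x] !== x) {
      this.parent[x] = this.parent[this.parent[x]]; // path halving
      x = this.parent[x];
    }
    return x;
  }
  union(a: number, b: number): void {
    this.parent[this.find(a)] = this.find(b);
  }
}

// Merge every pair of titles whose similarity meets the threshold,
// then collect connected components as clusters of story indices.
function clusterStories(titles: string[], threshold = 0.5): number[][] {
  const sets = titles.map(tokens);
  const uf = new UnionFind(titles.length);
  for (let i = 0; i < titles.length; i++)
    for (let j = i + 1; j < titles.length; j++)
      if (jaccard(sets[i], sets[j]) >= threshold) uf.union(i, j);
  const groups = new Map<number, number[]>();
  titles.forEach((_, i) => {
    const root = uf.find(i);
    if (!groups.has(root)) groups.set(root, []);
    groups.get(root)!.push(i);
  });
  return [...groups.values()];
}
```

The nice property of Union-Find here is transitivity: if A matches B and B matches C, all three land in one cluster even when A and C alone fall below the threshold.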
Tech Stack
Frontend: React 19 + Vite + Tailwind 4 + Framer Motion + Three.js
Backend: Express 5 + TypeScript + SQLite (WAL mode)
Search: Multi-provider (Google News, Bing, Reddit, HN, Medium) with deduplication
SEO: Playwright-based dynamic rendering for bots + JSON-LD + llms.txt for AI crawlers
Infra: Single Hetzner VPS, PM2, Nginx
Some interesting technical decisions
Entity pattern matching on the client
~195 regex patterns are compiled once via useMemo, with results cached per story ID in a useRef(Map). ~10K regex tests complete in <1ms.
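Stripped of the React hooks, the compile-once-plus-cache pattern looks roughly like this. The entities, patterns, and function names below are illustrative examples, not the site's real list; in the app the pattern array would sit behind useMemo and the cache behind useRef:

```typescript
// One entity = a positive pattern plus an optional negative pattern
// that filters false positives.
interface Entity {
  id: string;
  pattern: RegExp;
  negative?: RegExp;
}

// Example entries only — the real app compiles ~195 of these.
const ENTITIES: Entity[] = [
  { id: "openai", pattern: /\bopenai\b/i },
  { id: "claude", pattern: /\bclaude\b/i, negative: /\bclaude monet\b/i },
  { id: "llama", pattern: /\bllama\s*\d/i },
];

// Results cached per story ID so each story is only scanned once.
const cache = new Map<string, string[]>();

function matchEntities(storyId: string, text: string): string[] {
  const hit = cache.get(storyId);
  if (hit) return hit;
  const matched = ENTITIES.filter(
    (e) => e.pattern.test(text) && !(e.negative && e.negative.test(text))
  ).map((e) => e.id);
  cache.set(storyId, matched);
  return matched;
}
```

Since regex compilation happens once at module load and each story is scanned at most once, repeated re-renders only pay for a Map lookup.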
In-memory caching with stale-while-revalidate
Feed data is stored as pre-stringified JSON in memory. This removes SQLite reads and JSON.parse() per request. Average response time: ~6ms.
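A minimal sketch of that idea, assuming a 60-second TTL and illustrative names (the production values and structure will differ): serve the cached pre-stringified body immediately, and if it has gone stale, refresh it in the background rather than making the request wait.

```typescript
interface Cached {
  body: string;       // pre-stringified JSON, sent to clients as-is
  fetchedAt: number;
  refreshing: boolean;
}

const TTL_MS = 60_000; // assumed TTL for illustration
const store = new Map<string, Cached>();

async function getFeed(
  key: string,
  load: () => Promise<unknown>
): Promise<string> {
  const entry = store.get(key);
  const now = Date.now();
  if (entry) {
    // Stale? Kick off one background refresh, but still serve the old body.
    if (now - entry.fetchedAt > TTL_MS && !entry.refreshing) {
      entry.refreshing = true;
      load()
        .then((data) =>
          store.set(key, {
            body: JSON.stringify(data),
            fetchedAt: Date.now(),
            refreshing: false,
          })
        )
        .catch(() => {
          entry.refreshing = false; // retry on a later request
        });
    }
    return entry.body; // hot path: no DB read, no serialization
  }
  // Cold miss: load once, then cache the serialized body.
  const data = await load();
  const body = JSON.stringify(data);
  store.set(key, { body, fetchedAt: now, refreshing: false });
  return body;
}
```

The key trade-off: after the TTL expires, one request still gets slightly stale data, which is an easy price for a news feed that refreshes every 15 minutes anyway.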
Prerender pre-warming
A Playwright instance renders 600+ pages before bots arrive. Strategy: batch 10 API warm calls in parallel, then render 3 pages at a time.
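The batching strategy is a plain bounded-concurrency loop. A sketch under assumed names (runInBatches, warmApi, renderPage are mine — only the batch sizes come from the actual setup):

```typescript
// Run async tasks in fixed-size parallel batches: start a batch,
// wait for all of it to settle, then start the next.
async function runInBatches<T>(
  tasks: (() => Promise<T>)[],
  batchSize: number
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < tasks.length; i += batchSize) {
    const batch = tasks.slice(i, i + batchSize).map((t) => t());
    results.push(...(await Promise.all(batch))); // whole batch before next
  }
  return results;
}

// Usage shape (warmApi / renderPage are hypothetical helpers):
// await runInBatches(urls.map((u) => () => warmApi(u)), 10);
// await runInBatches(pages.map((p) => () => renderPage(p)), 3);
```

Warming the API cache first (10 at a time) means that when Playwright renders 3 pages at a time afterwards, each page's data requests hit a warm cache instead of cold SQLite.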
llms.txt for AI bots
/llms.txt and /llms-full.txt provide structured markdown indexes for crawlers like ClaudeBot, GPTBot, and PerplexityBot. Similar concept to robots.txt, but adapted for LLMs.
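For anyone unfamiliar with the convention, an llms.txt file is just markdown: an H1 title, a blockquote summary, and link lists pointing crawlers at the important pages. The snippet below is an illustrative shape only — the section names and paths are invented, not the site's actual file:

```markdown
# Best AI News Today

> Real-time AI news aggregator: 30+ sources, deduplicated and quality-scored.

## Sections

- [Breaking](https://best-ai.news/breaking): stories confirmed by multiple sources
- [Entities](https://best-ai.news/entities): brand, model, and people pages
```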
prefers-reduced-motion
Accessibility support ensures animations are disabled when users request reduced motion. This includes CSS, Framer Motion, and smooth scrolling behavior.
Entity System
Each AI brand, model, product, hardware platform, institution, and key figure includes:
– Regex patterns for matching (+ negative patterns to reduce false positives)
– Dedicated page with matched stories and multi-provider background search
– JSON-LD structured data (Organization, Person, SoftwareApplication schemas)
– Dynamic SEO titles and descriptions
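Emitting the JSON-LD is straightforward once the entity data exists. A hedged sketch for the Organization case — field choices and names here are illustrative, and the real pages may emit more schema.org properties:

```typescript
interface EntityInfo {
  name: string;
  url: string;
  sameAs?: string[]; // e.g. official site, Wikipedia
}

// Build a schema.org Organization document as a JSON string,
// ready to drop into a <script type="application/ld+json"> tag.
function orgJsonLd(e: EntityInfo): string {
  return JSON.stringify({
    "@context": "https://schema.org",
    "@type": "Organization",
    name: e.name,
    url: e.url,
    ...(e.sameAs ? { sameAs: e.sameAs } : {}), // omit when absent
  });
}
```

The same shape works for the Person and SoftwareApplication schemas by swapping `@type` and the relevant properties.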
Try it
No signup, no paywall. If you work with AI, this will likely save time.
Feedback welcome
Curious how this architecture looks from a Dev perspective.
What would you have approached differently?