Aman Sachan

Posted on • Originally published at github.com

How I read 600 RSS feeds every morning in 3 minutes (pure Python, no framework)

Every morning I was opening 8-10 tabs: TOI, The Hindu, Indian Express, NDTV, Moneycontrol, Scroll, The Wire, FT, BBC. By the time I finished, it was 9 AM and I'd lost an hour to context-switching. So I built something to fix it.

What I built

A scheduled agent that polls ~600 RSS sources every morning, deduplicates cross-posted articles, categorizes them into 8 buckets, and emails me a clean brief before 8 AM IST.

The full skill is 293 lines of Python, all standard library (urllib, xml.etree, re, email.utils, concurrent.futures). No feedparser. No framework.

The architecture

17 Google News RSS queries + 30 direct publisher feeds
        |
        v
[fetch_all]  -- parallel via concurrent.futures
        |
        v
[dedupe]  -- 3-pass: normalize title, URL quality, Jaccard token overlap
        |
        v
[categorize]  -- 8 buckets, rule-based keyword match
        |
        v
[render_markdown]  -- top stories + categorized sections
        |
        v
[send_email]  -- Gmail SMTP, plain text + optional PDF
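
The first two stages of the diagram can be sketched roughly as follows. This is a minimal illustration using only the stdlib modules named above, not the code from the repo: `http_get`, `parse_rss`, the item dict layout, the `User-Agent` string and the `max_workers` value are all assumptions made for the example.

```python
import urllib.request
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from email.utils import parsedate_to_datetime


def http_get(url, timeout=10):
    """Download one feed and return its raw bytes (hypothetical helper)."""
    req = urllib.request.Request(url, headers={"User-Agent": "daily-brief/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def parse_rss(xml_bytes):
    """Extract title, link and publication date from each RSS <item>."""
    root = ET.fromstring(xml_bytes)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        raw_date = item.findtext("pubDate")
        try:
            published = parsedate_to_datetime(raw_date) if raw_date else None
        except (TypeError, ValueError):
            published = None  # unparseable date: keep the item anyway
        if title and link:
            items.append({"title": title, "link": link, "published": published})
    return items


def fetch_all(urls, fetch=http_get, max_workers=16):
    """Fetch feeds in parallel; a feed that fails is skipped, not fatal."""
    def fetch_one(url):
        try:
            return parse_rss(fetch(url))
        except Exception:
            return []

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch_one, urls)
        return [item for feed_items in results for item in feed_items]
```

Passing `fetch` as a parameter is a design choice for the sketch: it lets the parsing and fan-out logic be exercised with canned XML instead of live network calls.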

The interesting parts

Three-pass dedupe

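Google News reposts the same story from 5+ outlets. Single-pass title matching misses the tricky ones (reworded headlines, different source attribution). So dedupe runs three passes: normalize the title to strip the source suffix, score each URL so a direct publisher link outranks a Google News redirect, then compare title tokens with Jaccard overlap to catch rewordings.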

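import re

def normalize_title(t):
    # Pass 1: strip a trailing source suffix, lowercase, drop punctuation.
    # "Modi visits US - The Hindu" -> "modi visits us"
    t = re.sub(r'\s*-\s*[A-Z][a-zA-Z\s]+$', '', t)
    t = t.lower().strip()
    t = re.sub(r'[^\w\s]', '', t)
    return t

def url_quality(url):
    # Pass 2: score each link; Google News redirects score lowest.
    if 'news.google.com' in url: return 0.2
    if 'toi' in url or 'thehindu' in url: return 1.0
    return 0.5

def is_duplicate(a, b):
    # Pass 3: Jaccard overlap of title tokens.
    # jaccard() and tokenize() are helpers not shown in this excerpt.
    return jaccard(tokenize(a.title), tokenize(b.title)) > 0.65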
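
The snippet above calls `tokenize` and `jaccard` without showing them, and it doesn't show how the three passes fit together. Here is one way they might look. Treat it as a sketch under assumptions: the helper bodies, the dict-based item shape (`title` / `link` keys instead of the `.title` attribute above) and the `dedupe` signature are mine for illustration, and the normalizer and URL scorer are passed in as arguments so the block stands alone.

```python
import re


def tokenize(title):
    """Lowercase a title and return its set of word tokens."""
    return set(re.findall(r"\w+", title.lower()))


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def dedupe(items, normalize, quality, threshold=0.65):
    """Three-pass dedupe over dicts with 'title' and 'link' keys."""
    # Passes 1 and 2: group by normalized title and keep the
    # highest-scoring URL within each group.
    best = {}
    for item in items:
        key = normalize(item["title"])
        if key not in best or quality(item["link"]) > quality(best[key]["link"]):
            best[key] = item

    # Pass 3: drop near-duplicates whose token overlap exceeds the threshold.
    kept = []
    for item in best.values():
        tokens = tokenize(item["title"])
        if all(jaccard(tokens, tokenize(k["title"])) <= threshold for k in kept):
            kept.append(item)
    return kept
```

With the functions from the article, the call would be `dedupe(items, normalize_title, url_quality)`. Pass 3 is quadratic in the number of surviving items, which is fine at a few hundred stories but would need bucketing or hashing at a much larger scale.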

Rule-based categorization beats an LLM here

  • 8 buckets (Politics, Economy, World, Business, Tech, Defence, Sports, Science)
  • ~30 keywords per bucket
  • Matches in <1ms per item
  • No API cost, no rate limits
  • Predictable: "RBI rate decision" -> Economy, always
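
To make the keyword matching concrete, here is a small sketch of a rule-based categorizer. The keyword table is hypothetical (two of the eight buckets, a handful of words each), and both the scoring rule (most keyword hits wins) and the `World` fallback are assumptions for the example rather than the rules in the repo.

```python
import re

# Hypothetical keyword table: two of the eight buckets, a few keywords each.
KEYWORDS = {
    "Economy": ["rbi", "inflation", "gdp", "repo rate"],
    "Sports": ["cricket", "ipl", "olympics"],
}

# One compiled pattern per bucket; \b keeps "ipl" from matching "diplomat".
PATTERNS = {
    bucket: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    for bucket, words in KEYWORDS.items()
}


def categorize(title, default="World"):
    """Return the bucket with the most keyword hits, or `default` if none hit."""
    text = title.lower()
    scores = {bucket: len(p.findall(text)) for bucket, p in PATTERNS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

Because the patterns are compiled once at import time, each headline costs one regex scan per bucket, and the same input always lands in the same bucket.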

Today's actual brief (5 Jun 2026)

  • 48 stories from 600+ raw pulls (92% dedupe rate)
  • 7 top stories (FT, Al Jazeera, NDTV, India Today)
  • ~600 words, ~3 min read
  • Landed in inbox at 7:00 AM IST

Why this beats the alternatives

  • Apple News / Google Discover: too US-centric, weak India coverage
  • Twitter lists: noisy, no dedupe, no archival
  • Manual scanning: 1+ hour/day, miss stuff
  • Other RSS readers: they aggregate but don't dedupe or categorize

Stack

  • Python 3.10+, stdlib only
  • Runs as a scheduled agent on Zo Computer
  • Sends via Gmail SMTP
  • 293 LOC total, single file
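
For the Gmail SMTP step, a stdlib-only sketch might look like the block below. The function names and parameters are illustrative assumptions, not the repo's code, and it assumes Gmail's implicit-TLS endpoint (`smtp.gmail.com`, port 465) with an app password, which is the usual setup for scripted sending.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Build a plain-text email; no network access happens here."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send_email(msg, username, app_password, host="smtp.gmail.com", port=465):
    """Send over implicit TLS; credentials are supplied by the caller."""
    with smtplib.SMTP_SSL(host, port) as smtp:
        smtp.login(username, app_password)
        smtp.send_message(msg)
```

Splitting message construction from sending keeps the part that formats the brief testable without touching the network.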

GitHub: https://github.com/AmSach/india-daily

Happy to answer questions on the dedupe logic or categorization rules. If anyone wants a different regional feed set (US, EU, SEA), drop a comment.
