Every morning I was opening 8-10 tabs — TOI, The Hindu, Indian Express, NDTV, Moneycontrol, Scroll, The Wire, FT, BBC. By the time I finished, it was 9 AM and I'd lost an hour to context-switching. So I built something to fix it.
What I built
A scheduled agent that polls 47 RSS feeds every morning (17 Google News queries plus 30 direct publisher feeds, roughly 600 raw items in total), deduplicates cross-posted articles, categorizes them into 8 buckets, and emails me a clean brief before 8 AM IST.
The full skill is 293 lines of Python — pure stdlib (urllib, xml.etree, re, email.utils). No feedparser. No framework.
The architecture
17 Google News RSS queries + 30 direct publisher feeds
|
v
[fetch_all] -- parallel via concurrent.futures
|
v
[dedupe] -- 3-pass: normalize title, URL quality, jaccard token overlap
|
v
[categorize] -- 8 buckets, rule-based keyword match
|
v
[render_markdown] -- top stories + categorized sections
|
v
[send_email] -- Gmail SMTP, plain text + optional PDF
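To make the first stage concrete, here is a minimal sketch of a parallel fetch using only the stdlib. It is not the code from the repo: `parse_rss`, `fetch_feed`, the `max_workers` value, and the User-Agent string are all assumptions for illustration, and it only handles RSS 2.0 `<item>` elements (Atom feeds would need separate handling).

```python
import concurrent.futures
import urllib.request
import xml.etree.ElementTree as ET

def parse_rss(xml_text):
    """Return a list of {'title', 'link'} dicts from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter('item'):
        title = (item.findtext('title') or '').strip()
        link = (item.findtext('link') or '').strip()
        if title and link:  # skip items missing either field
            items.append({'title': title, 'link': link})
    return items

def fetch_feed(url, timeout=10):
    """Download one feed; a failing feed yields an empty list instead of raising."""
    try:
        # Hypothetical User-Agent; some publishers reject the urllib default.
        req = urllib.request.Request(url, headers={'User-Agent': 'india-daily-sketch'})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_rss(resp.read())
    except Exception:
        return []

def fetch_all(urls, fetch=fetch_feed, max_workers=16):
    """Fetch every feed in parallel threads and flatten the results in input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch, urls)
        return [item for items in results for item in items]
```

Threads suit this stage because the work is I/O-bound, and swallowing per-feed errors means one dead feed does not block the morning brief. The `fetch` parameter exists only so the function can be exercised without network access.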
The interesting parts
Three-pass dedupe
Google News reposts the same story from 5+ outlets. Single-pass title matching misses the tricky ones (reworded headlines, different source attribution).
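import re

def normalize_title(t):
    # Pass 1: strip the trailing source attribution, lowercase, drop punctuation.
    # "Modi visits US - The Hindu" -> "modi visits us"
    t = re.sub(r'\s*-\s*[A-Z][a-zA-Z\s]+$', '', t)
    t = t.lower().strip()
    t = re.sub(r'[^\w\s]', '', t)
    return t

def url_quality(url):
    # Pass 2: score the link. Google News redirects rank lowest,
    # known direct publisher URLs rank highest.
    if 'news.google.com' in url: return 0.2
    if 'toi' in url or 'thehindu' in url: return 1.0
    return 0.5

def is_duplicate(a, b):
    # Pass 3: catch reworded headlines via Jaccard overlap of title tokens.
    # jaccard() and tokenize() are not shown in this excerpt.
    return jaccard(tokenize(a.title), tokenize(b.title)) > 0.65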
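The excerpt above calls `jaccard` and `tokenize` without showing them. One plausible way to write them, plus a greedy loop that applies the 0.65 threshold, is sketched below. The `Item` shape, the word-level tokenizer, and the `dedupe` loop are assumptions, not the repo's actual code.

```python
import re
from collections import namedtuple

# Hypothetical item shape; the real script may use a dict or a class.
Item = namedtuple('Item', 'title url')

def tokenize(title):
    """Lowercase word tokens as a set."""
    return set(re.findall(r'\w+', title.lower()))

def jaccard(a, b):
    """Intersection size over union size; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def dedupe(items, threshold=0.65):
    """Greedy near-duplicate removal: keep an item unless it overlaps a kept one."""
    kept = []
    for item in items:
        tokens = tokenize(item.title)
        if all(jaccard(tokens, tokenize(k.title)) <= threshold for k in kept):
            kept.append(item)
    return kept
```

The loop is quadratic in the number of items, which is fine at roughly 600 items per run. A fuller version would presumably use `url_quality` to decide which of two duplicates to keep; this sketch simply keeps the first one seen.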
Rule-based categorization beats an LLM here
- 8 buckets (Politics, Economy, World, Business, Tech, Defence, Sports, Science)
- ~30 keywords per bucket
- Matches in <1ms per item
- No API cost, no rate limits
- Predictable: "RBI rate decision" -> Economy, always
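A keyword matcher along these lines can be sketched in a few lines. The keyword lists below are made up for illustration (the real table has 8 buckets with about 30 keywords each), and the `'Other'` fallback and the most-hits tie-break are assumptions.

```python
import re

# Hypothetical keyword table, truncated to three buckets for the example.
KEYWORDS = {
    'Economy': ['rbi', 'inflation', 'gdp', 'repo rate'],
    'Sports': ['cricket', 'ipl', 'olympics'],
    'Tech': ['ai', 'semiconductor', 'startup'],
}

# One compiled pattern per bucket, matching whole words only so that
# a short keyword like "ai" does not fire inside "said" or "captain".
PATTERNS = {
    bucket: re.compile(r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b')
    for bucket, words in KEYWORDS.items()
}

def categorize(title, default='Other'):
    """Return the bucket with the most keyword hits; ties go to the earlier bucket."""
    text = title.lower()
    best, best_hits = default, 0
    for bucket, pattern in PATTERNS.items():
        hits = len(pattern.findall(text))
        if hits > best_hits:
            best, best_hits = bucket, hits
    return best
```

With this table, `categorize("RBI rate decision")` returns `'Economy'` every time, which is the predictability argument in practice.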
Today's actual brief (5 Jun 2026)
- 48 stories from 600+ raw pulls (92% dedupe rate)
- 7 top stories (FT, Al Jazeera, NDTV, India Today)
- ~600 words, ~3 min read
- Landed in inbox at 7:00 AM IST
Why this beats the alternatives
- Apple News / Google Discover: too US-centric, weak India coverage
- Twitter lists: noisy, no dedupe, no archival
- Manual scanning: 1+ hour/day, and I still miss stories
- Other RSS readers: they aggregate but don't dedupe or categorize
Stack
- Python 3.10+, stdlib only
- Runs as a scheduled agent on Zo Computer
- Sends via Gmail SMTP
- 293 LOC total, single file
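For the Gmail SMTP step, a stdlib-only sketch might look like the following. The function names and parameters are assumptions; the only fixed facts are that Gmail accepts SMTP over SSL at smtp.gmail.com on port 465 and that an app password is needed when 2-step verification is enabled.

```python
import smtplib
from email.message import EmailMessage
from email.utils import formatdate

def build_message(sender, recipient, subject, body):
    """Build a plain-text email; attaching a PDF is left out of this sketch."""
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = subject
    msg['Date'] = formatdate(localtime=True)
    msg.set_content(body)
    return msg

def send_email(msg, user, app_password, host='smtp.gmail.com', port=465):
    """Send over an SSL connection; credentials are supplied by the caller."""
    with smtplib.SMTP_SSL(host, port) as smtp:
        smtp.login(user, app_password)
        smtp.send_message(msg)
```

Keeping message construction separate from sending makes the brief easy to preview locally before any mail goes out.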
GitHub: https://github.com/AmSach/india-daily
Happy to answer questions on the dedupe logic or categorization rules. If anyone wants a different regional feed set (US, EU, SEA), drop a comment.