MFS CORP

Build a Real-Time News Aggregator with Python, RSS, and Telegram in Under 100 Lines

Want to build your own automated news channel? Here's exactly how I did it — the complete architecture, code patterns, and lessons learned.

The Stack

  • Python 3 (stdlib only — no pip installs needed)
  • RSS feeds (free, reliable, real-time)
  • Telegram Bot API (free, unlimited messages to channels)
  • Cron (15-minute intervals)
  • SearXNG (optional: self-hosted search fallback)

Total cost: $0/month. Runs on any Linux box, VPS, or even a Raspberry Pi.

Step 1: Create Your Telegram Bot

  1. Message @BotFather on Telegram
  2. Send /newbot and follow the prompts
  3. Save your bot token
  4. Create a public channel (e.g., @YourNewsChannel)
  5. Add your bot as an admin with posting rights
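
Before wiring anything else, it's worth confirming the token actually works. A minimal stdlib-only sketch (the `api_url` and `check_bot` helper names are mine, not part of the Bot API):

```python
import json
import urllib.request

def api_url(token: str, method: str) -> str:
    """Build a Telegram Bot API endpoint URL."""
    return f'https://api.telegram.org/bot{token}/{method}'

def check_bot(token: str) -> str:
    """Call getMe and return the bot's username; raise if the token is bad."""
    with urllib.request.urlopen(api_url(token, 'getMe'), timeout=10) as resp:
        info = json.loads(resp.read())
    if not info.get('ok'):
        raise RuntimeError(f'Telegram rejected the token: {info}')
    return info['result']['username']
```

If `check_bot` prints your bot's username, steps 1–3 are done correctly.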

Step 2: Find RSS Feeds

Most major news sites still offer RSS. Here's how to find them:

# Common RSS URL patterns:
# /feed/
# /rss/
# /rss.xml
# /feeds/rss/headlines
# /atom.xml

# Example feeds:
FEEDS = [
    ('TechCrunch', 'https://techcrunch.com/feed/'),
    ('Ars Technica', 'https://feeds.arstechnica.com/arstechnica/index'),
    ('The Verge', 'https://www.theverge.com/rss/index.xml'),
    ('Hacker News', 'https://hnrss.org/frontpage?points=100'),
]

Pro tip: If a site doesn't have RSS, use Google News RSS:

https://news.google.com/rss/search?q=site:example.com&hl=en-US
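
Since the query rides in a URL parameter, characters like the colon in `site:` need escaping. A tiny illustrative helper (my own naming) built on `urllib.parse.quote_plus`:

```python
from urllib.parse import quote_plus

def google_news_feed(query: str, lang: str = 'en-US') -> str:
    """Build a Google News RSS search URL for an arbitrary query."""
    return f'https://news.google.com/rss/search?q={quote_plus(query)}&hl={lang}'
```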

Step 3: The Core Engine (~60 lines)

import urllib.request
import xml.etree.ElementTree as ET
import json, hashlib, re, os
from datetime import datetime, timezone
from html import unescape

def fetch_rss(url, max_items=10):
    """Fetch and parse an RSS feed. Returns list of stories."""
    stories = []
    req = urllib.request.Request(url, headers={
        'User-Agent': 'Mozilla/5.0 (compatible; NewsBot/1.0)'
    })
    with urllib.request.urlopen(req, timeout=12) as resp:
        root = ET.fromstring(resp.read())

    for item in root.findall('.//item')[:max_items]:
        title = item.findtext('title', '').strip()
        link = item.findtext('link', '').strip()
        desc = item.findtext('description', '').strip()
        desc = re.sub(r'<[^>]+>', '', unescape(desc))[:200]
        pub = item.findtext('pubDate', '')

        if title and link:
            stories.append({
                'title': unescape(title),
                'url': link,
                'desc': desc,
                'pub': pub
            })
    return stories
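
To feed the dedup step below, the per-feed results get flattened into one list. A sketch of that glue (`collect_stories` is my own helper name); passing the fetcher in as an argument keeps it testable, and a failing feed is logged and skipped so one outage doesn't abort the whole run:

```python
def collect_stories(feeds, fetcher):
    """Gather stories from every (name, url) pair, tagging each with its source."""
    stories = []
    for name, url in feeds:
        try:
            for story in fetcher(url):
                story['source'] = name
                stories.append(story)
        except Exception as exc:
            print(f'[warn] {name} failed: {exc}')
    return stories

# all_stories = collect_stories(FEEDS, fetch_rss)
```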

Step 4: Deduplication

Without dedup, you'll post the same AP/Reuters story from 6 different sources:

def story_hash(title):
    clean = re.sub(r'[^a-z0-9 ]', '', title.lower().strip())
    return hashlib.md5(clean[:80].encode()).hexdigest()[:12]

# Load previously posted story hashes (state.json persists across runs)
if os.path.exists('state.json'):
    with open('state.json') as f:
        state = json.load(f)
else:
    state = {}
posted = state.get('posted', {})

# Filter new stories
new_stories = []
for story in all_stories:
    h = story_hash(story['title'])
    if h not in posted:
        new_stories.append(story)
        posted[h] = {'title': story['title'], 'ts': datetime.now(timezone.utc).isoformat()}
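
One thing the snippet above doesn't show: `posted` has to be written back to `state.json` at the end of each run, or the dedup resets every 15 minutes. An atomic-write sketch (the helper name is mine) so a crash mid-write can't corrupt the state file:

```python
import json
import os
import tempfile

def save_state(posted, path='state.json'):
    """Persist posted-story hashes atomically: write a temp file, then rename."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
    with os.fdopen(fd, 'w') as f:
        json.dump({'posted': posted}, f)
    os.replace(tmp, path)  # rename is atomic on POSIX filesystems
```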

Step 5: Freshness Filter

Only post stories from the last hour — nobody wants yesterday's news:

from datetime import datetime, timezone

def is_fresh(pub_date_str, max_hours=1):
    formats = [
        '%a, %d %b %Y %H:%M:%S %z',
        '%Y-%m-%dT%H:%M:%S%z',
        '%Y-%m-%dT%H:%M:%SZ',
    ]
    for fmt in formats:
        try:
            dt = datetime.strptime(pub_date_str.strip(), fmt)
            if dt.tzinfo is None:
                dt = dt.replace(tzinfo=timezone.utc)
            age = datetime.now(timezone.utc) - dt
            return age.total_seconds() < (max_hours * 3600)
        except ValueError:
            continue
    return True  # Can't parse = assume fresh
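
One caveat: strptime's `%z` won't parse timezone names like `GMT` or `EST`, which some feeds use. The stdlib's `email.utils` understands RFC 822 dates (the format RSS mandates), so a more tolerant drop-in variant is possible; this is my suggested alternative, not what the original bot ran:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_fresh_rfc822(pub_date_str, max_hours=1):
    """Like is_fresh, but tolerant of GMT/EST-style timezone names."""
    try:
        dt = parsedate_to_datetime(pub_date_str.strip())
    except (TypeError, ValueError):
        return True  # can't parse = assume fresh, same policy as above
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    age = datetime.now(timezone.utc) - dt
    return age.total_seconds() < max_hours * 3600
```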

Step 6: Image Extraction

Posts with images get 3-5x more engagement:

def extract_og_image(url):
    """Scrape og:image from article page."""
    try:
        req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        html = urllib.request.urlopen(req, timeout=5).read(100000)
        html = html.decode('utf-8', errors='ignore')
        match = re.search(
            r'<meta[^>]*property=["\']og:image["\'][^>]*content=["\']'
            r'(https?://[^"\']+)["\']', html
        )
        if not match:
            # Some pages emit content= before property=
            match = re.search(
                r'<meta[^>]*content=["\'](https?://[^"\']+)["\']'
                r'[^>]*property=["\']og:image["\']', html
            )
        return match.group(1) if match else None
    except Exception:
        return None

Step 7: Post to Telegram

def post_to_telegram(token, channel, text, image_url=None):
    if image_url:
        data = json.dumps({
            'chat_id': channel,
            'photo': image_url,
            'caption': text[:1024],  # Telegram caps photo captions at 1024 chars
            'parse_mode': 'HTML'
        }).encode()
        url = f'https://api.telegram.org/bot{token}/sendPhoto'
    else:
        data = json.dumps({
            'chat_id': channel,
            'text': text,
            'parse_mode': 'HTML'
        }).encode()
        url = f'https://api.telegram.org/bot{token}/sendMessage'

    req = urllib.request.Request(url, data=data,
        headers={'Content-Type': 'application/json'})
    resp = json.loads(urllib.request.urlopen(req, timeout=15).read())
    return resp.get('ok', False)
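
With `parse_mode='HTML'`, a stray `<` or `&` in a headline makes Telegram reject the whole message, so titles and descriptions must be escaped before posting. A formatting sketch (`format_story` is my naming and the layout is illustrative):

```python
from html import escape

def format_story(story):
    """Render a story dict as Telegram-safe HTML."""
    title = escape(story['title'])
    desc = escape(story.get('desc', ''))
    return f'<b>{title}</b>\n\n{desc}\n\n<a href="{story["url"]}">Read more</a>'
```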

Step 8: Cron It

# Run every 15 minutes
*/15 * * * * /usr/bin/python3 /path/to/news_bot.py >> /var/log/news_bot.log 2>&1
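
If a run ever hangs on a slow feed, the next cron invocation can pile on top of it. Wrapping the job in flock prevents overlapping runs; this assumes util-linux's flock is installed, which it is on most distros:

```shell
# Skip this run entirely if the previous one still holds the lock
*/15 * * * * /usr/bin/flock -n /tmp/news_bot.lock /usr/bin/python3 /path/to/news_bot.py >> /var/log/news_bot.log 2>&1
```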

Quality Controls I Added Later

These came from running the system for a week:

1. Content Filtering

BBC's main RSS feed includes sports, lifestyle, and entertainment. Filter by category or use specific feeds:

EXCLUDE = re.compile(r'football|soccer|rugby|cricket|recipe|horoscope', re.I)
if EXCLUDE.search(title):
    continue  # Skip off-topic

2. Fuzzy Deduplication

Same story from AP appears as slightly different headlines on CNN, BBC, Guardian:

def is_near_duplicate(new_title, existing_titles, threshold=0.7):
    words_new = set(new_title.lower().split())
    for existing in existing_titles:
        words_ex = set(existing.lower().split())
        overlap = len(words_new & words_ex) / max(len(words_new), len(words_ex))
        if overlap >= threshold:
            return True
    return False

3. Default Images

Some RSS feeds don't include images. Always have a fallback:

DEFAULT_IMAGES = {
    'tech': 'https://images.unsplash.com/photo-1518770660439-4636190af475?w=800',
    'world': 'https://images.unsplash.com/photo-1451187580459-43490279c0fa?w=800',
}
image = story.get('image') or extract_og_image(url) or DEFAULT_IMAGES.get(category)  # .get avoids a KeyError for unmapped categories

Results

I currently run four channels on this architecture.

Total infrastructure cost: $0/month (runs on existing server).

What's Next

  • AI-generated summaries and analysis layer
  • Premium tier with sentiment analysis
  • More niche channels (sports, science, gaming)
  • Telegram Mini App for custom news feeds

Questions? Drop a comment. Want to see the full source? Check out MFS Corp on GitHub.

Part of the Building MFS Corp series — documenting how we're building an AI-powered company from scratch.
