DEV Community

agenthustler

Scraping Hacker News: Building a Tech Trend Tracker

Hacker News is one of the best sources for tracking emerging tech trends. In this guide, we'll build a Python scraper that monitors HN stories and identifies trending topics over time.

Why Scrape Hacker News?

HN's official API is great, but it doesn't give you trend analysis out of the box. By scraping and aggregating data, you can:

  • Track which technologies gain momentum week over week
  • Identify emerging frameworks before they go mainstream
  • Monitor sentiment around specific tools or companies

Setting Up the Scraper

First, install the dependencies:

pip install requests beautifulsoup4 pandas

Fetching Top Stories via API

HN provides a public API. Let's start there:

import requests
import time
from datetime import datetime

def get_top_stories(limit=100):
    url = "https://hacker-news.firebaseio.com/v0/topstories.json"
    story_ids = requests.get(url, timeout=10).json()[:limit]
    stories = []

    for sid in story_ids:
        item = requests.get(
            f"https://hacker-news.firebaseio.com/v0/item/{sid}.json",
            timeout=10,
        ).json()
        if item and item.get("title"):
            stories.append({
                "title": item["title"],
                "score": item.get("score", 0),
                "url": item.get("url", ""),
                "time": datetime.fromtimestamp(item["time"]),
                "comments": item.get("descendants", 0)
            })
        time.sleep(0.1)  # be polite: throttle the per-item requests
    return stories

Extracting Keywords for Trend Analysis

import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "in", "to", "and", "of", "for", "on", "with", "how", "why"}

def extract_trends(stories, top_n=20):
    words = []
    for story in stories:
        tokens = re.findall(r'[A-Za-z]+', story["title"].lower())
        words.extend([w for w in tokens if w not in STOP_WORDS and len(w) > 2])
    return Counter(words).most_common(top_n)

stories = get_top_stories(200)
trends = extract_trends(stories)
for word, count in trends:
    print(f"{word}: {count} mentions")

Building a Daily Tracker

import json, os

TRACKER_FILE = "hn_trends.json"

def update_tracker(trends):
    history = {}
    if os.path.exists(TRACKER_FILE):
        with open(TRACKER_FILE) as f:
            history = json.load(f)

    today = datetime.now().strftime("%Y-%m-%d")
    history[today] = dict(trends)

    with open(TRACKER_FILE, "w") as f:
        json.dump(history, f, indent=2)
    return history

def detect_rising_trends(history, days=7):
    dates = sorted(history.keys())[-days:]
    if len(dates) < 2:
        return []
    recent = history[dates[-1]]
    older = history[dates[0]]
    rising = []
    for word, count in recent.items():
        old_count = older.get(word, 0)
        if count > old_count * 1.5 and count >= 3:
            rising.append((word, old_count, count))
    return sorted(rising, key=lambda x: x[2] - x[1], reverse=True)
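
To make the threshold concrete, here's a minimal sketch of the rule detect_rising_trends applies per keyword: it counts as rising when its recent count is more than 1.5x the older count and it has at least 3 recent mentions. The counts below are synthetic, not real HN data:

```python
def is_rising(old_count, new_count):
    # The same per-keyword rule used inside detect_rising_trends
    return new_count > old_count * 1.5 and new_count >= 3

print(is_rising(2, 4))   # 4 > 3.0 and 4 >= 3 -> True
print(is_rising(0, 2))   # only 2 recent mentions -> False
print(is_rising(4, 5))   # 5 is not > 6.0 -> False
```

Requiring a minimum of 3 mentions keeps one-off words that jump from 0 to 1 out of the rising list.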

Scaling with Proxy Services

When you run this tracker continuously, or extend it to scrape HN's HTML pages directly, you'll want to rotate IPs to avoid rate limits. ScraperAPI handles proxy rotation automatically:

from urllib.parse import quote

def fetch_with_proxy(url):
    # URL-encode the target so its own query string survives as one parameter
    proxy_url = f"http://api.scraperapi.com?api_key=YOUR_KEY&url={quote(url, safe='')}"
    return requests.get(proxy_url, timeout=60).json()

For high-volume scraping, ThorData offers residential proxies suited to sustained monitoring.
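
With plain requests, routing traffic through a residential gateway is just a proxies dict. The host, port, and credentials below are placeholders, not any provider's actual endpoint — check your provider's dashboard for the real values:

```python
def residential_proxies(username, password, host="gw.example-proxy.com", port=9999):
    # Placeholder gateway -- substitute your provider's real host/port/credentials
    gateway = f"http://{username}:{password}@{host}:{port}"
    return {"http": gateway, "https": gateway}

proxies = residential_proxies("USER", "PASS")
# requests.get(url, proxies=proxies, timeout=30) then routes through the gateway
```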

Storing and Visualizing Results

import pandas as pd

def trends_to_dataframe(history):
    records = []
    for date, keywords in history.items():
        for word, count in keywords.items():
            records.append({"date": date, "keyword": word, "count": count})
    return pd.DataFrame(records)

history = update_tracker(trends)  # reuse the daily tracker's history
df = trends_to_dataframe(history)
pivot = df.pivot_table(index="date", columns="keyword", values="count", fill_value=0)
print(pivot.tail(7))
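
The pivot table also makes date-over-date comparisons trivial. A minimal sketch with synthetic counts (the dates and keywords are made up for illustration):

```python
import pandas as pd

def change_between_dates(pivot, keyword):
    # Mentions gained (or lost) between the first and last tracked dates
    series = pivot[keyword]
    return int(series.iloc[-1] - series.iloc[0])

pivot = pd.DataFrame(
    {"rust": [2, 6], "ai": [5, 5]},
    index=["2024-01-01", "2024-01-08"],
)
print(change_between_dates(pivot, "rust"))  # 4
print(change_between_dates(pivot, "ai"))    # 0
```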

Monitoring Your Scraper

Use ScrapeOps to monitor your scraper's success rate, response times, and data quality over time.
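
If you'd rather not wire up a hosted dashboard right away, a tiny self-rolled counter covers the basics. This is a generic sketch, not ScrapeOps' API:

```python
class ScrapeStats:
    """Minimal request tracker: a stand-in for a hosted monitoring service."""

    def __init__(self):
        self.ok = 0
        self.failed = 0

    def record(self, success):
        # Call after each fetch with True/False depending on the outcome
        if success:
            self.ok += 1
        else:
            self.failed += 1

    def success_rate(self):
        total = self.ok + self.failed
        return self.ok / total if total else 0.0

stats = ScrapeStats()
for outcome in [True, True, True, False]:
    stats.record(outcome)
print(f"success rate: {stats.success_rate():.0%}")  # success rate: 75%
```

Logging the rate once per run gives you a baseline, so a sudden drop flags a blocked IP or a markup change.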

Conclusion

Building a Hacker News trend tracker is an excellent way to stay ahead of the tech curve. Start with the API, add keyword extraction, and let it run daily. After a few weeks, you'll have a dataset showing where developer attention is heading.
