Hacker News is one of the best sources for tracking emerging tech trends. In this guide, we'll build a Python scraper that monitors HN stories and identifies trending topics over time.
Why Scrape Hacker News?
HN's official API is great, but it doesn't give you trend analysis out of the box. By scraping and aggregating data, you can:
- Track which technologies gain momentum week over week
- Identify emerging frameworks before they go mainstream
- Monitor sentiment around specific tools or companies
Setting Up the Scraper
First, install the dependencies (requests for the API calls, pandas for the analysis later on):

pip install requests pandas
Fetching Top Stories via API
HN provides a public API. Let's start there:
import requests
import time
from datetime import datetime

def get_top_stories(limit=100):
    """Fetch the current top stories from the official HN API."""
    url = "https://hacker-news.firebaseio.com/v0/topstories.json"
    story_ids = requests.get(url, timeout=10).json()[:limit]
    stories = []
    for sid in story_ids:
        item = requests.get(
            f"https://hacker-news.firebaseio.com/v0/item/{sid}.json",
            timeout=10,
        ).json()
        if item and item.get("title"):
            stories.append({
                "title": item["title"],
                "score": item.get("score", 0),
                "url": item.get("url", ""),
                "time": datetime.fromtimestamp(item["time"]),
                "comments": item.get("descendants", 0),
            })
        time.sleep(0.1)  # be polite to the API
    return stories
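Fetching 100 items one at a time, with a 0.1-second pause between each, takes well over ten seconds. One way to speed this up (a sketch of mine, not part of the original tracker) is to fetch items concurrently with a thread pool. The fetch function is passed in as a parameter, so the logic can be exercised without the network; `get_item_json` is a hypothetical helper name:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

def get_item_json(sid):
    # Hypothetical helper: fetch one HN item by id.
    url = f"https://hacker-news.firebaseio.com/v0/item/{sid}.json"
    return requests.get(url, timeout=10).json()

def fetch_items_concurrently(story_ids, fetch=get_item_json, max_workers=8):
    """Fetch many HN items in parallel, preserving order; drop items with no title."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        items = list(pool.map(fetch, story_ids))
    return [item for item in items if item and item.get("title")]
```

Keep max_workers modest; the HN API is generous, but a flood of parallel requests is still bad manners.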
Extracting Keywords for Trend Analysis
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "in", "to", "and", "of", "for", "on", "with", "how", "why"}

def extract_trends(stories, top_n=20):
    words = []
    for story in stories:
        tokens = re.findall(r'[A-Za-z]+', story["title"].lower())
        words.extend([w for w in tokens if w not in STOP_WORDS and len(w) > 2])
    return Counter(words).most_common(top_n)
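Raw mention counts treat a 5-point story the same as a 500-point one. A small variant (my addition, not from the original tracker) weights each keyword by the score of the story it appears in, so highly upvoted stories dominate the ranking:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "in", "to", "and", "of", "for", "on", "with", "how", "why"}

def extract_weighted_trends(stories, top_n=20):
    """Like extract_trends, but each keyword mention counts for the story's score."""
    weights = Counter()
    for story in stories:
        tokens = re.findall(r'[A-Za-z]+', story["title"].lower())
        for w in tokens:
            if w not in STOP_WORDS and len(w) > 2:
                weights[w] += story.get("score", 0)
    return weights.most_common(top_n)
```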
stories = get_top_stories(200)
trends = extract_trends(stories)
for word, count in trends:
    print(f"{word}: {count} mentions")
Building a Daily Tracker
import json, os

TRACKER_FILE = "hn_trends.json"

def update_tracker(trends):
    history = {}
    if os.path.exists(TRACKER_FILE):
        with open(TRACKER_FILE) as f:
            history = json.load(f)
    today = datetime.now().strftime("%Y-%m-%d")
    history[today] = dict(trends)
    with open(TRACKER_FILE, "w") as f:
        json.dump(history, f, indent=2)
    return history
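One failure mode worth guarding against in a daily tracker: if the process dies mid-write, hn_trends.json is left truncated and the next run crashes on json.load. A common hedge (my addition, assuming a POSIX-or-Windows filesystem) is to write to a temporary file in the same directory and atomically swap it into place:

```python
import json
import os
import tempfile

def save_history_atomic(history, path):
    """Write JSON to a temp file in the target directory, then atomically replace."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(history, f, indent=2)
        os.replace(tmp_path, path)  # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Readers of the file then see either the old complete snapshot or the new one, never a half-written mix.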
def detect_rising_trends(history, days=7):
    dates = sorted(history.keys())[-days:]
    if len(dates) < 2:
        return []
    recent = history[dates[-1]]
    older = history[dates[0]]
    rising = []
    for word, count in recent.items():
        old_count = older.get(word, 0)
        if count > old_count * 1.5 and count >= 3:
            rising.append((word, old_count, count))
    return sorted(rising, key=lambda x: x[2] - x[1], reverse=True)
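The 1.5x threshold has a blind spot for new keywords: anything absent a week ago and mentioned three times today qualifies, and the absolute-difference sort then mixes genuinely hot topics with one-off noise. A smoothed growth ratio (an alternative of mine, not from the tracker above) gives a steadier ranking:

```python
def growth_ratio(old_count, new_count, smoothing=1):
    """Additive-smoothed ratio; avoids division by zero for brand-new keywords."""
    return (new_count + smoothing) / (old_count + smoothing)

def rank_by_growth(older, recent, min_count=3):
    """Rank keywords in `recent` by smoothed growth versus their `older` counts."""
    ranked = [
        (word, growth_ratio(older.get(word, 0), count))
        for word, count in recent.items()
        if count >= min_count
    ]
    return sorted(ranked, key=lambda x: x[1], reverse=True)
```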
Scaling with Proxy Services
When you run this tracker continuously, you'll want to rotate IPs to avoid rate limits. ScraperAPI handles proxy rotation automatically:
def fetch_with_proxy(url):
    # Pass the target URL as a query parameter so requests percent-encodes it;
    # interpolating it into the string by hand breaks URLs that contain their own query strings.
    params = {"api_key": "YOUR_KEY", "url": url}
    return requests.get("http://api.scraperapi.com", params=params, timeout=60).json()
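Even behind a proxy, individual requests still fail now and then. A generic retry helper with exponential backoff (a sketch of mine; the injectable `sleep` parameter is there so tests don't actually wait) pairs well with either fetch path:

```python
import time

def with_retries(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff.

    Re-raises the last exception once all attempts are exhausted.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Usage would look like `data = with_retries(lambda: fetch_with_proxy(url))`.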
For high-volume scraping, ThorData provides residential proxies that are perfect for sustained monitoring.
Storing and Visualizing Results
import pandas as pd

def trends_to_dataframe(history):
    records = []
    for date, keywords in history.items():
        for word, count in keywords.items():
            records.append({"date": date, "keyword": word, "count": count})
    return pd.DataFrame(records)
df = trends_to_dataframe(history)
pivot = df.pivot_table(index="date", columns="keyword", values="count", fill_value=0)
print(pivot.tail(7))
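The pivot table is good for eyeballing the full matrix; for a one-line daily digest, a grouped reduction over the long-format frame works too (a small addition using standard pandas):

```python
import pandas as pd

def top_keyword_per_day(df):
    """Return one row per date: the keyword with the highest count that day."""
    idx = df.groupby("date")["count"].idxmax()
    return df.loc[idx, ["date", "keyword", "count"]].reset_index(drop=True)
```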
Monitoring Your Scraper
Use ScrapeOps to monitor your scraper's success rate, response times, and data quality over time.
Conclusion
Building a Hacker News trend tracker is an excellent way to stay ahead of the tech curve. Start with the API, add keyword extraction, and let it run daily. After a few weeks, you'll have a powerful dataset showing exactly where the tech industry is heading.