DEV Community

FairPrice
How I Built a Hacker News Trend Detector Using Only Public Data

Every day, thousands of stories compete for the Hacker News front page. I wanted to detect trending topics before they blow up — using nothing but public APIs and a bit of Python.

Here's how I built a simple HN trend detector, and what I learned about the data along the way.

The HN Firebase API

Hacker News runs on a public Firebase API. No auth needed. The two endpoints that matter:

import requests

# Get current top 500 story IDs
top_ids = requests.get("https://hacker-news.firebaseio.com/v0/topstories.json").json()

# Get details for a single story
story = requests.get(f"https://hacker-news.firebaseio.com/v0/item/{top_ids[0]}.json").json()
print(story["title"], story["score"], story["descendants"])  # descendants = comment count

This gives you title, score, author, timestamp, comment count, and URL for every item. Simple, but powerful.
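One wrinkle worth handling up front: the API returns JSON `null` for missing items, and deleted or dead posts come back with a `deleted` or `dead` flag. A minimal fetch helper sketch (`BASE` and `fetch_item` are my own names, not part of the API; the endpoint URLs are real):

```python
import requests

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_item(session, item_id):
    """Fetch one HN item. Returns None for deleted, dead, or missing
    items, which the API serves as JSON null or with a 'deleted' flag."""
    resp = session.get(f"{BASE}/item/{item_id}.json", timeout=10)
    resp.raise_for_status()
    item = resp.json()
    if not item or item.get("deleted") or item.get("dead"):
        return None
    return item
```

Using a `requests.Session` here also reuses the TCP connection across the hundreds of per-item calls you end up making.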

Detecting Velocity, Not Just Score

A story with 300 points after 12 hours is stale. A story with 80 points after 40 minutes is exploding. The key metric is points per hour:

import time

def velocity(story):
    age_hours = (time.time() - story["time"]) / 3600
    if age_hours < 0.1:
        return 0  # too new to judge
    return story["score"] / age_hours

# Fetch top 30 stories and rank by velocity
stories = []
for sid in top_ids[:30]:
    s = requests.get(f"https://hacker-news.firebaseio.com/v0/item/{sid}.json").json()
    if not s or "score" not in s:
        continue  # deleted/dead items come back as null or without a score
    s["velocity"] = velocity(s)
    stories.append(s)

stories.sort(key=lambda s: s["velocity"], reverse=True)

for s in stories[:10]:
    print(f"{s[\"velocity\"]:.1f} pts/hr | {s[\"score\"]} pts | {s[\"title\"]}")

Sample output:

142.3 pts/hr | 87 pts | Show HN: I made a tool that...
 98.1 pts/hr | 203 pts | The hidden cost of...
 67.4 pts/hr | 312 pts | Why we switched from...

Stories with high velocity in their first 1-2 hours almost always make it to #1.
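That observation can be turned into a one-function filter. A sketch with thresholds I picked myself (the 50 pts/hr floor and 2-hour window are tunable guesses, not measured constants; `is_breakout` is my own name):

```python
import time

def is_breakout(story, min_velocity=50, max_age_hours=2, now=None):
    """True when a story is both young and rising fast.
    Thresholds are heuristic guesses, not official HN values."""
    now = time.time() if now is None else now
    age_hours = (now - story["time"]) / 3600
    if not 0.1 <= age_hours <= max_age_hours:
        return False  # too new to judge, or past the breakout window
    return story.get("score", 0) / age_hours >= min_velocity
```

The injectable `now` makes the function testable without freezing the clock.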

Adding Comment Engagement

HN comments are gold. A story with 200 points but hostile comments won't last. I added a simple ratio check:

def engagement_ratio(story):
    comments = story.get("descendants", 0)
    if story["score"] == 0:
        return 0
    return comments / story["score"]

Ratios above 1.5 usually mean controversy. Below 0.3 means people upvote but don't discuss — often link-heavy or self-explanatory content. The sweet spot (0.5-1.2) indicates genuine interest.
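Those cutoffs can be wrapped into a labelling helper. `engagement_label` and its bucket names are my own invention; the 0.3 / 0.5 / 1.2 / 1.5 boundaries come straight from the paragraph above:

```python
def engagement_label(story):
    """Map the comment/score ratio to rough buckets.
    Boundary values are heuristics, not anything official."""
    score = story.get("score", 0)
    if score == 0:
        return "no-signal"
    ratio = story.get("descendants", 0) / score
    if ratio > 1.5:
        return "controversial"
    if ratio < 0.3:
        return "low-discussion"
    if 0.5 <= ratio <= 1.2:
        return "sweet-spot"
    return "borderline"  # the 0.3-0.5 and 1.2-1.5 gaps
```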

Tracking Topics Over Time

To spot trends across days, I store snapshots in SQLite:

import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect("hn_trends.db")
db.execute("""CREATE TABLE IF NOT EXISTS snapshots (
    id INTEGER, title TEXT, score INTEGER,
    comments INTEGER, velocity REAL,
    captured_at TEXT
)""")

for s in stories:
    db.execute(
        "INSERT INTO snapshots VALUES (?,?,?,?,?,?)",
        (s["id"], s["title"], s["score"],
         s.get("descendants", 0), s["velocity"],
         # datetime.utcnow() is deprecated since Python 3.12
         datetime.now(timezone.utc).isoformat())
    )
db.commit()

Run this every 30 minutes via cron, and after a week you can query for patterns:

-- Topics that appeared on front page 3+ times this week
SELECT title, COUNT(*) as appearances, MAX(score) as peak_score
FROM snapshots
WHERE captured_at > datetime('now', '-7 days')
GROUP BY id
HAVING appearances >= 3
ORDER BY peak_score DESC;
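The same query runs fine from Python. A self-contained sketch against an in-memory database with fake rows, so it can be exercised without a week of cron history (`recurring_topics` is a name I made up):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def recurring_topics(db, min_appearances=3):
    """Stories captured at least min_appearances times in the last week."""
    return db.execute(
        """SELECT title, COUNT(*) AS appearances, MAX(score) AS peak_score
           FROM snapshots
           WHERE captured_at > datetime('now', '-7 days')
           GROUP BY id
           HAVING appearances >= ?
           ORDER BY peak_score DESC""",
        (min_appearances,),
    ).fetchall()

# Demo data: one story snapshotted four times, one seen once.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE snapshots (
    id INTEGER, title TEXT, score INTEGER,
    comments INTEGER, velocity REAL, captured_at TEXT)""")
now = datetime.now(timezone.utc)
rows = [(1, "Recurring story", 100 + i, 40, 55.0,
         (now - timedelta(hours=i)).isoformat()) for i in range(4)]
rows.append((2, "One-off story", 300, 90, 80.0, now.isoformat()))
db.executemany("INSERT INTO snapshots VALUES (?,?,?,?,?,?)", rows)
```

Note that SQLite's `datetime('now', ...)` works in UTC, which is why the snapshots store UTC timestamps.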

Scaling Up: When the Firebase API Isn't Enough

The Firebase API is great for real-time data, but it has limits:

  • No bulk export (you fetch one item at a time)
  • No historical data beyond current top/new/best lists
  • Comment trees require recursive fetching (slow for 500+ comment threads)
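To make that last cost concrete, here is roughly what a hand-rolled recursive fetcher looks like: one HTTP round trip per comment, which is exactly why 500-comment threads are slow. The `fetch` parameter is injectable so the sketch can be exercised without the network; `fetch_comment_tree` and `ITEM_URL` are my own names:

```python
import requests

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def fetch_comment_tree(item_id, fetch=None):
    """Recursively fetch a comment subtree, one request per comment.
    Deleted/dead comments are pruned from the result."""
    if fetch is None:
        fetch = lambda i: requests.get(ITEM_URL.format(i), timeout=10).json()
    item = fetch(item_id)
    if not item or item.get("deleted") or item.get("dead"):
        return None
    item["children"] = [
        c for kid in item.get("kids", [])
        if (c := fetch_comment_tree(kid, fetch)) is not None
    ]
    return item
```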

If you need structured historical data or full comment threads at scale, Apify has several HN scrapers that handle the heavy lifting. I've been using HN Top Stories Scraper which pulls structured data including full comment threads — useful when you want to analyze discussion patterns without writing your own recursive crawler.

The Full Pipeline

My production setup:

  1. Cron job every 30 min → fetches top 100 via Firebase API
  2. Velocity calculator flags stories above 50 pts/hr
  3. SQLite storage for historical analysis
  4. Weekly digest email with recurring topics and velocity outliers

Total code: ~120 lines of Python. No ML, no fancy NLP. Just velocity math and some SQL.

What I Found

After running this for a few weeks:

  • AI/LLM stories consistently hit the highest velocities (80+ pts/hr)
  • Show HN posts have the best engagement ratios
  • Stories posted between 9-11am ET get 2x the velocity of evening posts
  • The comment-to-score ratio reliably predicts whether a story stays on the front page

The full code is straightforward enough to run on any $5 VPS. If you're interested in HN data analysis, start with the Firebase API — it's surprisingly capable for a free, unauthenticated endpoint.


What patterns have you noticed on HN? Drop a comment if you've built something similar.
