FairPrice
How to build a free HN data pipeline in 30 minutes

Hacker News is one of the richest sources of signal in tech. New frameworks, hiring waves, shifting sentiment — it all shows up on HN before it hits mainstream. But scraping HN yourself is tedious and fragile.

In this tutorial, I'll walk you through building a lightweight data pipeline that pulls structured HN data on a schedule, stores it locally, and runs basic trend detection — all for free.

The Data Source

We'll use the HN Top Stories actor on Apify, which returns clean JSON for top, new, best, ask, show, and job stories. It handles pagination, rate limits, and retries so you don't have to.

Apify's free tier gives you enough compute to run this daily without paying a cent.

Step 1: Fetch HN Data

Install the Apify client:

pip install apify-client

Then pull the latest top stories:

from apify_client import ApifyClient
from datetime import datetime

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("cryptosignals/hn-top-stories").call(
    run_input={"category": "topstories", "maxItems": 100}
)

items = list(
    client.dataset(run["defaultDatasetId"]).iterate_items()
)
print(f"Fetched {len(items)} stories")

Each item gives you the title, URL, score, author, comment count, and timestamp — everything you need for analysis.
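For reference, a single item looks roughly like this. The field names match what the code in this tutorial reads; the values are illustrative, and real items may carry additional fields:

```python
# Illustrative shape of one dataset item. Field names are the ones this
# tutorial's code uses; values are made up for illustration, and the actor
# may return extra fields beyond these.
sample_item = {
    "id": 38123456,
    "title": "Show HN: A tiny HN data pipeline",
    "url": "https://example.com/post",
    "score": 142,
    "descendants": 57,   # comment count
    "by": "someuser",
    "time": 1700000000,  # Unix timestamp (seconds)
}
```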

Step 2: Store Results in SQLite

Save each run to a local database so you can track changes over time:

import sqlite3

db = sqlite3.connect("hn_pipeline.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS stories ("
    "id INTEGER, title TEXT, url TEXT, score INTEGER, "
    "comments INTEGER, author TEXT, fetched_at TEXT, "
    "PRIMARY KEY (id, fetched_at))"
)

# Use SQLite's own datetime format ("YYYY-MM-DD HH:MM:SS") so later string
# comparisons against datetime('now', ...) line up exactly
fetched_at = datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")
for item in items:
    db.execute(
        "INSERT OR IGNORE INTO stories VALUES (?,?,?,?,?,?,?)",
        (item["id"], item["title"], item.get("url", ""),
         item["score"], item.get("descendants", 0),
         item["by"], fetched_at)
    )
db.commit()
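Because rows are keyed by (id, fetched_at), each run snapshots every story's stats, so you can chart how a single story climbs. A small helper (a sketch, not part of the original pipeline, shown here against an in-memory database with made-up rows):

```python
import sqlite3

def score_history(db, story_id):
    """Return (fetched_at, score) snapshots for one story, oldest first."""
    return db.execute(
        "SELECT fetched_at, score FROM stories "
        "WHERE id = ? ORDER BY fetched_at",
        (story_id,),
    ).fetchall()

# Demo against an in-memory database with illustrative rows.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE stories (id INTEGER, title TEXT, url TEXT, score INTEGER, "
    "comments INTEGER, author TEXT, fetched_at TEXT, "
    "PRIMARY KEY (id, fetched_at))"
)
db.executemany(
    "INSERT INTO stories VALUES (?,?,?,?,?,?,?)",
    [(1, "Demo", "", 10, 2, "a", "2024-01-01 00:00:00"),
     (1, "Demo", "", 42, 9, "a", "2024-01-01 06:00:00")],
)
print(score_history(db, 1))
```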

Step 3: Detect Trends

With a few days of data, you can spot rising topics:

from collections import Counter

rows = db.execute(
    "SELECT title FROM stories "
    "WHERE fetched_at > datetime('now', '-7 days')"
).fetchall()

STOPWORDS = {"with", "from", "your", "that", "this", "have", "into"}

words = []
for (title,) in rows:
    for w in title.split():
        # strip punctuation so "Rust," and "Rust" count as the same word
        w = w.strip(".,:;()[]'\"!?").lower()
        if len(w) > 3 and w not in STOPWORDS:
            words.append(w)

trends = Counter(words).most_common(20)
for word, count in trends:
    print(f"{word:20s} {count}")

Run this daily and diff against the previous week to catch emerging topics early — useful for content planning, market research, or just staying ahead of the curve.
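The diffing step can be sketched as a comparison of two `Counter`s, this week's against last week's. The thresholds and the counts below are illustrative assumptions, not part of the original pipeline:

```python
from collections import Counter

def rising_terms(current, previous, min_count=3, ratio=2.0):
    """Return (word, current_count, previous_count) for words whose
    frequency grew at least `ratio`-fold week over week."""
    out = []
    for word, count in current.items():
        # previous.get(..., 0.5) lets brand-new words qualify too
        if count >= min_count and count >= ratio * previous.get(word, 0.5):
            out.append((word, count, previous.get(word, 0)))
    return sorted(out, key=lambda t: t[1], reverse=True)

# Illustrative counts
this_week = Counter({"llama": 9, "rust": 6, "python": 5})
last_week = Counter({"python": 5, "rust": 5, "llama": 1})

for word, now, before in rising_terms(this_week, last_week):
    print(f"{word:10s} {before} -> {now}")
```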

Step 4: Automate It

Add a cron job to run on schedule. The HN Top Stories actor also supports scheduled runs natively on Apify, so you can set it to run every 6 hours and have fresh data waiting.

A minimal cron entry:

0 */6 * * * cd ~/hn-pipeline && python3 fetch.py && python3 trends.py

Real Use Case: Job Monitoring

One practical application is monitoring HN "Who is Hiring" threads. Set the actor category to jobstories, then filter for keywords matching your stack:

keywords = ["python", "fastapi", "remote", "senior"]
matches = [
    s for s in items
    if any(k in s["title"].lower() for k in keywords)
]

Pipe matches into a Slack webhook or email digest and you have a free, targeted job alert system.
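A minimal sketch of the Slack hand-off using only the standard library. The webhook URL and the message format are assumptions, not part of the original pipeline:

```python
import json
import urllib.request

def build_digest(stories):
    """Format matched stories as one Slack message body."""
    lines = [f"- {s['title']} ({s.get('url', '')})" for s in stories]
    return "HN matches:\n" + "\n".join(lines)

def post_to_slack(webhook_url, stories):
    """POST the digest to a Slack incoming webhook."""
    payload = json.dumps({"text": build_digest(stories)}).encode()
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example (needs a real webhook URL):
# post_to_slack("https://hooks.slack.com/services/...", matches)
```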

Wrapping Up

The full pipeline is under 50 lines of Python, runs on free-tier infrastructure, and gives you structured access to one of the best signal sources in tech. The Apify HN actor handles the scraping; you just handle the analysis.

Grab the code, set up a schedule, and start mining HN data today.
