
FairPrice


Why I Stopped Using the Hacker News API Directly (and What I Use Instead)

I've been pulling data from Hacker News for over a year. I started where everyone starts: the official Firebase API.

And for about two weeks, it was fine. Then reality set in.

The problem with the HN API

The Hacker News API is technically correct. It returns items by ID. It returns top story IDs. It does what it says.

But if you want to do anything practical — like get the top 50 stories with their comment counts, scores, and metadata in a single call — you're looking at 51 HTTP requests minimum. One for the top stories list, then one per story.

Here's what that looks like in Python:

import requests

HN_BASE = 'https://hacker-news.firebaseio.com/v0'

def get_top_stories(n=50):
    session = requests.Session()  # reuse the connection across calls

    # One request for the list of top story IDs...
    top_ids = session.get(f'{HN_BASE}/topstories.json', timeout=10).json()[:n]

    # ...then one request per story.
    stories = []
    for story_id in top_ids:
        item = session.get(f'{HN_BASE}/item/{story_id}.json', timeout=10).json()
        stories.append(item)

    return stories

This works. It's also painfully slow. On a good day, you're waiting 8-12 seconds for 50 stories. On a bad day with rate limiting, much longer.

You can parallelize with asyncio and aiohttp, but then you're managing connection pools, handling rate limits, retrying failed requests, and parsing inconsistent response shapes. The HN API returns different fields depending on item type (story, comment, poll, job). There's no schema. No pagination for comments. No filtering.
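To give a sense of that boilerplate, here's a minimal sketch of just the retry piece — a helper that retries a flaky call with exponential backoff. The attempt count and delays are arbitrary choices of mine, not anything the HN API documents:

```python
import time

def with_retries(fn, max_attempts=3, base_delay=0.5):
    """Call fn(), retrying on failure with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; propagate the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage: wrap any flaky request, e.g.
# item = with_retries(lambda: session.get(url, timeout=10).json())
```

And that's one concern out of four or five — you'd still need the connection pooling, rate-limit backoff, and response normalization on top.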

I wrote that boilerplate three times before deciding there had to be a better way.

What I actually use now

I switched to using a pre-built scraper that handles all the ugly parts: HN Top Stories on Apify.

The difference is night and day. Instead of managing 50+ HTTP calls, I get structured JSON back with one API call:

from apify_client import ApifyClient

client = ApifyClient("your-api-token")

# One call starts the actor run and waits for it to finish
run = client.actor("cryptosignals/hn-top-stories").call(
    run_input={
        "maxItems": 100,        # cap the number of stories returned
        "minScore": 10,         # drop anything below 10 points
        "includeComments": True
    }
)

# The run's dataset holds the structured results
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

That gives me up to 500 stories, pre-filtered by score, with full comment trees already extracted. The response is consistent — every item has the same shape.

Why this matters for data projects

If you're building a trend detector, a sentiment analyzer, or even a simple dashboard, the data collection step shouldn't be the hard part. But with the raw HN API, it is.

Here's what I was spending time on before switching:

  • Rate limit handling: HN doesn't publish rate limits, so you discover them empirically (and differently each time)
  • Comment tree traversal: Comments are stored as nested IDs. To get a full thread, you need recursive fetching. For a post with 300 comments, that's 300+ additional API calls
  • Data normalization: The API returns null for deleted items, different fields for different item types, and timestamps in Unix epoch
  • Caching and deduplication: If you're polling every hour, you need to diff against previous results
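As an example of the normalization bullet, here's roughly the kind of helper I mean — it coerces raw HN items (which can be null for deleted items and carry different fields per type) into one consistent shape. The target shape is my own choice, not a standard:

```python
from datetime import datetime, timezone

def normalize_item(raw):
    """Coerce a raw HN item dict into a consistent shape, or None if unusable."""
    if raw is None or raw.get("deleted") or raw.get("dead"):
        return None
    return {
        "id": raw["id"],
        "type": raw.get("type", "story"),
        "title": raw.get("title", ""),    # comments have no title
        "score": raw.get("score", 0),     # comments and jobs may lack a score
        "by": raw.get("by", "[deleted]"),
        # Unix epoch -> timezone-aware ISO 8601 string
        "time": datetime.fromtimestamp(
            raw.get("time", 0), tz=timezone.utc
        ).isoformat(),
        "kids": raw.get("kids", []),      # child comment IDs, for tree traversal
    }
```

Small on its own, but multiply it by every item type and edge case and it adds up.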

All of that is now someone else's problem. The Apify actor handles it, and I get clean data out.

The code I actually ship now

My current HN monitoring script is 40 lines instead of 200:

from apify_client import ApifyClient
import json

client = ApifyClient("your-api-token")

def get_trending(min_score=20, keywords=None):
    run_input = {"maxItems": 200, "minScore": min_score}
    if keywords:
        run_input["keyword"] = keywords[0]

    run = client.actor("cryptosignals/hn-top-stories").call(
        run_input=run_input
    )

    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
    return sorted(items, key=lambda x: x.get("score", 0), reverse=True)

# Get AI-related stories scoring above 50
trending = get_trending(min_score=50, keywords=["AI"])
for story in trending[:10]:
    print(f"{story['score']} | {story['title']}")

No connection pool management. No retry logic. No response parsing. Just data.

When to still use the raw API

The direct API still makes sense for:

  • Fetching a single item by ID (it's instant)
  • Real-time streaming via the /v0/updates endpoint
  • Building something where you need sub-second freshness
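For that single-item case, the raw API really is trivial — the documented item endpoint is one GET away:

```python
import requests

HN_BASE = 'https://hacker-news.firebaseio.com/v0'

def item_url(item_id):
    return f'{HN_BASE}/item/{item_id}.json'

def get_item(item_id):
    """Fetch one HN item (story, comment, poll, job) by ID."""
    return requests.get(item_url(item_id), timeout=10).json()
```

No wrapper earns its keep there.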

For everything else — batch collection, filtering, comment extraction, historical data — I'd reach for a dedicated tool. The HN API is a data source, not a data pipeline. Treating it like a pipeline is where most projects get stuck.


I've been using the HN Top Stories actor for my own projects. If you're doing anything with HN data at scale, it saves a surprising amount of plumbing code.
