Hacker News Top Stories API vs Scraping: Complete Guide
Hacker News is one of the most valuable data sources for tech trends, startup analysis, and developer sentiment. But should you use the official API, the Algolia search API, or scrape the site directly?
This guide covers all three approaches with working Python code.
The Official HN API
Hacker News provides a simple Firebase-based API at https://hacker-news.firebaseio.com/v0/. It is free, requires no authentication, and has no strict rate limits.
Fetching Top Stories
```python
import requests
import time

def get_top_stories(limit=30):
    """Fetch top stories from the official HN API."""
    top_ids = requests.get(
        "https://hacker-news.firebaseio.com/v0/topstories.json"
    ).json()[:limit]

    stories = []
    for story_id in top_ids:
        item = requests.get(
            f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
        ).json()
        stories.append({
            "title": item.get("title"),
            "url": item.get("url"),
            "score": item.get("score"),
            "by": item.get("by"),
            "time": item.get("time"),
            "descendants": item.get("descendants", 0),
        })
        time.sleep(0.1)  # Be respectful
    return stories

for story in get_top_stories(10):
    print(f"{story['score']} pts | {story['title']}")
```
API Endpoints Summary
| Endpoint | Returns |
|---|---|
| `/v0/topstories.json` | Top 500 story IDs |
| `/v0/newstories.json` | Newest 500 story IDs |
| `/v0/beststories.json` | Best 500 story IDs |
| `/v0/item/{id}.json` | Single item details |
| `/v0/user/{id}.json` | User profile |
The main limitation: you must fetch each item individually. Getting 500 stories means 501 API calls.
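One way to soften that N+1 cost is to fetch items concurrently. This is a minimal sketch (the function name `get_top_stories_fast` and the worker count are illustrative, not part of the API); the endpoint tolerates parallel requests, but keep the pool modest:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_item(story_id):
    """Fetch a single item from the official API."""
    return requests.get(f"{BASE}/item/{story_id}.json", timeout=10).json()

def get_top_stories_fast(limit=30, workers=10):
    """Fetch top stories concurrently instead of one at a time."""
    top_ids = requests.get(f"{BASE}/topstories.json", timeout=10).json()[:limit]
    # pool.map preserves the original ranking order of the IDs
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_item, top_ids))
```

With 10 workers, 30 stories resolve in roughly the time of 3 sequential round trips.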
The Algolia Search API
For searching and filtering, the Algolia-powered HN Search API at https://hn.algolia.com/api/v1/ is far more powerful.
```python
import requests
from datetime import datetime, timedelta

def search_hn(query, tags="story", time_range_hours=24):
    """Search HN stories via the Algolia API."""
    timestamp = int((datetime.now() - timedelta(hours=time_range_hours)).timestamp())
    resp = requests.get("https://hn.algolia.com/api/v1/search", params={
        "query": query,
        "tags": tags,
        "numericFilters": f"created_at_i>{timestamp}",
        "hitsPerPage": 20,
    }).json()

    for hit in resp["hits"]:
        print(f"{hit['points']} pts | {hit['title']}")
        print(f"  Comments: {hit['num_comments']} | {hit['url']}")
    return resp["hits"]

# Find AI stories from the last 24 hours
search_hn("artificial intelligence")

# Find Python stories from the last week
search_hn("python", time_range_hours=168)
```
The Algolia API supports full-text search, date filtering, sorting by points or date, and pagination — things the official API cannot do.
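For newest-first results with pagination, Algolia also exposes a `search_by_date` endpoint. A minimal sketch (the helper name `hn_newest` is mine; `hits` and `nbPages` are fields of the Algolia response):

```python
import requests

def hn_newest(tags="story", pages=2, per_page=50):
    """Page through newest-first results via the search_by_date endpoint."""
    hits = []
    for page in range(pages):
        resp = requests.get(
            "https://hn.algolia.com/api/v1/search_by_date",
            params={"tags": tags, "hitsPerPage": per_page, "page": page},
            timeout=10,
        ).json()
        hits.extend(resp["hits"])
        if page >= resp["nbPages"] - 1:  # stop once the last page is reached
            break
    return hits
```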
When to Scrape Instead
Sometimes the APIs are not enough. You might need:
- Comment threads with full nesting (the official API only gives each item a flat `kids` ID list, so you must recurse)
- Real-time monitoring of front page rankings over time
- Historical data beyond what Algolia indexes
- Structured bulk exports without making thousands of individual API calls
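To see why the first point bites, here is a sketch of rebuilding one nested thread from the official API (the function name and `max_depth` cutoff are mine). Every comment is its own HTTP request, so a 300-comment thread costs 300+ calls:

```python
import requests

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_comment_tree(item_id, depth=0, max_depth=3):
    """Recursively rebuild a comment thread, one request per comment."""
    item = requests.get(f"{BASE}/item/{item_id}.json", timeout=10).json() or {}
    node = {"by": item.get("by"), "text": item.get("text"), "children": []}
    if depth < max_depth:
        # "kids" is the flat list of direct child comment IDs
        for kid in item.get("kids", []):
            node["children"].append(fetch_comment_tree(kid, depth + 1, max_depth))
    return node
```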
For these cases, scraping with proper proxy rotation through ScraperAPI ensures reliable access without getting blocked.
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote

SCRAPERAPI_KEY = "your_key_here"

def scrape_hn_front_page():
    """Scrape the HN front page via the ScraperAPI proxy."""
    # URL-encode the target so its own query string survives the proxy hop
    target = quote("https://news.ycombinator.com", safe="")
    url = f"http://api.scraperapi.com?api_key={SCRAPERAPI_KEY}&url={target}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    stories = []
    for row in soup.select(".athing"):
        title_el = row.select_one(".titleline > a")
        subtext = row.find_next_sibling("tr").select_one(".subtext")
        score_el = subtext.select_one(".score") if subtext else None
        stories.append({
            "title": title_el.text if title_el else "",
            "url": title_el["href"] if title_el else "",
            "score": int(score_el.text.split()[0]) if score_el else 0,
        })
    return stories
```
The Easy Route: Pre-Built Actors
If you need production-grade HN data collection without building scrapers from scratch, HN Top Stories Actor on Apify handles pagination, rate limiting, and data export automatically. It outputs structured JSON ready for analysis.
Choosing the Right Approach
| Need | Best Approach |
|---|---|
| Top/new/best stories | Official API |
| Search with filters | Algolia API |
| Comment thread analysis | Scraping |
| Historical tracking | Scraping + database |
| Bulk data export | Pre-built actor |
| Real-time monitoring | Scraping with proxies |
Combining Approaches
The most robust HN data pipeline combines all three:
```python
def comprehensive_hn_pipeline(topic):
    # 1. Search for relevant stories via Algolia
    stories = search_hn(topic, time_range_hours=72)

    # 2. Enrich with official API data
    for story in stories:
        detail = requests.get(
            f"https://hacker-news.firebaseio.com/v0/item/{story['objectID']}.json"
        ).json()
        story["kids"] = detail.get("kids", [])
        story["type"] = detail.get("type")

    # 3. Scrape comment threads for sentiment
    #    (scrape_comments uses ScraperAPI for reliable access)
    for story in stories[:5]:  # Top 5 only
        story["comment_text"] = scrape_comments(story["objectID"])

    return stories
```
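The `scrape_comments` helper called in step 3 is not defined above. One possible stdlib-only sketch, reusing the ScraperAPI pattern from earlier (`commtext` is the CSS class HN puts on comment bodies; `SCRAPERAPI_KEY` and the helper's shape are assumptions, and the naive capture below would break on spans nested inside a comment):

```python
import requests
from html.parser import HTMLParser
from urllib.parse import quote

SCRAPERAPI_KEY = "your_key_here"

class CommentExtractor(HTMLParser):
    """Collect the text inside HN's .commtext elements using only the stdlib."""
    def __init__(self):
        super().__init__()
        self.capturing = False
        self.open_tag = None
        self.buffer = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "") or ""
        if not self.capturing and "commtext" in classes.split():
            self.capturing = True
            self.open_tag = tag
            self.buffer = []

    def handle_endtag(self, tag):
        # Close out the comment when its own wrapper tag closes
        if self.capturing and tag == self.open_tag:
            self.capturing = False
            self.comments.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.capturing:
            self.buffer.append(data)

def scrape_comments(item_id):
    """Fetch an HN item page through ScraperAPI and extract comment text."""
    target = quote(f"https://news.ycombinator.com/item?id={item_id}", safe="")
    url = f"http://api.scraperapi.com?api_key={SCRAPERAPI_KEY}&url={target}"
    parser = CommentExtractor()
    parser.feed(requests.get(url, timeout=60).text)
    return parser.comments
```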
Key Takeaways
- Start with the APIs — they are free and reliable for most use cases
- Use Algolia for search — the official API has no search capability
- Scrape when APIs fall short — comment analysis, historical tracking, bulk exports
- Use proxy rotation via ScraperAPI when scraping to avoid IP blocks
- Consider pre-built tools like the HN Top Stories Actor for production pipelines
Hacker News data is incredibly valuable for market research, trend analysis, and understanding developer sentiment. The right extraction approach depends on your specific needs — but now you have working code for all three methods.
Happy scraping! Follow me for more web scraping tutorials and data extraction guides.