agenthustler

Hacker News Top Stories API vs Scraping: Complete Guide

Hacker News is one of the most valuable data sources for tech trends, startup analysis, and developer sentiment. But should you use the official API, the Algolia search API, or scrape the site directly?

This guide covers all three approaches with working Python code.

The Official HN API

Hacker News provides a simple Firebase-based API at https://hacker-news.firebaseio.com/v0/. It is free, requires no authentication, and has no strict rate limits.

Fetching Top Stories

import requests
import time

def get_top_stories(limit=30):
    """Fetch top stories from HN official API."""
    top_ids = requests.get(
        "https://hacker-news.firebaseio.com/v0/topstories.json"
    ).json()[:limit]

    stories = []
    for story_id in top_ids:
        item = requests.get(
            f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
        ).json()
        stories.append({
            "title": item.get("title"),
            "url": item.get("url"),
            "score": item.get("score"),
            "by": item.get("by"),
            "time": item.get("time"),
            "descendants": item.get("descendants", 0)
        })
        time.sleep(0.1)  # Be respectful
    return stories

for story in get_top_stories(10):
    print(f"{story['score']} pts | {story['title']}")

API Endpoints Summary

Endpoint                 Returns
/v0/topstories.json      Top 500 story IDs
/v0/newstories.json      Newest 500 story IDs
/v0/beststories.json     Best 500 story IDs
/v0/item/{id}.json       Single item details
/v0/user/{id}.json       User profile

The main limitation: you must fetch each item individually. Getting 500 stories means 501 API calls.
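One way to soften that cost is to fetch items concurrently. A minimal sketch using a thread pool; `fetch_fn` is whatever function maps an ID to its JSON (in practice, a small `requests.get` wrapper):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_items(item_ids, fetch_fn, max_workers=10):
    """Map fetch_fn over item_ids concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_fn, item_ids))

# In practice fetch_fn would wrap requests, e.g.:
# fetch_fn = lambda i: requests.get(
#     f"https://hacker-news.firebaseio.com/v0/item/{i}.json").json()
```

With ten workers, pulling the full top-500 list drops from minutes to a few seconds, while staying well within what the Firebase endpoint tolerates.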

The Algolia Search API

For searching and filtering, the Algolia-powered HN Search API at https://hn.algolia.com/api/v1/ is far more powerful.

import requests
from datetime import datetime, timedelta

def search_hn(query, tags="story", time_range_hours=24):
    """Search HN stories via Algolia API."""
    timestamp = int((datetime.now() - timedelta(hours=time_range_hours)).timestamp())

    resp = requests.get("https://hn.algolia.com/api/v1/search", params={
        "query": query,
        "tags": tags,
        "numericFilters": f"created_at_i>{timestamp}",
        "hitsPerPage": 20
    }).json()

    for hit in resp["hits"]:
        print(f"{hit['points']} pts | {hit['title']}")
        print(f"  Comments: {hit['num_comments']} | {hit['url']}")
    return resp["hits"]

# Find AI stories from last 24 hours
search_hn("artificial intelligence")

# Find Python stories from last week
search_hn("python", time_range_hours=168)

The Algolia API supports full-text search, date filtering, sorting by points or date, and pagination — things the official API cannot do.
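To pull more than one page of results, follow the `page` and `nbPages` fields that Algolia returns with each response. A minimal sketch — the `get` function is injectable so it can be tested offline; by default it would hit the `search_by_date` endpoint, which sorts newest-first:

```python
def paginate_hn_search(query, get=None, hits_per_page=100):
    """Yield every hit for a query, following Algolia's page/nbPages fields."""
    if get is None:
        import requests
        get = lambda params: requests.get(
            "https://hn.algolia.com/api/v1/search_by_date", params=params
        ).json()
    page = 0
    while True:
        resp = get({"query": query, "tags": "story",
                    "page": page, "hitsPerPage": hits_per_page})
        yield from resp.get("hits", [])
        page += 1
        if page >= resp.get("nbPages", 0):
            break
```

Because it is a generator, you can stop early (`itertools.islice`) without requesting pages you will never read.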

When to Scrape Instead

Sometimes the APIs are not enough. You might need:

  • Comment threads with full nesting (the API returns flat IDs)
  • Real-time monitoring of front page rankings over time
  • Historical data beyond what Algolia indexes
  • Structured bulk exports without making thousands of individual API calls
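On the first point: the official API does expose nesting through each item's `kids` field, but reconstructing a thread means one request per comment. A sketch of that walk (the fetcher is injectable so it can be tested without the network):

```python
def build_comment_tree(item_id, fetch_item, depth=0, max_depth=3):
    """Resolve an item's flat `kids` IDs into a nested comment tree."""
    item = fetch_item(item_id)
    node = {"id": item_id, "text": item.get("text", ""), "children": []}
    if depth < max_depth:
        for kid in item.get("kids", []):
            node["children"].append(
                build_comment_tree(kid, fetch_item, depth + 1, max_depth))
    return node

# In practice fetch_item would call the official API:
# fetch_item = lambda i: requests.get(
#     f"https://hacker-news.firebaseio.com/v0/item/{i}.json").json()
```

For a large thread this explodes into hundreds of requests, which is exactly why scraping the rendered page can be the more practical route.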

For these cases, scraping with proper proxy rotation through ScraperAPI ensures reliable access without getting blocked.

import requests
from bs4 import BeautifulSoup

SCRAPERAPI_KEY = "your_key_here"

def scrape_hn_front_page():
    """Scrape HN front page via ScraperAPI proxy."""
    resp = requests.get("https://api.scraperapi.com/", params={
        "api_key": SCRAPERAPI_KEY,
        "url": "https://news.ycombinator.com"
    })
    soup = BeautifulSoup(resp.text, "html.parser")

    stories = []
    for row in soup.select(".athing"):
        title_el = row.select_one(".titleline > a")
        subtext = row.find_next_sibling("tr").select_one(".subtext")
        score_el = subtext.select_one(".score") if subtext else None

        stories.append({
            "title": title_el.text if title_el else "",
            "url": title_el["href"] if title_el else "",
            "score": int(score_el.text.split()[0]) if score_el else 0
        })
    return stories

The Easy Route: Pre-Built Actors

If you need production-grade HN data collection without building scrapers from scratch, HN Top Stories Actor on Apify handles pagination, rate limiting, and data export automatically. It outputs structured JSON ready for analysis.

Choosing the Right Approach

Need                     Best Approach
Top/new/best stories     Official API
Search with filters      Algolia API
Comment thread analysis  Scraping
Historical tracking      Scraping + database
Bulk data export         Pre-built actor
Real-time monitoring     Scraping with proxies

Combining Approaches

The most robust HN data pipeline combines all three:

def comprehensive_hn_pipeline(topic):
    # 1. Search for relevant stories via Algolia
    stories = search_hn(topic, time_range_hours=72)

    # 2. Enrich with official API data
    for story in stories:
        detail = requests.get(
            f"https://hacker-news.firebaseio.com/v0/item/{story['objectID']}.json"
        ).json()
        story["kids"] = detail.get("kids", [])
        story["type"] = detail.get("type")

    # 3. Scrape comment threads for sentiment
    # (scrape_comments: a ScraperAPI-backed helper, not shown in this post)
    for story in stories[:5]:  # Top 5 only
        comments = scrape_comments(story["objectID"])
        story["comment_text"] = comments

    return stories
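Step 3's sentiment pass is left abstract above. For illustration only, here is a naive keyword counter — the word lists are made up for this example, and a real pipeline would use a proper sentiment model or library:

```python
# Made-up word lists for illustration; swap in a real model for production.
POSITIVE = {"great", "love", "impressive", "useful"}
NEGATIVE = {"broken", "hate", "slow", "useless"}

def naive_sentiment(text):
    """Score text as (positive keyword count) - (negative keyword count)."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
```

Even this crude score is enough to rank a batch of threads by tone before reading them manually.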

Key Takeaways

  1. Start with the APIs — they are free and reliable for most use cases
  2. Use Algolia for search — the official API has no search capability
  3. Scrape when APIs fall short — comment analysis, historical tracking, bulk exports
  4. Use proxy rotation via ScraperAPI when scraping to avoid IP blocks
  5. Consider pre-built tools like the HN Top Stories Actor for production pipelines

Hacker News data is incredibly valuable for market research, trend analysis, and understanding developer sentiment. The right extraction approach depends on your specific needs — but now you have working code for all three methods.


Happy scraping! Follow me for more web scraping tutorials and data extraction guides.
