Hacker News Top Stories API vs Scraping: Complete Guide
Hacker News is one of the most valuable data sources for tech trends, startup analysis, and developer sentiment. But should you use the official API, the Algolia search API, or scrape the site directly?
This guide covers all three approaches with working Python code.
The Official HN API
Hacker News provides a simple Firebase-based API at https://hacker-news.firebaseio.com/v0/. It is free, requires no authentication, and has no strict rate limits.
Fetching Top Stories
```python
import requests
import time

def get_top_stories(limit=30):
    """Fetch top stories from the official HN API."""
    top_ids = requests.get(
        "https://hacker-news.firebaseio.com/v0/topstories.json"
    ).json()[:limit]

    stories = []
    for story_id in top_ids:
        item = requests.get(
            f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
        ).json()
        stories.append({
            "title": item.get("title"),
            "url": item.get("url"),
            "score": item.get("score"),
            "by": item.get("by"),
            "time": item.get("time"),
            "descendants": item.get("descendants", 0),
        })
        time.sleep(0.1)  # Be respectful
    return stories

for story in get_top_stories(10):
    print(f"{story['score']} pts | {story['title']}")
```
API Endpoints Summary
| Endpoint | Returns |
|---|---|
| `/v0/topstories.json` | Top 500 story IDs |
| `/v0/newstories.json` | Newest 500 story IDs |
| `/v0/beststories.json` | Best 500 story IDs |
| `/v0/item/{id}.json` | Single item details |
| `/v0/user/{id}.json` | User profile |
The main limitation: you must fetch each item individually. Getting 500 stories means 501 API calls.
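One way to soften that N+1 cost is to fetch items concurrently. This is a minimal sketch (the function name `get_top_stories_fast` and the worker count are illustrative, not part of the API); the endpoint tolerates parallel requests, but keep the pool modest:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_item(story_id):
    """Fetch a single item from the official API."""
    return requests.get(f"{BASE}/item/{story_id}.json", timeout=10).json()

def get_top_stories_fast(limit=30, workers=10):
    """Fetch top stories concurrently instead of one at a time."""
    top_ids = requests.get(f"{BASE}/topstories.json", timeout=10).json()[:limit]
    # pool.map preserves the original ranking order of the IDs
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_item, top_ids))
```

With 10 workers, 30 stories resolve in roughly the time of 3 sequential round trips.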
The Algolia Search API
For searching and filtering, the Algolia-powered HN Search API at https://hn.algolia.com/api/v1/ is far more powerful.
```python
import requests
from datetime import datetime, timedelta

def search_hn(query, tags="story", time_range_hours=24):
    """Search HN stories via the Algolia API."""
    timestamp = int((datetime.now() - timedelta(hours=time_range_hours)).timestamp())
    resp = requests.get("https://hn.algolia.com/api/v1/search", params={
        "query": query,
        "tags": tags,
        "numericFilters": f"created_at_i>{timestamp}",
        "hitsPerPage": 20,
    }).json()

    for hit in resp["hits"]:
        print(f"{hit['points']} pts | {hit['title']}")
        print(f"  Comments: {hit['num_comments']} | {hit['url']}")
    return resp["hits"]

# Find AI stories from the last 24 hours
search_hn("artificial intelligence")

# Find Python stories from the last week
search_hn("python", time_range_hours=168)
```
The Algolia API supports full-text search, date filtering, sorting by points or date, and pagination — things the official API cannot do.
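For newest-first results with pagination, Algolia also exposes a `search_by_date` endpoint. A minimal sketch (the helper name `hn_newest` is mine; `hits` and `nbPages` are fields of the Algolia response):

```python
import requests

def hn_newest(tags="story", pages=2, per_page=50):
    """Page through newest-first results via the search_by_date endpoint."""
    hits = []
    for page in range(pages):
        resp = requests.get(
            "https://hn.algolia.com/api/v1/search_by_date",
            params={"tags": tags, "hitsPerPage": per_page, "page": page},
            timeout=10,
        ).json()
        hits.extend(resp["hits"])
        if page >= resp["nbPages"] - 1:  # stop once the last page is reached
            break
    return hits
```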
When to Scrape Instead
Sometimes the APIs are not enough. You might need:
- Comment threads with full nesting (the official API only gives each item a flat `kids` ID list, so you must recurse)
- Real-time monitoring of front page rankings over time
- Historical data beyond what Algolia indexes
- Structured bulk exports without making thousands of individual API calls
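To see why the first point bites, here is a sketch of rebuilding one nested thread from the official API (the function name and `max_depth` cutoff are mine). Every comment is its own HTTP request, so a 300-comment thread costs 300+ calls:

```python
import requests

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_comment_tree(item_id, depth=0, max_depth=3):
    """Recursively rebuild a comment thread, one request per comment."""
    item = requests.get(f"{BASE}/item/{item_id}.json", timeout=10).json() or {}
    node = {"by": item.get("by"), "text": item.get("text"), "children": []}
    if depth < max_depth:
        # "kids" is the flat list of direct child comment IDs
        for kid in item.get("kids", []):
            node["children"].append(fetch_comment_tree(kid, depth + 1, max_depth))
    return node
```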
For these cases, scraping with proper proxy rotation through ScraperAPI ensures reliable access without getting blocked.
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote

SCRAPERAPI_KEY = "your_key_here"

def scrape_hn_front_page():
    """Scrape the HN front page via the ScraperAPI proxy."""
    # URL-encode the target so its own query string survives the proxy hop
    target = quote("https://news.ycombinator.com", safe="")
    url = f"http://api.scraperapi.com?api_key={SCRAPERAPI_KEY}&url={target}"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    stories = []
    for row in soup.select(".athing"):
        title_el = row.select_one(".titleline > a")
        subtext = row.find_next_sibling("tr").select_one(".subtext")
        score_el = subtext.select_one(".score") if subtext else None
        stories.append({
            "title": title_el.text if title_el else "",
            "url": title_el["href"] if title_el else "",
            "score": int(score_el.text.split()[0]) if score_el else 0,
        })
    return stories
```
The Easy Route: Pre-Built Actors
If you need production-grade HN data collection without building scrapers from scratch, HN Top Stories Actor on Apify handles pagination, rate limiting, and data export automatically. It outputs structured JSON ready for analysis.
Choosing the Right Approach
| Need | Best Approach |
|---|---|
| Top/new/best stories | Official API |
| Search with filters | Algolia API |
| Comment thread analysis | Scraping |
| Historical tracking | Scraping + database |
| Bulk data export | Pre-built actor |
| Real-time monitoring | Scraping with proxies |
Combining Approaches
The most robust HN data pipeline combines all three:
```python
def comprehensive_hn_pipeline(topic):
    # 1. Search for relevant stories via Algolia
    stories = search_hn(topic, time_range_hours=72)

    # 2. Enrich with official API data
    for story in stories:
        detail = requests.get(
            f"https://hacker-news.firebaseio.com/v0/item/{story['objectID']}.json"
        ).json()
        story["kids"] = detail.get("kids", [])
        story["type"] = detail.get("type")

    # 3. Scrape comment threads for sentiment
    #    (scrape_comments uses ScraperAPI for reliable access)
    for story in stories[:5]:  # Top 5 only
        story["comment_text"] = scrape_comments(story["objectID"])

    return stories
```
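The `scrape_comments` helper called in step 3 is not defined above. One possible stdlib-only sketch, reusing the ScraperAPI pattern from earlier (`commtext` is the CSS class HN puts on comment bodies; `SCRAPERAPI_KEY` and the helper's shape are assumptions, and the naive capture below would break on spans nested inside a comment):

```python
import requests
from html.parser import HTMLParser
from urllib.parse import quote

SCRAPERAPI_KEY = "your_key_here"

class CommentExtractor(HTMLParser):
    """Collect the text inside HN's .commtext elements using only the stdlib."""
    def __init__(self):
        super().__init__()
        self.capturing = False
        self.open_tag = None
        self.buffer = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "") or ""
        if not self.capturing and "commtext" in classes.split():
            self.capturing = True
            self.open_tag = tag
            self.buffer = []

    def handle_endtag(self, tag):
        # Close out the comment when its own wrapper tag closes
        if self.capturing and tag == self.open_tag:
            self.capturing = False
            self.comments.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.capturing:
            self.buffer.append(data)

def scrape_comments(item_id):
    """Fetch an HN item page through ScraperAPI and extract comment text."""
    target = quote(f"https://news.ycombinator.com/item?id={item_id}", safe="")
    url = f"http://api.scraperapi.com?api_key={SCRAPERAPI_KEY}&url={target}"
    parser = CommentExtractor()
    parser.feed(requests.get(url, timeout=60).text)
    return parser.comments
```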
Key Takeaways
- Start with the APIs — they are free and reliable for most use cases
- Use Algolia for search — the official API has no search capability
- Scrape when APIs fall short — comment analysis, historical tracking, bulk exports
- Use proxy rotation via ScraperAPI when scraping to avoid IP blocks
- Consider pre-built tools like the HN Top Stories Actor for production pipelines
Hacker News data is incredibly valuable for market research, trend analysis, and understanding developer sentiment. The right extraction approach depends on your specific needs — but now you have working code for all three methods.
Happy scraping! Follow me for more web scraping tutorials and data extraction guides.