Hacker News (HN) is the tech industry's watercooler — where startup founders, developers, and VCs share and discuss the latest in technology. Extracting HN data is valuable for trend analysis, competitive intelligence, and understanding what the developer community cares about.
In this guide, I'll compare using the official HN API versus web scraping, with practical Python examples for both.
## Why Extract Hacker News Data?
- Tech trend analysis: Spot emerging technologies and frameworks
- Startup intelligence: Track which startups get attention
- Content strategy: Find topics that resonate with developers
- Hiring signals: Monitor "Who is Hiring?" threads
- Sentiment tracking: Gauge developer opinion on tools and platforms
## Method 1: The Official Hacker News API
HN provides a free Firebase-based API — no authentication required:
```python
import requests
import time
from concurrent.futures import ThreadPoolExecutor

HN_API = "https://hacker-news.firebaseio.com/v0"

def get_top_stories(limit=30):
    """Fetch top stories from the HN API."""
    response = requests.get(f"{HN_API}/topstories.json")
    story_ids = response.json()[:limit]

    stories = []
    for story_id in story_ids:
        item = requests.get(f"{HN_API}/item/{story_id}.json").json()
        if item:
            stories.append({
                "id": item.get("id"),
                "title": item.get("title", ""),
                "url": item.get("url", ""),
                "score": item.get("score", 0),
                "by": item.get("by", ""),
                "time": item.get("time", 0),
                "descendants": item.get("descendants", 0),
                "hn_url": f"https://news.ycombinator.com/item?id={item['id']}",
            })
    return stories
```
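Each item reports its creation time as Unix epoch seconds in the `time` field. A small helper (an illustration, not part of the API) converts it to an aware UTC datetime:

```python
from datetime import datetime, timezone

def item_datetime(item):
    """Convert an HN item's Unix `time` field to an aware UTC datetime."""
    return datetime.fromtimestamp(item.get("time", 0), tz=timezone.utc)
```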
```python
def get_item_with_comments(item_id, max_depth=3):
    """Fetch an item and its comment tree."""
    item = requests.get(f"{HN_API}/item/{item_id}.json").json()
    if not item:
        return None

    result = {
        "id": item["id"],
        "title": item.get("title", ""),
        "text": item.get("text", ""),
        "by": item.get("by", ""),
        "score": item.get("score", 0),
        "comments": [],
    }

    if max_depth > 0 and "kids" in item:
        for kid_id in item["kids"][:20]:  # Limit comments per level
            comment = get_item_with_comments(kid_id, max_depth - 1)
            if comment:
                result["comments"].append(comment)
            time.sleep(0.1)  # Be polite to the API

    return result
```
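The nested dict that comes back can be awkward for text analysis. A small recursive walk (a sketch, not part of the API) flattens every comment in the tree into one list:

```python
def flatten_comments(item):
    """Yield every comment in a nested comment tree, depth-first."""
    for comment in item.get("comments", []):
        yield {"by": comment.get("by", ""), "text": comment.get("text", "")}
        yield from flatten_comments(comment)
```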
## Faster API Fetching with Threading
The HN API requires one request per item, which is slow. Speed it up with threading:
```python
def fetch_stories_parallel(story_ids, max_workers=10):
    """Fetch multiple stories in parallel."""
    def fetch_one(story_id):
        try:
            resp = requests.get(f"{HN_API}/item/{story_id}.json", timeout=10)
            return resp.json()
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(fetch_one, story_ids))

    return [r for r in results if r is not None]
```
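The same map-over-a-pool pattern works with any fetcher, which also makes it easy to test offline. Here `fake_fetch` is a stand-in for the HTTP call:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_parallel(ids, fetch_one, max_workers=10):
    """Run fetch_one over ids concurrently, dropping failed (None) results."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(fetch_one, ids))
    return [r for r in results if r is not None]

def fake_fetch(item_id):
    """Stand-in for the real API call: odd ids simulate failed requests."""
    return {"id": item_id} if item_id % 2 == 0 else None
```

Note that `executor.map` preserves input order, so surviving results still line up with the order of the requested ids.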
## Method 2: Web Scraping
For data not available through the API (like the rendered front page with rankings):
```python
from bs4 import BeautifulSoup

def scrape_front_page():
    """Scrape the HN front page with rank positions."""
    response = requests.get("https://news.ycombinator.com/")
    soup = BeautifulSoup(response.text, "html.parser")

    stories = []
    for row in soup.select("tr.athing"):
        rank_el = row.select_one(".rank")
        title_el = row.select_one(".titleline > a")
        site_el = row.select_one(".sitestr")

        # The subtext row (score, author, comment count) follows each story row
        subtext = row.find_next_sibling("tr")
        score_el = subtext.select_one(".score") if subtext else None
        author_el = subtext.select_one(".hnuser") if subtext else None
        links = subtext.select("a") if subtext else []
        comments_el = links[-1] if links else None

        stories.append({
            "rank": int(rank_el.text.strip(".")) if rank_el else 0,
            "title": title_el.text if title_el else "",
            "url": title_el["href"] if title_el else "",
            "site": site_el.text if site_el else "",
            "score": int(score_el.text.split()[0]) if score_el else 0,
            "author": author_el.text if author_el else "",
            "comments": comments_el.text if comments_el else "",
            "hn_id": row.get("id", ""),
        })
    return stories
```
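The selectors above can be exercised against a static snippet before pointing them at the live site. A minimal sketch with simplified HN-style markup (the real front page has more wrapper elements, but the class names are the same):

```python
from bs4 import BeautifulSoup

# Simplified, static markup mimicking the HN front-page structure
SAMPLE = """
<table>
  <tr class="athing" id="101">
    <td><span class="rank">1.</span></td>
    <td><span class="titleline"><a href="https://example.com/post">Example story</a>
      <span class="sitestr">example.com</span></span></td>
  </tr>
  <tr><td class="subtext">
    <span class="score">42 points</span> by <a class="hnuser">alice</a>
    <a href="item?id=101">17 comments</a>
  </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
row = soup.select_one("tr.athing")
subtext = row.find_next_sibling("tr")

story = {
    "rank": int(row.select_one(".rank").text.strip(".")),
    "title": row.select_one(".titleline > a").text,
    "score": int(subtext.select_one(".score").text.split()[0]),
    "author": subtext.select_one(".hnuser").text,
    "comments": subtext.select("a")[-1].text,
}
```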
## API vs Web Scraping: When to Use Each
| Feature | API | Web Scraping |
|---|---|---|
| Rate limits | ~30 req/s | Needs proxies at scale |
| Real-time data | Yes | Yes |
| Historical data | Item by ID | Not available |
| Rankings/positions | No | Yes |
| Auth required | No | No |
| Comment trees | Yes (slow) | Complex to parse |
## Monitoring Trends Over Time
```python
import json
from datetime import datetime, timezone

def track_daily_trends(output_file="hn_trends.jsonl"):
    """Capture a daily snapshot of HN top stories."""
    stories = get_top_stories(limit=30)
    snapshot = {
        # datetime.utcnow() is deprecated; use an aware UTC timestamp instead
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stories": stories,
    }
    with open(output_file, "a") as f:
        f.write(json.dumps(snapshot) + "\n")
    print(f"Saved {len(stories)} stories at {snapshot['timestamp']}")
    return stories
```
## Production Solutions
For production HN data extraction, there are pre-built scrapers on Apify:
- Hacker News Scraper — Full scraping with comments, user profiles, and search
- HN Top Stories — Lightweight tracker for monitoring the front page
For scaling your own HN scraper, ScraperAPI handles proxy rotation to avoid rate limiting when making thousands of API calls.
## Analyzing "Who is Hiring?" Threads
```python
import re

def parse_hiring_thread(item_id):
    """Extract job listings from monthly Who is Hiring threads."""
    thread = get_item_with_comments(item_id, max_depth=1)
    if not thread:
        return []

    jobs = []
    for comment in thread.get("comments", []):
        text = comment.get("text", "")
        if not text:
            continue

        # Company name is usually at the start of the first line
        first_line = text.split("<p>")[0] if "<p>" in text else text[:100]

        # Look for common patterns: a "remote" mention and a pipe-delimited location
        remote = bool(re.search(r"\bremote\b", text, re.I))
        location = re.search(r"\|\s*([^|]+?)\s*\|", first_line)

        jobs.append({
            "company": first_line[:80],
            "remote": remote,
            "location": location.group(1) if location else "",
            "text": text[:500],
            "by": comment.get("by", ""),
        })
    return jobs
```
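The parsed listings lend themselves to quick aggregate stats, for instance a remote count and the most common locations (a sketch over the dicts produced above):

```python
from collections import Counter

def summarize_jobs(jobs):
    """Summarize parsed listings: remote count and most common locations."""
    locations = Counter(j["location"] for j in jobs if j.get("location"))
    return {
        "total": len(jobs),
        "remote": sum(1 for j in jobs if j.get("remote")),
        "top_locations": locations.most_common(5),
    }
```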
## Best Practices
- Prefer the API for structured data — it's official and reliable
- Use web scraping only for ranking data or rendered content
- Thread your API calls — HN requires one request per item
- Cache aggressively — stories don't change much after the first hour
- Use ScraperAPI if you're making thousands of requests
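The caching advice can be as little as an in-memory dict keyed by item id. A minimal sketch that wraps any fetch function (`make_cached_fetch` is a hypothetical helper, not an HN API feature):

```python
def make_cached_fetch(fetch_fn):
    """Wrap a fetch function so each item id is requested at most once."""
    cache = {}
    def cached_fetch(item_id):
        if item_id not in cache:
            cache[item_id] = fetch_fn(item_id)
        return cache[item_id]
    return cached_fetch
```

For a long-running monitor you would want expiry rather than caching forever, since scores and comment counts still move during a story's first hours.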
## Conclusion
Hacker News offers both a clean API and scrapable HTML, making it one of the easiest tech community data sources to work with. Whether you use the API, web scraping, or managed solutions like the HN Scraper on Apify, the data is incredibly valuable for tech trend analysis.
Happy hacking!