Why Scrape Hacker News?
Hacker News gets 10M+ visits/month from developers, founders, and investors. Whether you're tracking trends, monitoring mentions of your product, or building a dataset for analysis — HN data is gold.
The good news: HN has an official search API powered by Algolia. Most devs don't know about it. Let me show you how to use it properly.
The Algolia Search API
Base URL: https://hn.algolia.com/api/v1/
Fetching Stories by Type
```python
import requests

# Search stories matching a query
resp = requests.get("https://hn.algolia.com/api/v1/search", params={
    "query": "LLM",
    "tags": "story",
    "hitsPerPage": 20,
})
stories = resp.json()["hits"]
for s in stories:
    print(f"{s['points']} pts - {s['title']} ({s['url']})")
```
The `tags` parameter controls what you get:
| Tag | What it returns |
|---|---|
| `story` | All stories |
| `ask_hn` | Ask HN posts |
| `show_hn` | Show HN posts |
| `comment` | Comments only |
| `front_page` | Currently on front page |
Date Filtering with numericFilters
This is the killer feature most people miss. You can filter by Unix timestamp:
```python
import time

import requests

# Stories from the last 24 hours
yesterday = int(time.time()) - 86400
resp = requests.get("https://hn.algolia.com/api/v1/search_by_date", params={
    "tags": "story",
    "numericFilters": f"created_at_i>{yesterday}",
    "hitsPerPage": 50,
})
```
You can combine filters: `created_at_i>X,created_at_i<Y,points>100` gives you highly-voted stories in a date range.
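A tiny builder makes these filter strings less error-prone. This is a sketch — the function name and signature are mine; only the `created_at_i` and `points` field names come from the API:

```python
def range_filter(start_ts, end_ts=None, min_points=None):
    """Build a numericFilters string for a created_at_i window.

    Helper name is mine; created_at_i and points are real Algolia
    HN fields. Conditions are ANDed by joining with commas.
    """
    parts = [f"created_at_i>{start_ts}"]
    if end_ts is not None:
        parts.append(f"created_at_i<{end_ts}")
    if min_points is not None:
        parts.append(f"points>{min_points}")
    return ",".join(parts)
```

Pass the result as the `numericFilters` query parameter.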
Sorting: Relevance vs Date
Two endpoints handle this:
- `/search` — sorted by relevance (default)
- `/search_by_date` — sorted by date (newest first)

For monitoring use cases (tracking mentions, watching trends), `/search_by_date` is what you want.
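For a polling monitor, the usual pattern is to keep a high-water mark of the newest `created_at_i` you've seen and only process hits above it. A minimal sketch of that bookkeeping (the helper is my own; `created_at_i` is the real timestamp field on each hit):

```python
def fresh_hits(hits, last_seen_ts):
    """Filter hits from /search_by_date down to unseen ones.

    Helper is mine. Returns (new_hits, new_high_water_mark) so the
    caller can persist the mark between polling runs.
    """
    new = [h for h in hits if h["created_at_i"] > last_seen_ts]
    high = max((h["created_at_i"] for h in new), default=last_seen_ts)
    return new, high
```

On each run, fetch the first page of `/search_by_date`, call `fresh_hits` with the saved mark, process the new hits, and store the returned mark.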
Fetching Comment Trees
Each story has a nested comment tree. You can fetch it by item ID:
```python
# Get a full item with its comment tree
# (story_id is an HN item ID, e.g. the objectID of a search hit)
item = requests.get(
    f"https://hn.algolia.com/api/v1/items/{story_id}"
).json()

def walk_comments(children, depth=0):
    for c in children:
        # "text" can be null for deleted comments, so fall back to ""
        print("  " * depth + (c.get("text") or "")[:80])
        walk_comments(c.get("children", []), depth + 1)

walk_comments(item.get("children", []))
```
User Profiles
```python
user = requests.get(
    "https://hn.algolia.com/api/v1/users/pg"
).json()
print(f"Karma: {user['karma']}, Account created: {user['created_at']}")
```
Domain Extraction
Want to find all HN submissions from a specific domain? The API supports this natively:
`/search?tags=story&restrictSearchableAttributes=url&query=techcrunch.com`
This is incredibly useful for competitive analysis — see how your competitors' content performs on HN.
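The same request is cleaner with a params dict. A sketch of the query construction (the helper name is mine; the parameter names are the real ones from the URL above):

```python
def domain_search_params(domain, page=0):
    """Params for finding all story submissions from one domain.

    Helper name is mine. restrictSearchableAttributes=url makes the
    query match only the submission URL, not titles or text.
    """
    return {
        "query": domain,
        "tags": "story",
        "restrictSearchableAttributes": "url",
        "page": page,
    }
```

Pass the dict as `params=` to `requests.get("https://hn.algolia.com/api/v1/search", ...)`.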
Rate Limits and Gotchas
The Algolia API is generous but has limits:
- 10,000 requests/hour (IP-based)
- Max 1,000 results per query (pagination tops out at page 50 × 20 hits)
- No bulk export endpoint — you need to paginate through date windows
For anything beyond light usage, you'll hit pagination limits fast. If you need full historical data or continuous monitoring, that's where purpose-built tools come in.
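The standard workaround for the 1,000-result cap is to slice your date range into windows small enough that each window fits under the cap, then query each window separately with `numericFilters`. A minimal windowing sketch (the helper and default step are mine):

```python
def date_windows(start_ts, end_ts, step=86400):
    """Yield (lo, hi) Unix-timestamp windows covering [start_ts, end_ts).

    Helper is mine. Query each window with
    numericFilters=f"created_at_i>{lo},created_at_i<{hi}"; if a window
    still returns ~1,000 hits, shrink `step` and re-split it.
    """
    lo = start_ts
    while lo < end_ts:
        hi = min(lo + step, end_ts)
        yield lo, hi
        lo = hi
```

One day per window is usually safe for a single query term; busy queries (or no query at all) may need hourly windows.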
Scaling Up: Automated HN Scraping
For production use cases — daily trend reports, mention monitoring, or building datasets — I built an Apify actor for Hacker News scraping that handles:
- All story types (top, new, best, Ask HN, Show HN)
- Date range filtering with automatic pagination
- A `sortBy` parameter (relevance, date, points)
- Domain extraction from submitted URLs
- Full comment trees with nested replies
- Built-in proxy rotation and retry logic
It runs on Apify's infrastructure so you don't need to manage rate limits or pagination yourself.
Quick Reference
| Use Case | Endpoint | Key Params |
|---|---|---|
| Search stories | `/search` | `query`, `tags=story` |
| Recent stories | `/search_by_date` | `tags=story`, `numericFilters` |
| Front page | `/search` | `tags=front_page` |
| Comments on story | `/items/{id}` | — |
| User profile | `/users/{username}` | — |
| Domain filter | `/search` | `query=<domain>`, `restrictSearchableAttributes=url` |
HN's API is one of the best-kept secrets in the scraping world. For quick scripts, it's all you need. For production pipelines, pair it with proper infrastructure and you're set.