Hacker News is one of the highest signal-to-noise communities on the internet. Started by Y Combinator in 2007, it's the default gathering place for founders, engineers, investors, and researchers who shape the tech industry. When a new framework gains traction, a startup gets acquired, or a security vulnerability drops — the discussion happens on HN first, often days before mainstream coverage.
That makes HN data genuinely valuable. Recruiters mine "Who's Hiring" threads. VCs track what technical audiences are excited about. Researchers study developer sentiment. Content marketers identify trending topics. And founders validate ideas by watching what the community upvotes.
In this guide, I'll walk through every mode of the Hacker News Scraper on Apify (top stories, newest, best, Ask HN, Show HN, jobs, and full-text search) with Python code for each, plus how to extract comment trees and user profiles at scale.
Why Hacker News Data Is Valuable
HN isn't just another tech forum. The community has specific properties that make its data uniquely useful:
Tech opinion leaders. HN's user base skews heavily toward senior engineers, CTOs, and founders. When a post about a new tool hits 300+ points, that's signal from people who actually build production systems — not casual upvotes.
Early adopter signal. Technologies that trend on HN often go mainstream 6-12 months later. Docker, Kubernetes, and GPT-3 all had their breakout HN moments before the broader developer community caught on.
Unfiltered sentiment. Unlike LinkedIn or Twitter where people perform for their network, HN comments tend to be technically honest. Negative feedback on a product launch here is more informative than five-star reviews elsewhere.
Structured data. Every story has a score, comment count, author, timestamp, and URL. Every comment has parent-child relationships. This makes HN data ideal for quantitative analysis without extensive preprocessing.
How the Hacker News API Works
HN exposes two public APIs — both free, both unauthenticated:
1. Firebase API (Official)
Base URL: https://hacker-news.firebaseio.com/v0/
This is the official API maintained by Y Combinator. Every item (stories, comments, jobs, polls) has an integer ID and lives at a predictable URL:
# Fetch a single item
curl "https://hacker-news.firebaseio.com/v0/item/1.json"
# Get current top story IDs
curl "https://hacker-news.firebaseio.com/v0/topstories.json"
The Firebase API returns arrays of IDs for story lists, then you fetch each item individually. For 500 stories, that's 500+ HTTP requests — which is why batch processing with concurrency matters.
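Here's a minimal sketch of that fan-out pattern against the official Firebase endpoints, using a thread pool to parallelize the per-item fetches (the endpoint URLs are real; the worker count and item limit are arbitrary choices):

```python
import concurrent.futures

import requests

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_item(item_id: int) -> dict:
    # Every item (story, comment, job, poll) lives at /item/<id>.json
    return requests.get(f"{BASE}/item/{item_id}.json", timeout=10).json()

# One request for the ID list, then N concurrent requests for the items
top_ids = requests.get(f"{BASE}/topstories.json", timeout=10).json()[:30]
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    stories = [s for s in pool.map(fetch_item, top_ids) if s]

for story in stories[:5]:
    print(f"[{story.get('score', 0)} pts] {story.get('title')}")
```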
2. Algolia API (Search)
Base URL: https://hn.algolia.com/api/v1/
Algolia indexes all HN content and provides full-text search with filtering, sorting, and pagination:
# Search for posts about "rust programming"
curl "https://hn.algolia.com/api/v1/search?query=rust+programming&tags=story&hitsPerPage=50"
This returns rich results with title, URL, points, comment count, author, and timestamps — all in one response.
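The same query from Python, hitting Algolia directly (`points` and `num_comments` are field names from Algolia's documented response shape):

```python
import requests

resp = requests.get(
    "https://hn.algolia.com/api/v1/search",
    params={"query": "rust programming", "tags": "story", "hitsPerPage": 5},
    timeout=10,
)
hits = resp.json()["hits"]
for hit in hits:
    # Each hit carries the full story metadata in a single response
    print(f"[{hit['points']} pts] {hit['title']} ({hit['num_comments']} comments)")
```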
The Scraper: 7 Modes Explained
The Hacker News Scraper wraps both APIs with production-grade logic: concurrent fetching, automatic pagination, comment tree extraction, and multiple export formats.
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `category` | string | `"top"` | `top`, `new`, `best`, `ask`, `show`, `jobs`, or `search` |
| `searchQuery` | string | `""` | Keyword query (required when `category` is `search`) |
| `maxItems` | integer | `100` | Max stories to return (1-500) |
| `includeComments` | boolean | `false` | Fetch full comment trees |
| `maxCommentsPerStory` | integer | `50` | Max comments per story (1-500) |
| `scrapeType` | string | `"stories"` | `stories`, `users`, or `both` |
Mode: Top Stories
The default mode. Returns the current top stories ranked by HN's algorithm (a combination of points and time decay).
import requests
import time
APIFY_TOKEN = "YOUR_APIFY_API_TOKEN"
ACTOR_ID = "cryptosignals/hackernews-scraper"
# Start a run for top 50 stories
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={"category": "top", "maxItems": 50},
)
run = run_resp.json()["data"]
run_id = run["id"]
dataset_id = run["defaultDatasetId"]
# Wait for completion
while True:
status = requests.get(
f"https://api.apify.com/v2/actor-runs/{run_id}",
headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
).json()["data"]["status"]
if status in ("SUCCEEDED", "FAILED", "ABORTED"):
break
time.sleep(3)
# Fetch results
stories = requests.get(
f"https://api.apify.com/v2/datasets/{dataset_id}/items",
headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
).json()
for s in stories[:5]:
print(f"[{s['score']} pts] {s['title']}")
print(f" By: {s['author']} | Comments: {s['commentCount']}")
print(f" URL: {s.get('url', s['hnUrl'])}")
print()
Output Schema
{
"id": 42345678,
"title": "Show HN: I built an AI-powered code reviewer",
"url": "https://example.com/project",
"text": null,
"author": "techfounder",
"score": 342,
"commentCount": 156,
"createdAt": "2026-03-15T10:30:00.000Z",
"storyType": "show",
"hnUrl": "https://news.ycombinator.com/item?id=42345678"
}
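That "points and time decay" combination is often approximated with the gravity formula Paul Graham has described publicly. A rough model only (the live algorithm also applies penalties and moderation adjustments):

```python
def rank_score(points: int, age_hours: float, gravity: float = 1.8) -> float:
    """Commonly cited approximation of HN front-page ranking:
    (points - 1) / (age_hours + 2) ** gravity
    """
    return (points - 1) / (age_hours + 2) ** gravity

# Time decay dominates: a fresh 50-point story outranks a 12-hour-old
# 300-point one under this model.
print(rank_score(50, 1.0) > rank_score(300, 12.0))  # True
```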
Mode: Newest Stories
Fetch the most recently submitted stories — useful for monitoring new submissions in real time.
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={"category": "new", "maxItems": 100},
)
Use Case: Link Monitoring
Run this every hour to catch when your company, product, or competitor gets submitted to HN. Combined with a webhook, you can get Slack notifications within minutes of a submission.
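A sketch of the matching half of that pipeline; `SLACK_WEBHOOK_URL` and the watch terms are placeholders you'd supply:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder webhook
WATCH_TERMS = {"mycompany", "competitorx"}  # hypothetical terms to track

def matching_stories(stories: list[dict]) -> list[dict]:
    """Return scraped stories whose title mentions any watched term."""
    hits = []
    for story in stories:
        title = (story.get("title") or "").lower()
        if any(term in title for term in WATCH_TERMS):
            hits.append(story)
    return hits

def notify(story: dict) -> None:
    """Post a one-line alert to Slack via an incoming webhook."""
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"HN mention: {story['title']} {story['hnUrl']}"},
        timeout=10,
    )
```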
Mode: Ask HN
"Ask HN" posts are questions from the community — they have no external URL, just a text body. These are goldmines for understanding developer pain points.
# Get the latest 30 Ask HN posts with comments
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={
"category": "ask",
"maxItems": 30,
"includeComments": True,
"maxCommentsPerStory": 100,
},
)
Use Case: Developer Sentiment Analysis
Ask HN threads like "What are you working on?", "Who is hiring?", and "What's your tech stack?" are longitudinal datasets. Scrape them monthly and track how technology preferences shift over time.
The comment data comes back structured with parent-child relationships:
{
"id": 42345679,
"author": "commenter1",
"text": "We switched from Redis to Valkey last month and haven't looked back.",
"createdAt": "2026-03-15T11:00:00.000Z",
"parentId": 42345678
}
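With those `parentId` links, nesting the flat comment list back into a tree takes only a few lines. A sketch assuming each comment dict has the `id` and `parentId` fields shown above:

```python
from collections import defaultdict

def build_comment_tree(comments: list[dict], root_id: int) -> list[dict]:
    """Nest a flat comment list into a tree via parentId links."""
    children = defaultdict(list)
    for c in comments:
        children[c["parentId"]].append(c)

    def attach(parent_id: int) -> list[dict]:
        # Copy each comment and recursively attach its replies
        return [{**c, "replies": attach(c["id"])} for c in children[parent_id]]

    return attach(root_id)

flat = [
    {"id": 2, "parentId": 1, "text": "top-level reply"},
    {"id": 3, "parentId": 2, "text": "nested reply"},
]
tree = build_comment_tree(flat, root_id=1)
print(tree[0]["replies"][0]["text"])  # nested reply
```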
Mode: Show HN
"Show HN" posts are project launches and demos. This is where founders announce new tools, and the community provides brutally honest feedback.
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={"category": "show", "maxItems": 50},
)
Use Case: Competitive Intelligence
Monitor Show HN for launches in your space. Track which products get upvoted (market validation) and read the comments for feature requests and criticism. This is free market research from your exact target audience.
Mode: Jobs
HN's dedicated jobs feed. These listings come primarily from Y Combinator companies; the broader community hires through the monthly "Who is hiring?" threads, which are Ask HN posts.
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={"category": "jobs", "maxItems": 100},
)
Use Case: Hiring Market Analysis
Track which technologies appear most in job postings over time, and watch which YC companies post repeatedly (a steady stream of job posts is a rough proxy for growth and funding traction).
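A sketch of the counting step, assuming `jobs` is a list of story dicts from a `jobs` run. Word-boundary matching keeps "rust" from matching "frustrating", though very short names like "go" remain noisy:

```python
import re
from collections import Counter

TECH_TERMS = ["python", "rust", "react", "kubernetes", "postgres"]

def count_tech_mentions(jobs: list[dict]) -> Counter:
    """Count job posts whose title or text mentions each technology."""
    counts = Counter()
    for job in jobs:
        haystack = f"{job.get('title', '')} {job.get('text') or ''}".lower()
        for term in TECH_TERMS:
            # \b anchors avoid substring false positives inside longer words
            if re.search(rf"\b{re.escape(term)}\b", haystack):
                counts[term] += 1
    return counts

sample = [{"title": "Senior Python Engineer", "text": "Postgres, Kubernetes"}]
print(count_tech_mentions(sample).most_common())
```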
Mode: Search
Full-text search across all HN content via the Algolia API. This is the most powerful mode for targeted data collection.
# Search for posts about "AI agents" frameworks
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={
"category": "search",
"searchQuery": "AI agents framework",
"maxItems": 200,
},
)
Search Query Tips
- Exact phrases: `"machine learning"` matches the exact phrase
- Boolean: queries support AND/OR logic via Algolia syntax
- By date: combine with Algolia's date parameters for time-filtered searches
- By points: the Algolia API supports `numericFilters` for filtering by score
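The score and date filters can be combined in a single `numericFilters` value when calling Algolia directly, e.g. stories over 100 points submitted after a Unix timestamp (the timestamp below is an arbitrary example):

```python
import requests

resp = requests.get(
    "https://hn.algolia.com/api/v1/search",
    params={
        "query": "postgres",
        "tags": "story",
        # points filter plus created_at_i (Unix seconds); comma means AND
        "numericFilters": "points>100,created_at_i>1704067200",
    },
    timeout=10,
)
for hit in resp.json()["hits"][:5]:
    print(f"[{hit['points']} pts] {hit['title']}")
```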
Getting User Profiles
Set scrapeType to "users" or "both" to get author profiles alongside stories:
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={
"category": "best",
"maxItems": 50,
"scrapeType": "both",
},
)
User Profile Output
{
"username": "techfounder",
"karma": 15420,
"about": "Building cool stuff. Previously at BigCorp.",
"createdAt": "2015-03-10T08:00:00.000Z",
"submittedCount": 892
}
High-karma users with long account histories are the opinion leaders. Their comments carry disproportionate weight in discussions.
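If you want to weight comments by author stature, a simple filter over the profile output sketched above does the job (the karma threshold is an arbitrary choice):

```python
def opinion_leaders(users: list[dict], min_karma: int = 10000) -> list[dict]:
    """Keep high-karma profiles, sorted highest karma first."""
    return sorted(
        (u for u in users if u.get("karma", 0) >= min_karma),
        key=lambda u: u["karma"],
        reverse=True,
    )

profiles = [
    {"username": "techfounder", "karma": 15420},
    {"username": "newacct", "karma": 12},
]
print([u["username"] for u in opinion_leaders(profiles)])  # ['techfounder']
```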
Scheduling Daily Collection
Set up automated daily scrapes for ongoing monitoring:
import json
schedule_resp = requests.post(
"https://api.apify.com/v2/schedules",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={
"name": "hn-daily-top-stories",
"cronExpression": "0 8 * * *",
"timezone": "UTC",
"actions": [{
"type": "RUN_ACTOR",
"actorId": "cryptosignals/hackernews-scraper",
"runInput": {
"body": json.dumps({
"category": "top",
"maxItems": 100,
"includeComments": True,
"maxCommentsPerStory": 50,
}),
"contentType": "application/json",
},
}],
},
)
Each run creates a new dataset. Over time, you build a historical archive of what was trending on HN each day — invaluable for trend analysis and retrospective research.
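To rebuild that archive later, you can walk the actor's recent runs and merge their datasets. A sketch using Apify's list-runs and dataset-items endpoints, with the token and actor ID passed in as parameters:

```python
import requests

def collect_archive(token: str, actor_id: str, max_runs: int = 30) -> list[dict]:
    """Merge the datasets of recent successful runs into one list."""
    headers = {"Authorization": f"Bearer {token}"}
    # Most recent runs first
    runs = requests.get(
        f"https://api.apify.com/v2/acts/{actor_id}/runs",
        headers=headers,
        params={"desc": "true", "limit": max_runs},
        timeout=30,
    ).json()["data"]["items"]

    archive = []
    for run in runs:
        if run["status"] != "SUCCEEDED":
            continue  # skip failed or aborted runs
        items = requests.get(
            f"https://api.apify.com/v2/datasets/{run['defaultDatasetId']}/items",
            headers=headers,
            timeout=30,
        ).json()
        archive.extend(items)
    return archive
```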
Using Proxies for Scale
The official HN Firebase API has no documented rate limit, and the Apify actor handles concurrent requests with sensible batching. However, if you're combining HN data collection with scraping from other platforms in your pipeline, you'll want a proxy rotation layer.
ThorData provides residential and datacenter proxy pools with automatic rotation — useful for mixed scraping pipelines where you're hitting multiple sources at different rate limit thresholds.
Practical Analysis Examples
Trend Tracking Dashboard
import pandas as pd
# Assume `stories` is a list of story dicts from the scraper
df = pd.DataFrame(stories)
df["createdAt"] = pd.to_datetime(df["createdAt"])
df["date"] = df["createdAt"].dt.date
# Daily story volume and average score
daily = df.groupby("date").agg(
story_count=("id", "count"),
avg_score=("score", "mean"),
total_comments=("commentCount", "sum"),
).reset_index()
print(daily.to_string(index=False))
Technology Mention Tracker
# Search for specific technologies and compare volume
technologies = ["Rust", "Go", "Python", "TypeScript", "Zig"]
for tech in technologies:
resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs?waitForFinish=120",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={
"category": "search",
"searchQuery": tech,
"maxItems": 100,
},
)
items = requests.get(
f"https://api.apify.com/v2/datasets/{resp.json()['data']['defaultDatasetId']}/items",
headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
).json()
avg_score = sum(i.get("score", 0) for i in items) / max(len(items), 1)
print(f"{tech}: {len(items)} posts, avg score {avg_score:.0f}")
Why Use an Apify Actor vs. Direct API
You might wonder: if the HN API is free and public, why not just call it directly?
| Concern | Direct API | Apify Actor |
|---|---|---|
| Fetching stories | 500 individual HTTP calls for 500 stories | Single API call, actor handles batching |
| Comment trees | Recursive fetching with parent/child resolution | Built-in with depth limits |
| Search | Algolia pagination loop | Handled automatically |
| User profiles | Separate fetch per username | Batch extraction with deduplication |
| Export | JSON only (build CSV yourself) | JSON, CSV, Excel, XML |
| Scheduling | You manage cron, hosting, retries | Built-in cron with monitoring |
| Error handling | Build retry logic | Automatic retries and alerts |
For a quick one-off query, curl to the Firebase API is perfect. For production data pipelines — daily collection, comment extraction, multi-query analysis — the actor saves significant engineering time.
Getting Started
- Create a free Apify account
- Open the Hacker News Scraper
- Pick a category, set your limits, click Start
- Download results as JSON, CSV, or Excel
For the Python examples above, grab your API token from the Apify Console and replace YOUR_APIFY_API_TOKEN.
Hacker News has been the tech industry's town square for nearly 20 years. The data is public, structured, and free to access. Whether you're tracking trends, monitoring competitors, or studying how technical communities form opinions, the data is there — and now you know how to collect it at scale.
The Hacker News Scraper is available at apify.com/cryptosignals/hackernews-scraper. Built on the official HN Firebase API and Algolia HN Search.