Hacker News is one of the highest signal-to-noise communities on the internet. Started by Y Combinator in 2007, it's the default gathering place for founders, engineers, investors, and researchers who shape the tech industry. When a new framework gains traction, a startup gets acquired, or a security vulnerability drops — the discussion happens on HN first, often days before mainstream coverage.
That makes HN data genuinely valuable. Recruiters mine "Who's Hiring" threads. VCs track what technical audiences are excited about. Researchers study developer sentiment. Content marketers identify trending topics. And founders validate ideas by watching what the community upvotes.
In this guide, I'll walk through every mode of the Hacker News Scraper on Apify (top stories, newest, best, Ask HN, Show HN, jobs, and full-text search) with Python code for each, plus how to extract comment trees and user profiles at scale.
Why Hacker News Data Is Valuable
HN isn't just another tech forum. The community has specific properties that make its data uniquely useful:
Tech opinion leaders. HN's user base skews heavily toward senior engineers, CTOs, and founders. When a post about a new tool hits 300+ points, that's signal from people who actually build production systems — not casual upvotes.
Early adopter signal. Technologies that trend on HN often go mainstream 6-12 months later. Docker, Kubernetes, and GPT-3 all had their breakout HN moments before the broader developer community caught on.
Unfiltered sentiment. Unlike LinkedIn or Twitter where people perform for their network, HN comments tend to be technically honest. Negative feedback on a product launch here is more informative than five-star reviews elsewhere.
Structured data. Every story has a score, comment count, author, timestamp, and URL. Every comment has parent-child relationships. This makes HN data ideal for quantitative analysis without extensive preprocessing.
How the Hacker News API Works
HN exposes two public APIs — both free, both unauthenticated:
1. Firebase API (Official)
Base URL: https://hacker-news.firebaseio.com/v0/
This is the official API maintained by Y Combinator. Every item (stories, comments, jobs, polls) has an integer ID and lives at a predictable URL:
# Fetch a single item
curl "https://hacker-news.firebaseio.com/v0/item/1.json"
# Get current top story IDs
curl "https://hacker-news.firebaseio.com/v0/topstories.json"
The Firebase API returns arrays of IDs for story lists, then you fetch each item individually. For 500 stories, that's 500+ HTTP requests — which is why batch processing with concurrency matters.
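Here's a minimal sketch of that fan-out pattern against the official Firebase endpoints, using a thread pool to parallelize the per-item fetches (the endpoint URLs are real; the worker count and item limit are arbitrary choices):

```python
import concurrent.futures

import requests

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_item(item_id: int) -> dict:
    # Every item (story, comment, job, poll) lives at /item/<id>.json
    return requests.get(f"{BASE}/item/{item_id}.json", timeout=10).json()

# One request for the ID list, then N concurrent requests for the items
top_ids = requests.get(f"{BASE}/topstories.json", timeout=10).json()[:30]
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    stories = [s for s in pool.map(fetch_item, top_ids) if s]

for story in stories[:5]:
    print(f"[{story.get('score', 0)} pts] {story.get('title')}")
```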
2. Algolia API (Search)
Base URL: https://hn.algolia.com/api/v1/
Algolia indexes all HN content and provides full-text search with filtering, sorting, and pagination:
# Search for posts about "rust programming"
curl "https://hn.algolia.com/api/v1/search?query=rust+programming&tags=story&hitsPerPage=50"
This returns rich results with title, URL, points, comment count, author, and timestamps — all in one response.
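The same query from Python, hitting Algolia directly (`points` and `num_comments` are field names from Algolia's documented response shape):

```python
import requests

resp = requests.get(
    "https://hn.algolia.com/api/v1/search",
    params={"query": "rust programming", "tags": "story", "hitsPerPage": 5},
    timeout=10,
)
hits = resp.json()["hits"]
for hit in hits:
    # Each hit carries the full story metadata in a single response
    print(f"[{hit['points']} pts] {hit['title']} ({hit['num_comments']} comments)")
```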
The Scraper: 7 Modes Explained
The Hacker News Scraper wraps both APIs with production-grade logic: concurrent fetching, automatic pagination, comment tree extraction, and multiple export formats.
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `category` | string | `"top"` | `top`, `new`, `best`, `ask`, `show`, `jobs`, or `search` |
| `searchQuery` | string | `""` | Keyword query (required when `category` is `search`) |
| `maxItems` | integer | `100` | Max stories to return (1-500) |
| `includeComments` | boolean | `false` | Fetch full comment trees |
| `maxCommentsPerStory` | integer | `50` | Max comments per story (1-500) |
| `scrapeType` | string | `"stories"` | `stories`, `users`, or `both` |
Mode: Top Stories
The default mode. Returns the current top stories ranked by HN's algorithm (a combination of points and time decay).
import requests
import time
APIFY_TOKEN = "YOUR_APIFY_API_TOKEN"
ACTOR_ID = "cryptosignals/hackernews-scraper"
# Start a run for top 50 stories
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={"category": "top", "maxItems": 50},
)
run = run_resp.json()["data"]
run_id = run["id"]
dataset_id = run["defaultDatasetId"]
# Wait for completion
while True:
status = requests.get(
f"https://api.apify.com/v2/actor-runs/{run_id}",
headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
).json()["data"]["status"]
if status in ("SUCCEEDED", "FAILED", "ABORTED"):
break
time.sleep(3)
# Fetch results
stories = requests.get(
f"https://api.apify.com/v2/datasets/{dataset_id}/items",
headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
).json()
for s in stories[:5]:
print(f"[{s['score']} pts] {s['title']}")
print(f" By: {s['author']} | Comments: {s['commentCount']}")
print(f" URL: {s.get('url', s['hnUrl'])}")
print()
Output Schema
{
"id": 42345678,
"title": "Show HN: I built an AI-powered code reviewer",
"url": "https://example.com/project",
"text": null,
"author": "techfounder",
"score": 342,
"commentCount": 156,
"createdAt": "2026-03-15T10:30:00.000Z",
"storyType": "show",
"hnUrl": "https://news.ycombinator.com/item?id=42345678"
}
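That "points and time decay" combination is often approximated with the gravity formula Paul Graham has described publicly. A rough model only (the live algorithm also applies penalties and moderation adjustments):

```python
def rank_score(points: int, age_hours: float, gravity: float = 1.8) -> float:
    """Commonly cited approximation of HN front-page ranking:
    (points - 1) / (age_hours + 2) ** gravity
    """
    return (points - 1) / (age_hours + 2) ** gravity

# Time decay dominates: a fresh 50-point story outranks a 12-hour-old
# 300-point one under this model.
print(rank_score(50, 1.0) > rank_score(300, 12.0))  # True
```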
Mode: Newest Stories
Fetch the most recently submitted stories — useful for monitoring new submissions in real time.
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={"category": "new", "maxItems": 100},
)
Use Case: Link Monitoring
Run this every hour to catch when your company, product, or competitor gets submitted to HN. Combined with a webhook, you can get Slack notifications within minutes of a submission.
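A sketch of the matching half of that pipeline; `SLACK_WEBHOOK_URL` and the watch terms are placeholders you'd supply:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder webhook
WATCH_TERMS = {"mycompany", "competitorx"}  # hypothetical terms to track

def matching_stories(stories: list[dict]) -> list[dict]:
    """Return scraped stories whose title mentions any watched term."""
    hits = []
    for story in stories:
        title = (story.get("title") or "").lower()
        if any(term in title for term in WATCH_TERMS):
            hits.append(story)
    return hits

def notify(story: dict) -> None:
    """Post a one-line alert to Slack via an incoming webhook."""
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"HN mention: {story['title']} {story['hnUrl']}"},
        timeout=10,
    )
```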
Mode: Ask HN
"Ask HN" posts are questions from the community — they have no external URL, just a text body. These are goldmines for understanding developer pain points.
# Get the latest 30 Ask HN posts with comments
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={
"category": "ask",
"maxItems": 30,
"includeComments": True,
"maxCommentsPerStory": 100,
},
)
Use Case: Developer Sentiment Analysis
Ask HN threads like "What are you working on?", "Who is hiring?", and "What's your tech stack?" are longitudinal datasets. Scrape them monthly and track how technology preferences shift over time.
The comment data comes back structured with parent-child relationships:
{
"id": 42345679,
"author": "commenter1",
"text": "We switched from Redis to Valkey last month and haven't looked back.",
"createdAt": "2026-03-15T11:00:00.000Z",
"parentId": 42345678
}
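With those `parentId` links, nesting the flat comment list back into a tree takes only a few lines. A sketch assuming each comment dict has the `id` and `parentId` fields shown above:

```python
from collections import defaultdict

def build_comment_tree(comments: list[dict], root_id: int) -> list[dict]:
    """Nest a flat comment list into a tree via parentId links."""
    children = defaultdict(list)
    for c in comments:
        children[c["parentId"]].append(c)

    def attach(parent_id: int) -> list[dict]:
        # Copy each comment and recursively attach its replies
        return [{**c, "replies": attach(c["id"])} for c in children[parent_id]]

    return attach(root_id)

flat = [
    {"id": 2, "parentId": 1, "text": "top-level reply"},
    {"id": 3, "parentId": 2, "text": "nested reply"},
]
tree = build_comment_tree(flat, root_id=1)
print(tree[0]["replies"][0]["text"])  # nested reply
```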
Mode: Show HN
"Show HN" posts are project launches and demos. This is where founders announce new tools, and the community provides brutally honest feedback.
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={"category": "show", "maxItems": 50},
)
Use Case: Competitive Intelligence
Monitor Show HN for launches in your space. Track which products get upvoted (market validation) and read the comments for feature requests and criticism. This is free market research from your exact target audience.
Mode: Jobs
HN's dedicated jobs feed. These listings come primarily from Y Combinator companies; the broader community hires through the monthly "Who is hiring?" threads, which are Ask HN posts.
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={"category": "jobs", "maxItems": 100},
)
Use Case: Hiring Market Analysis
Track which technologies appear most in job postings over time, and watch which YC companies post repeatedly (a steady stream of job posts is a rough proxy for growth and funding traction).
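A sketch of the counting step, assuming `jobs` is a list of story dicts from a `jobs` run. Word-boundary matching keeps "rust" from matching "frustrating", though very short names like "go" remain noisy:

```python
import re
from collections import Counter

TECH_TERMS = ["python", "rust", "react", "kubernetes", "postgres"]

def count_tech_mentions(jobs: list[dict]) -> Counter:
    """Count job posts whose title or text mentions each technology."""
    counts = Counter()
    for job in jobs:
        haystack = f"{job.get('title', '')} {job.get('text') or ''}".lower()
        for term in TECH_TERMS:
            # \b anchors avoid substring false positives inside longer words
            if re.search(rf"\b{re.escape(term)}\b", haystack):
                counts[term] += 1
    return counts

sample = [{"title": "Senior Python Engineer", "text": "Postgres, Kubernetes"}]
print(count_tech_mentions(sample).most_common())
```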
Mode: Search
Full-text search across all HN content via the Algolia API. This is the most powerful mode for targeted data collection.
# Search for posts about "AI agents" frameworks
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={
"category": "search",
"searchQuery": "AI agents framework",
"maxItems": 200,
},
)
Search Query Tips
- Exact phrases: `"machine learning"` matches the exact phrase
- Boolean: queries support AND/OR logic via Algolia syntax
- By date: combine with Algolia's date parameters for time-filtered searches
- By points: the Algolia API supports `numericFilters` for filtering by score
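The score and date filters can be combined in a single `numericFilters` value when calling Algolia directly, e.g. stories over 100 points submitted after a Unix timestamp (the timestamp below is an arbitrary example):

```python
import requests

resp = requests.get(
    "https://hn.algolia.com/api/v1/search",
    params={
        "query": "postgres",
        "tags": "story",
        # points filter plus created_at_i (Unix seconds); comma means AND
        "numericFilters": "points>100,created_at_i>1704067200",
    },
    timeout=10,
)
for hit in resp.json()["hits"][:5]:
    print(f"[{hit['points']} pts] {hit['title']}")
```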
Getting User Profiles
Set scrapeType to "users" or "both" to get author profiles alongside stories:
run_resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={
"category": "best",
"maxItems": 50,
"scrapeType": "both",
},
)
User Profile Output
{
"username": "techfounder",
"karma": 15420,
"about": "Building cool stuff. Previously at BigCorp.",
"createdAt": "2015-03-10T08:00:00.000Z",
"submittedCount": 892
}
High-karma users with long account histories are the opinion leaders. Their comments carry disproportionate weight in discussions.
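If you want to weight comments by author stature, a simple filter over the profile output sketched above does the job (the karma threshold is an arbitrary choice):

```python
def opinion_leaders(users: list[dict], min_karma: int = 10000) -> list[dict]:
    """Keep high-karma profiles, sorted highest karma first."""
    return sorted(
        (u for u in users if u.get("karma", 0) >= min_karma),
        key=lambda u: u["karma"],
        reverse=True,
    )

profiles = [
    {"username": "techfounder", "karma": 15420},
    {"username": "newacct", "karma": 12},
]
print([u["username"] for u in opinion_leaders(profiles)])  # ['techfounder']
```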
Scheduling Daily Collection
Set up automated daily scrapes for ongoing monitoring:
import json
schedule_resp = requests.post(
"https://api.apify.com/v2/schedules",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={
"name": "hn-daily-top-stories",
"cronExpression": "0 8 * * *",
"timezone": "UTC",
"actions": [{
"type": "RUN_ACTOR",
"actorId": "cryptosignals/hackernews-scraper",
"runInput": {
"body": json.dumps({
"category": "top",
"maxItems": 100,
"includeComments": True,
"maxCommentsPerStory": 50,
}),
"contentType": "application/json",
},
}],
},
)
Each run creates a new dataset. Over time, you build a historical archive of what was trending on HN each day — invaluable for trend analysis and retrospective research.
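To rebuild that archive later, you can walk the actor's recent runs and merge their datasets. A sketch using Apify's list-runs and dataset-items endpoints, with the token and actor ID passed in as parameters:

```python
import requests

def collect_archive(token: str, actor_id: str, max_runs: int = 30) -> list[dict]:
    """Merge the datasets of recent successful runs into one list."""
    headers = {"Authorization": f"Bearer {token}"}
    # Most recent runs first
    runs = requests.get(
        f"https://api.apify.com/v2/acts/{actor_id}/runs",
        headers=headers,
        params={"desc": "true", "limit": max_runs},
        timeout=30,
    ).json()["data"]["items"]

    archive = []
    for run in runs:
        if run["status"] != "SUCCEEDED":
            continue  # skip failed or aborted runs
        items = requests.get(
            f"https://api.apify.com/v2/datasets/{run['defaultDatasetId']}/items",
            headers=headers,
            timeout=30,
        ).json()
        archive.extend(items)
    return archive
```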
Using Proxies for Scale
The official HN Firebase API has no documented rate limit, and the Apify actor handles concurrent requests with sensible batching. However, if you're combining HN data collection with scraping from other platforms in your pipeline, you'll want a proxy rotation layer.
ThorData provides residential and datacenter proxy pools with automatic rotation — useful for mixed scraping pipelines where you're hitting multiple sources at different rate limit thresholds.
Practical Analysis Examples
Trend Tracking Dashboard
import pandas as pd
# Assume `stories` is a list of story dicts from the scraper
df = pd.DataFrame(stories)
df["createdAt"] = pd.to_datetime(df["createdAt"])
df["date"] = df["createdAt"].dt.date
# Daily story volume and average score
daily = df.groupby("date").agg(
story_count=("id", "count"),
avg_score=("score", "mean"),
total_comments=("commentCount", "sum"),
).reset_index()
print(daily.to_string(index=False))
Technology Mention Tracker
# Search for specific technologies and compare volume
technologies = ["Rust", "Go", "Python", "TypeScript", "Zig"]
for tech in technologies:
resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs?waitForFinish=120",
headers={
"Authorization": f"Bearer {APIFY_TOKEN}",
"Content-Type": "application/json",
},
json={
"category": "search",
"searchQuery": tech,
"maxItems": 100,
},
)
items = requests.get(
f"https://api.apify.com/v2/datasets/{resp.json()['data']['defaultDatasetId']}/items",
headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
).json()
avg_score = sum(i.get("score", 0) for i in items) / max(len(items), 1)
print(f"{tech}: {len(items)} posts, avg score {avg_score:.0f}")
Why Use an Apify Actor vs. Direct API
You might wonder: if the HN API is free and public, why not just call it directly?
| Concern | Direct API | Apify Actor |
|---|---|---|
| Fetching stories | 500 individual HTTP calls for 500 stories | Single API call, actor handles batching |
| Comment trees | Recursive fetching with parent/child resolution | Built-in with depth limits |
| Search | Algolia pagination loop | Handled automatically |
| User profiles | Separate fetch per username | Batch extraction with deduplication |
| Export | JSON only (build CSV yourself) | JSON, CSV, Excel, XML |
| Scheduling | You manage cron, hosting, retries | Built-in cron with monitoring |
| Error handling | Build retry logic | Automatic retries and alerts |
For a quick one-off query, curl to the Firebase API is perfect. For production data pipelines — daily collection, comment extraction, multi-query analysis — the actor saves significant engineering time.
Getting Started
- Create a free Apify account
- Open the Hacker News Scraper
- Pick a category, set your limits, click Start
- Download results as JSON, CSV, or Excel
For the Python examples above, grab your API token from the Apify Console and replace YOUR_APIFY_API_TOKEN.
Hacker News has been the tech industry's town square for nearly 20 years. The data is public, structured, and free to access. Whether you're tracking trends, monitoring competitors, or studying how technical communities form opinions, the data is there — and now you know how to collect it at scale.
The Hacker News Scraper is available at apify.com/cryptosignals/hackernews-scraper. Built on the official HN Firebase API and Algolia HN Search.