How the Hacker News API Works
Hacker News exposes a public REST API built on Google's Firebase. It requires no authentication, no API keys, and no developer application. You can query it right now with a single curl command.
The base URL is https://hacker-news.firebaseio.com/v0/. All responses are JSON. Because it is built on Firebase's real-time database, every item — stories, comments, job posts, polls — lives at a predictable URL based on its integer ID.
Here are the core endpoints:
Top, New, and Best Story Lists
These endpoints return an array of item IDs sorted by their respective ranking algorithm:
# Top stories (front page ranking)
curl "https://hacker-news.firebaseio.com/v0/topstories.json"
# Newest submissions
curl "https://hacker-news.firebaseio.com/v0/newstories.json"
# Best stories (long-term quality ranking)
curl "https://hacker-news.firebaseio.com/v0/beststories.json"
# Ask HN posts
curl "https://hacker-news.firebaseio.com/v0/askstories.json"
# Show HN posts
curl "https://hacker-news.firebaseio.com/v0/showstories.json"
# Job postings
curl "https://hacker-news.firebaseio.com/v0/jobstories.json"
Each of these returns an array of up to 500 item IDs. The top stories endpoint returns the current front page ordering, which changes every few minutes as votes and time decay interact.
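The "IDs first, details later" pattern above can be sketched in a few lines of Python. The `fetch` parameter is injectable purely so the logic is easy to test offline; by default it performs a real GET against the Firebase endpoint:

```python
import json
from urllib.request import urlopen

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_json(url):
    """GET a URL and parse the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)

def top_story_ids(limit=10, fetch=fetch_json):
    # The endpoint returns up to 500 integer IDs in front-page order;
    # keep only the first `limit` of them.
    return fetch(f"{BASE}/topstories.json")[:limit]
```

Each ID returned here still requires a separate call to the item endpoint to get a title or score, which is the fan-out problem discussed later.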
Item Details
Once you have an item ID, you can fetch full details for any story, comment, or job post:
curl "https://hacker-news.firebaseio.com/v0/item/43211234.json"
A story response looks like this:
{
"id": 43211234,
"type": "story",
"title": "SQLite is not a toy database",
"url": "https://antonz.org/sqlite-is-not-a-toy-database/",
"by": "thunderbong",
"score": 847,
"descendants": 142,
"kids": [43211501, 43211398, 43211287],
"time": 1741234567
}
The kids field contains IDs of top-level comments. Each comment is itself an item, and may have its own kids — so fetching a full comment thread means recursively walking a tree of IDs and making an API call for each node.
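That recursive walk can be written as a generator. Again the `fetch` parameter exists so the traversal is testable without network access; a real crawler would use the default, which hits the item endpoint once per node:

```python
import json
from urllib.request import urlopen

def fetch_item(item_id):
    """Fetch one item (story, comment, job...) by ID from the Firebase API."""
    url = f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json"
    with urlopen(url) as resp:
        return json.load(resp)

def walk_comments(item_id, fetch=fetch_item, depth=0):
    """Yield (depth, item) for every comment in the subtree under item_id."""
    item = fetch(item_id)
    if item is None:  # deleted or missing items come back as null
        return
    if item.get("type") == "comment":
        yield depth, item
    for kid in item.get("kids", []):
        yield from walk_comments(kid, fetch=fetch, depth=depth + 1)
```

Note that this makes one HTTP request per node, which is exactly the N+1 cost described in the next section.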
User Profiles
Fetch a user's profile by username:
curl "https://hacker-news.firebaseio.com/v0/user/pg.json"
Response:
{
"id": "pg",
"created": 1160418092,
"karma": 155111,
"about": "Co-founder of Viaweb and Y Combinator...",
"submitted": [43198765, 43187654, 43156789]
}
The submitted array contains IDs of every item the user has ever posted or commented on, most recent first. For prolific users, this array can have tens of thousands of entries.
Search via Algolia
HN's own search is powered by Algolia and exposes a separate API endpoint that supports full-text search, date filtering, and result ranking:
# Search stories by keyword
curl "https://hn.algolia.com/api/v1/search?query=rust+programming&tags=story"
# Search with date filtering (Unix timestamps)
curl "https://hn.algolia.com/api/v1/search?query=openai&numericFilters=created_at_i>1700000000"
# Sort matches newest-first instead of by relevance
curl "https://hn.algolia.com/api/v1/search_by_date?query=webassembly&tags=story"
The Algolia endpoint returns richer metadata than the Firebase API — including highlighted text matches, story URLs, and author information — all in a single response per page.
| Endpoint | Purpose | Pagination |
|---|---|---|
| `search` | Best-match ranking | `page` parameter |
| `search_by_date` | Chronological | `page` parameter |

Tags available: `story`, `comment`, `poll`, `job`, `ask_hn`, `show_hn`.
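A minimal Python wrapper for the Algolia endpoint might look like this. The URL construction is split out as its own function so it can be checked without making a request:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ALGOLIA = "https://hn.algolia.com/api/v1"

def build_search_url(query, tags="story", page=0):
    """Assemble the query string for Algolia's /search endpoint."""
    params = urlencode({"query": query, "tags": tags, "page": page})
    return f"{ALGOLIA}/search?{params}"

def search(query, tags="story", page=0):
    """Return one page of hits; each hit carries title, url, author, points."""
    with urlopen(build_search_url(query, tags, page)) as resp:
        data = json.load(resp)
    return data.get("hits", [])
```

`urlencode` also handles spaces and special characters in queries, which the raw curl examples above sidestep by using `+` manually.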
The Pagination and Rate Limit Problem
The direct API is excellent for simple lookups. The problems start when you try to collect data at any meaningful scale.
The fan-out problem. The story list endpoints give you IDs, not content. To get 500 top stories with their metadata, you need 500 individual HTTP requests to the item endpoint — one per story. To get the comments on those 500 stories, you need additional requests for every comment ID in every kids array. A single thread with 200 comments might require 200+ additional requests to fully traverse. For the full front page with comments, you are easily looking at 5,000–10,000 API calls.
No bulk endpoint. There is no way to say "give me the top 100 stories with their metadata in one request." The Firebase API is intentionally simple: one ID lookup per request. You build the fan-out logic yourself.
Rate limits. Firebase does not publish explicit rate limits for the HN API, but aggressive concurrent requests will result in connection refusals or throttling. Production scrapers need exponential backoff, connection pooling, and retry logic to work reliably.
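A backoff loop of the kind production scrapers need can be sketched as follows; the delay calculation is factored out so the schedule is explicit, and the `sleep` parameter is injectable to keep it testable:

```python
import random
import time
from urllib.error import URLError
from urllib.request import urlopen

def backoff_delay(attempt, base=0.5):
    # 0.5s, 1s, 2s, 4s, ... doubling per failed attempt.
    return base * (2 ** attempt)

def fetch_with_backoff(url, retries=5, sleep=time.sleep):
    """GET a URL, retrying transient failures with exponential backoff."""
    for attempt in range(retries):
        try:
            with urlopen(url) as resp:
                return resp.read()
        except URLError:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            # Jitter spreads out retries from concurrent workers.
            sleep(backoff_delay(attempt) + random.uniform(0, 0.1))
```

The specific base delay and retry count here are illustrative defaults, not documented limits; tune them against observed throttling.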
Comment tree traversal. Comments are nested arbitrarily deep. Fetching a complete thread means walking the tree recursively, and you can't know the depth in advance. A top-level comment might have no replies or it might have a 15-level-deep argument about tabs versus spaces.
No date filtering on the Firebase API. If you want stories from a specific date range using the Firebase API, you have to fetch IDs, fetch each item, check the timestamp, and discard anything out of range. There is no server-side filter.
The Algolia search endpoint solves some of these problems (date filtering, full-text search), but introduces its own: it is an index, not the live database, so there is a sync delay. And you still get paginated results that require page-by-page iteration for large collections.
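The page-by-page iteration looks like this. The loop is written against an injectable `fetch_page(page)` callable so it can be exercised without network access; a real fetcher would call the Algolia `search` endpoint with the `page` parameter and return its parsed JSON, which includes `hits` and an `nbPages` page count:

```python
def collect_all_pages(fetch_page, max_pages=50):
    """Accumulate Algolia hits across pages until nbPages is exhausted.

    fetch_page(page) must return one page of parsed JSON with
    `hits` (list) and `nbPages` (int) fields, as Algolia responses do.
    """
    hits = []
    page = 0
    while page < max_pages:
        data = fetch_page(page)
        hits.extend(data.get("hits", []))
        page += 1
        if page >= data.get("nbPages", 0):
            break  # no more pages in the result set
    return hits
```

The `max_pages` cap guards against runaway loops on very broad queries.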
This is exactly the kind of infrastructure problem that an Apify Actor is built to solve.
Using the Hacker News Scraper on Apify
The Hacker News Scraper is a ready-to-use Actor that handles the fan-out, pagination, rate limiting, and tree traversal for you. You configure what you want, click run, and get structured data back.
Step 1: Open the Actor
Go to apify.com/cryptosignals/hackernews-scraper. You can run it from the web console with no code required.
Step 2: Configure Your Search
In the Actor's input configuration, set your parameters:
- Search terms — keywords or phrases to search for across HN stories. You can pass multiple terms and they will be processed in parallel.
- Max results — cap the total number of items returned. Useful for controlling cost and runtime during testing.
- Result type — choose to collect stories, comments, or both. For recruiting use cases, you typically want stories only. For sentiment analysis, you want comments.
Step 3: Run and Export
Click Start. Depending on the volume requested, runs typically complete in seconds to a few minutes. When finished, export results in any of:
- JSON — for programmatic consumption and data pipelines
- CSV — for spreadsheets, Excel, and BI tools like Tableau or Looker
- Excel — for business stakeholders who want to open it directly
- XML — for legacy system integration
Each story record includes the title, URL, author, score, comment count, submission timestamp, and story type. Comment records include the text (as HTML), author, timestamp, parent item ID, and nesting depth.
Running the Actor Programmatically
The web console is the right starting point, but automation requires calling the Actor from code. The Apify client libraries wrap the REST API and handle polling for run completion.
JavaScript (Node.js)
Install the client:
npm install apify-client
Run the Actor and process results:
import { ApifyClient } from 'apify-client';
const client = new ApifyClient({
token: 'YOUR_APIFY_API_TOKEN',
});
// Run the HN Scraper actor
const run = await client.actor('cryptosignals/hackernews-scraper').call({
searchTerms: ['rust programming', 'webassembly', 'llm inference'],
maxResults: 300,
});
// Fetch results from the default dataset
const { items } = await client
.dataset(run.defaultDatasetId)
.listItems();
console.log(`Collected ${items.length} items`);
// Process each story
for (const story of items) {
console.log(`[${story.score} pts] ${story.title}`);
console.log(` URL: ${story.url}`);
console.log(` By: ${story.by} | Comments: ${story.descendants}`);
console.log(` Posted: ${new Date(story.time * 1000).toISOString()}`);
console.log('---');
}
Python
Install the client:
pip install apify-client
Run and fetch results:
from apify_client import ApifyClient
from datetime import datetime
client = ApifyClient('YOUR_APIFY_API_TOKEN')
# Configure and run the actor
run = client.actor('cryptosignals/hackernews-scraper').call(run_input={
'searchTerms': ['rust programming', 'webassembly', 'llm inference'],
'maxResults': 300,
})
# Fetch results
dataset_items = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Collected {len(dataset_items)} items')
for story in dataset_items:
posted = datetime.fromtimestamp(story['time']).strftime('%Y-%m-%d')
print(f"[{story.get('score', 0)} pts] {story['title']}")
print(f" URL: {story.get('url', 'N/A')}")
print(f" By: {story['by']} | Comments: {story.get('descendants', 0)}")
print(f" Posted: {posted}")
print('---')
Both examples follow the same pattern: initialize the client with your API token, call the Actor with your desired input, wait for completion (the client handles polling automatically), then iterate over the dataset. You get your data back as a Python list of dictionaries or a JavaScript array of objects — no JSON parsing, no cursor management, no retry logic to write.
Scheduling Recurring Scrapes
Most HN monitoring use cases require ongoing collection, not one-off runs. Apify's scheduling system supports any cron expression, and each scheduled run creates a fresh dataset — giving you a clean time-series record of results.
Setting Up a Schedule via the Console
- Open the Schedules section in Apify Console
- Click Create new schedule
- Select the Hacker News Scraper actor
- Set your cron expression — for example, `0 7 * * *` runs daily at 7:00 AM UTC
- Paste your Actor input (search terms, max results)
- Save the schedule
For "Who's Hiring" threads — which Y Combinator posts on the first weekday of every month — you might set a monthly schedule to capture and archive those threads systematically.
Setting Up a Schedule via the API
import { ApifyClient } from 'apify-client';
const client = new ApifyClient({
token: 'YOUR_APIFY_API_TOKEN',
});
// Create a schedule that runs every weekday morning at 7:00 AM UTC
const schedule = await client.schedules().create({
name: 'hn-daily-tech-monitor',
cronExpression: '0 7 * * 1-5',
timezone: 'UTC',
actions: [{
type: 'RUN_ACTOR',
// The schedules API may expect the Actor's ID from the console
// rather than the 'username/actor-name' handle shown here
actorId: 'cryptosignals/hackernews-scraper',
runInput: {
body: JSON.stringify({
searchTerms: ['your competitor name', 'your product category'],
maxResults: 200,
}),
contentType: 'application/json',
},
}],
});
console.log(`Schedule created: ${schedule.id}`);
Connecting to Downstream Systems
Configure a webhook on the Actor run to trigger the moment a scrape finishes. A typical pipeline looks like:
- Scheduled Actor run completes
- Webhook fires to your endpoint (or an Apify webhook integration)
- Your system reads the dataset via the Apify API
- New records land in your database, Slack channel, or Google Sheet
This gives you a fully automated HN monitoring pipeline with no servers to manage.
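On the receiving side, the webhook handler mostly needs to pull the dataset ID out of the payload and fetch the items. Assuming the default Apify webhook payload, which nests the run object under `resource` with a `defaultDatasetId` field (verify against your own webhook's payload template), the extraction is:

```python
import json

def dataset_id_from_webhook(payload):
    """Pull the default dataset ID out of an Apify run webhook payload.

    Assumes the default payload shape: the run object lives under
    `resource` and carries `defaultDatasetId`.
    """
    run = json.loads(payload) if isinstance(payload, str) else payload
    return run["resource"]["defaultDatasetId"]

def dataset_items_url(dataset_id, fmt="json"):
    """Build the Apify API URL for downloading a dataset's items."""
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?format={fmt}"
```

Your endpoint then GETs that URL (with your API token) and writes the records wherever they need to land.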
Practical Use Cases
Recruiting — "Who's Hiring" Threads
Y Combinator posts a "Who's Hiring" thread on the first business day of every month. These threads contain hundreds of job listings with direct contact information for hiring managers — no recruiter intermediary, no applicant tracking system. Scraping these threads monthly gives you a structured database of active technical hiring, segmented by company, role, tech stack, and location. Search for "Who is Hiring" or "Who's Hiring" to retrieve the thread, then collect all comments.
Trend Detection
HN surfaces early-stage technology discussions before they hit mainstream media. Run a weekly scrape across a set of emerging technology keywords and track score trajectories over time. A story about a new database, programming language, or infrastructure tool that scores 500+ points in its first hour is a meaningful signal that the technical community has noticed something.
For trend analysis, sort by score descending and look at the ratio of points to comments — high points with low comments often indicates strong signal (people upvote and move on), while high comments with lower points may indicate controversy.
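The points-to-comments heuristic described above is a one-liner over the story fields the scraper returns (`score` and `descendants`, following the HN item format):

```python
def signal_ratio(story):
    """Points per comment: high values suggest broad approval,
    low values suggest the thread turned into a debate."""
    comments = story.get("descendants", 0)
    return story.get("score", 0) / max(comments, 1)  # avoid division by zero

def rank_by_signal(stories):
    """Order stories from strongest to weakest signal."""
    return sorted(stories, key=signal_ratio, reverse=True)
```

The exact threshold that counts as "strong signal" is a judgment call; the ratio is only meaningful relative to other stories in the same scrape.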
Competitive Intelligence
Search for your competitors by name. Collect every HN thread where they are mentioned, along with the comments. Comments on HN tend to be technically sophisticated and candid — you will find honest assessments of product tradeoffs, complaints about pricing, and comparisons to alternatives that you will never see in a vendor's marketing materials.
Track these threads over time to spot when sentiment shifts, when a new competitor enters a discussion, or when a feature gap starts appearing repeatedly in comments.
Sentiment Analysis and NLP
HN comments are particularly good training data and evaluation sets for technical NLP tasks because the writing is dense, opinionated, and domain-specific. For sentiment analysis on developer tooling, HN comments are a better signal than Twitter/X or Reddit because the audience is more homogeneous and the discussions are more focused.
Collect comments for a specific story or keyword, strip the HTML tags from the text field, and you have clean input for any standard NLP pipeline.
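Stripping the HTML can be done with the standard library alone; `html.parser` also decodes entities like `&gt;` and `&#x27;` that appear throughout HN comment text:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes, discarding tags like <p>, <a>, <i>."""
    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities to characters
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    """Reduce an HN comment's HTML to plain text."""
    extractor = _TextExtractor()
    extractor.feed(text)
    return "".join(extractor.parts)
```

For paragraph-aware pipelines you might instead emit a newline when a `<p>` tag opens, but for bag-of-words or embedding models the flat text above is sufficient.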
Academic Research
Researchers studying technical communities, online discourse, innovation diffusion, or information cascades have used HN as a primary dataset. The public API and the Algolia search index make it one of the more accessible large social datasets, and the long history (posts dating back to 2007) supports longitudinal studies.
Why Use the Actor vs. Direct API Calls
The Hacker News Firebase API is free and open — so why layer an Apify Actor on top of it?
| Concern | Direct HN API | Apify Actor |
|---|---|---|
| Story metadata | 500 separate HTTP requests for 500 stories | Single configured run |
| Full-text search | Not supported (Algolia endpoint required separately) | Built-in across all runs |
| Comment retrieval | Recursive tree traversal, N+1 requests per thread | Handled automatically |
| Rate limiting | Must implement backoff yourself | Built-in retry and backoff |
| Date filtering | Fetch-and-discard; no server-side filter on Firebase | Supported via Algolia integration |
| Export formats | JSON only | JSON, CSV, Excel, XML |
| Scheduling | You manage cron, hosting, error alerts | Built-in cron with monitoring |
| Error recovery | Build it yourself | Automatic retries and failure alerts |
| Storage | You provision and manage | Managed datasets with retention |
| Multiple search terms | Sequential loops | Parallel execution |
For a single quick lookup — checking the current score on a specific item, fetching one user profile — the direct API is the right tool. It is a curl command. But for production data collection across multiple search terms, daily runs, comment tree traversal, and integration with downstream systems, writing and maintaining that infrastructure yourself is a significant engineering investment. The Actor packages all of it.
Getting Started
The shortest path from zero to a working HN dataset:
- Create a free Apify account
- Open the Hacker News Scraper
- Enter your search terms and a result limit
- Click Start and wait for the run to complete
- Download your results in JSON or CSV
For recurring collection, add a schedule. For programmatic access, copy your API token from the Apify Console and use the code examples above directly in your project.
Hacker News is 17 years of high-quality technical discourse, freely accessible through a public API. Whether you are building a recruiting pipeline, a technology trend tracker, or a competitive intelligence dashboard, the dataset is there — you just need a reliable way to collect it at scale.
The Hacker News Scraper Actor is available at apify.com/cryptosignals/hackernews-scraper. For reference on the underlying API, see the official HN API documentation and the Algolia HN Search API.
Disclosure: This post contains affiliate links. I may earn a commission if you sign up through my links, at no extra cost to you.
Compare web scraping APIs:
- ScraperAPI — 5,000 free credits, 50+ countries, structured data parsing
- Scrape.do — From $29/mo, strong Cloudflare bypass
- ScrapeOps — Proxy comparison + monitoring dashboard
Need custom web scraping? Email hustler@curlship.com — fast turnaround, fair pricing.