How to Scrape Hacker News in 2026: Stories, Comments, Ask HN via API

Hacker News is one of the highest signal-to-noise communities on the internet. Started by Y Combinator in 2007, it's the default gathering place for founders, engineers, investors, and researchers who shape the tech industry. When a new framework gains traction, a startup gets acquired, or a security vulnerability drops — the discussion happens on HN first, often days before mainstream coverage.

That makes HN data genuinely valuable. Recruiters mine "Who's Hiring" threads. VCs track what technical audiences are excited about. Researchers study developer sentiment. Content marketers identify trending topics. And founders validate ideas by watching what the community upvotes.

In this guide, I'll walk through every mode of the Hacker News Scraper on Apify — top stories, newest, Ask HN, Show HN, jobs, and full-text search — with Python code for each. Plus how to extract comment trees and user profiles at scale.


Why Hacker News Data Is Valuable

HN isn't just another tech forum. The community has specific properties that make its data uniquely useful:

Tech opinion leaders. HN's user base skews heavily toward senior engineers, CTOs, and founders. When a post about a new tool hits 300+ points, that's signal from people who actually build production systems — not casual upvotes.

Early adopter signal. Technologies that trend on HN often go mainstream 6-12 months later. Docker, Kubernetes, and GPT-3 all had their breakout HN moments before the broader developer community caught on.

Unfiltered sentiment. Unlike LinkedIn or Twitter where people perform for their network, HN comments tend to be technically honest. Negative feedback on a product launch here is more informative than five-star reviews elsewhere.

Structured data. Every story has a score, comment count, author, timestamp, and URL. Every comment has parent-child relationships. This makes HN data ideal for quantitative analysis without extensive preprocessing.


How the Hacker News API Works

HN exposes two public APIs — both free, both unauthenticated:

1. Firebase API (Official)

Base URL: https://hacker-news.firebaseio.com/v0/

This is the official API maintained by Y Combinator. Every item (stories, comments, jobs, polls) has an integer ID and lives at a predictable URL:

# Fetch a single item
curl "https://hacker-news.firebaseio.com/v0/item/1.json"

# Get current top story IDs
curl "https://hacker-news.firebaseio.com/v0/topstories.json"

The Firebase API returns arrays of IDs for story lists, then you fetch each item individually. For 500 stories, that's 500+ HTTP requests — which is why batch processing with concurrency matters.
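To make the concurrency point concrete, here's a minimal sketch of fetching items in parallel against the Firebase API with a thread pool. The helper names (`fetch_json`, `fetch_items`) are mine, not part of any library; the `fetch` parameter is injectable so the logic can be exercised without network access.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_json(url):
    """GET one endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def fetch_items(ids, fetch=None, workers=10):
    """Fetch HN items by ID concurrently; `fetch` is injectable for testing."""
    fetch = fetch or (lambda i: fetch_json(f"{BASE}/item/{i}.json"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with `ids`
        return list(pool.map(fetch, ids))

# Usage against the live API:
#   ids = fetch_json(f"{BASE}/topstories.json")[:30]
#   stories = fetch_items(ids)
```

With 10 workers, 500 items take roughly 50 round trips' worth of latency instead of 500 — which is essentially what any batching layer over this API has to do.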

2. Algolia API (Search)

Base URL: https://hn.algolia.com/api/v1/

Algolia indexes all HN content and provides full-text search with filtering, sorting, and pagination:

# Search for posts about "rust programming"
curl "https://hn.algolia.com/api/v1/search?query=rust+programming&tags=story&hitsPerPage=50"

This returns rich results with title, URL, points, comment count, author, and timestamps — all in one response.
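If you want to hit Algolia directly from Python rather than curl, a small URL builder keeps the pagination parameters tidy. The function name is my own; the query parameters (`query`, `tags`, `hitsPerPage`, `page`) are the ones the Algolia HN API documents.

```python
from urllib.parse import urlencode

ALGOLIA = "https://hn.algolia.com/api/v1/search"

def build_search_url(query, tags="story", hits_per_page=50, page=0):
    """Compose an Algolia HN search URL with pagination."""
    params = {
        "query": query,
        "tags": tags,          # story, comment, ask_hn, show_hn, ...
        "hitsPerPage": hits_per_page,
        "page": page,
    }
    return f"{ALGOLIA}?{urlencode(params)}"

# e.g. requests.get(build_search_url("rust programming", page=2)).json()["hits"]
```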


The Scraper: 7 Modes Explained

The Hacker News Scraper wraps both APIs with production-grade logic: concurrent fetching, automatic pagination, comment tree extraction, and multiple export formats.

Input Parameters

Parameter            Type     Default     Description
category             string   "top"       top, new, best, ask, show, jobs, or search
searchQuery          string   ""          Keyword query (required when category is search)
maxItems             integer  100         Max stories to return (1-500)
includeComments      boolean  false       Fetch full comment trees
maxCommentsPerStory  integer  50          Max comments per story (1-500)
scrapeType           string   "stories"   stories, users, or both

Mode: Top Stories

The default mode. Returns the current top stories ranked by HN's algorithm (a combination of points and time decay).

import requests
import time

APIFY_TOKEN = "YOUR_APIFY_API_TOKEN"
ACTOR_ID = "cryptosignals/hackernews-scraper"

# Start a run for top 50 stories
run_resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"category": "top", "maxItems": 50},
)

run = run_resp.json()["data"]
run_id = run["id"]
dataset_id = run["defaultDatasetId"]

# Wait for completion
while True:
    status = requests.get(
        f"https://api.apify.com/v2/actor-runs/{run_id}",
        headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
    ).json()["data"]["status"]
    if status in ("SUCCEEDED", "FAILED", "ABORTED"):
        break
    time.sleep(3)

# Fetch results
stories = requests.get(
    f"https://api.apify.com/v2/datasets/{dataset_id}/items",
    headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
).json()

for s in stories[:5]:
    print(f"[{s['score']} pts] {s['title']}")
    print(f"  By: {s['author']} | Comments: {s['commentCount']}")
    print(f"  URL: {s.get('url', s['hnUrl'])}")
    print()

Output Schema

{
  "id": 42345678,
  "title": "Show HN: I built an AI-powered code reviewer",
  "url": "https://example.com/project",
  "text": null,
  "author": "techfounder",
  "score": 342,
  "commentCount": 156,
  "createdAt": "2026-03-15T10:30:00.000Z",
  "storyType": "show",
  "hnUrl": "https://news.ycombinator.com/item?id=42345678"
}

Mode: Newest Stories

Fetch the most recently submitted stories — useful for monitoring new submissions in real time.

run_resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"category": "new", "maxItems": 100},
)

Use Case: Link Monitoring

Run this every hour to catch when your company, product, or competitor gets submitted to HN. Combined with a webhook, you can get Slack notifications within minutes of a submission.
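Whatever the notification transport, the matching step itself is simple. Here's a sketch of filtering a batch of newest-story dicts against a watchlist — the term list and function names are hypothetical placeholders, not part of the actor's output.

```python
WATCHLIST = ["acme", "acme.dev", "wunderwidget"]  # hypothetical brand terms

def matches_watchlist(story, terms=WATCHLIST):
    """True if any watched term appears in the title or URL (case-insensitive)."""
    haystack = f"{story.get('title', '')} {story.get('url', '')}".lower()
    return any(term.lower() in haystack for term in terms)

def new_mentions(stories):
    """Filter a batch of newest-story dicts down to watchlist hits."""
    return [s for s in stories if matches_watchlist(s)]
```

Run this over each hourly batch, dedupe on story `id` against what you've already seen, and forward the remainder to Slack or email.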


Mode: Ask HN

"Ask HN" posts are questions from the community — they have no external URL, just a text body. These are goldmines for understanding developer pain points.

# Get the latest 30 Ask HN posts with comments
run_resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "category": "ask",
        "maxItems": 30,
        "includeComments": True,
        "maxCommentsPerStory": 100,
    },
)

Use Case: Developer Sentiment Analysis

Ask HN threads like "What are you working on?", "Who is hiring?", and "What's your tech stack?" are longitudinal datasets. Scrape them monthly and track how technology preferences shift over time.

The comment data comes back structured with parent-child relationships:

{
  "id": 42345679,
  "author": "commenter1",
  "text": "We switched from Redis to Valkey last month and haven't looked back.",
  "createdAt": "2026-03-15T11:00:00.000Z",
  "parentId": 42345678
}
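Assuming every comment record carries `parentId` as in the schema above, reassembling the flat list into a nested thread is a single pass. This is a minimal sketch; the `children` key is something I'm adding, not a field the scraper emits.

```python
def build_comment_tree(story_id, comments):
    """Nest a flat list of comment dicts under their parents via parentId."""
    by_id = {c["id"]: {**c, "children": []} for c in comments}
    roots = []
    for c in by_id.values():
        parent = by_id.get(c["parentId"])
        if parent is not None:
            parent["children"].append(c)
        elif c["parentId"] == story_id:
            roots.append(c)  # top-level comment attached to the story itself
    return roots
```

Top-level comments point at the story ID; everything else points at another comment, so one dictionary lookup per comment rebuilds the whole thread.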

Mode: Show HN

"Show HN" posts are project launches and demos. This is where founders announce new tools, and the community provides brutally honest feedback.

run_resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"category": "show", "maxItems": 50},
)

Use Case: Competitive Intelligence

Monitor Show HN for launches in your space. Track which products get upvoted (market validation) and read the comments for feature requests and criticism. This is free market research from your exact target audience.


Mode: Jobs

HN's jobs feed — hiring posts, mostly from YC-backed startups.

run_resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"category": "jobs", "maxItems": 100},
)

Use Case: Hiring Market Analysis

Track which technologies appear most in job postings over time. Identify which YC companies are scaling (more job posts = more funding traction).
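A counting pass over the job post output might look like the sketch below. The keyword list is an arbitrary example of mine; job posts put details in `title` and sometimes `text`, and `text` can be null, so both are handled.

```python
import re
from collections import Counter

TECH_TERMS = ["rust", "python", "react", "kubernetes", "postgres"]  # adjust to taste

def count_tech_mentions(job_posts, terms=TECH_TERMS):
    """Count whole-word tech mentions across job post titles and bodies."""
    counts = Counter()
    for post in job_posts:
        text = f"{post.get('title', '')} {post.get('text') or ''}".lower()
        for term in terms:
            # \b guards against substring hits like "rust" in "frustrated"
            if re.search(rf"\b{re.escape(term)}\b", text):
                counts[term] += 1
    return counts
```

Run this monthly and diff the counters to see which stacks are gaining ground in hiring.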


Mode: Search

Full-text search across all HN content via the Algolia API. This is the most powerful mode for targeted data collection.

# Search for posts about "AI agents" frameworks
run_resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "category": "search",
        "searchQuery": "AI agents framework",
        "maxItems": 200,
    },
)

Search Query Tips

  • Exact phrases: "machine learning" matches the exact phrase
  • Boolean: queries support AND/OR logic via Algolia syntax
  • By date: combine with Algolia's date parameters for time-filtered searches
  • By points: the Algolia API supports numericFilters for filtering by score
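The date and points filters from the tips above combine in Algolia's `numericFilters` parameter. Here's a sketch of building such a URL — the function name is mine, but `numericFilters`, `points`, and `created_at_i` (Unix seconds) are the fields the Algolia HN API documents.

```python
from urllib.parse import urlencode

def filtered_search_url(query, min_points=None, since_unix=None, tags="story"):
    """Build an Algolia HN search URL with numericFilters for score and date."""
    filters = []
    if min_points is not None:
        filters.append(f"points>{min_points}")
    if since_unix is not None:
        filters.append(f"created_at_i>{since_unix}")
    params = {"query": query, "tags": tags}
    if filters:
        params["numericFilters"] = ",".join(filters)  # comma = AND
    return "https://hn.algolia.com/api/v1/search?" + urlencode(params)

# e.g. stories about Zig with 100+ points since Jan 1, 2025 (UTC):
#   filtered_search_url("zig", min_points=100, since_unix=1735689600)
```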

Getting User Profiles

Set scrapeType to "users" or "both" to get author profiles alongside stories:

run_resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "category": "best",
        "maxItems": 50,
        "scrapeType": "both",
    },
)

User Profile Output

{
  "username": "techfounder",
  "karma": 15420,
  "about": "Building cool stuff. Previously at BigCorp.",
  "createdAt": "2015-03-10T08:00:00.000Z",
  "submittedCount": 892
}

High-karma users with long account histories are the opinion leaders. Their comments carry disproportionate weight in discussions.
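If you want to operationalize that, a filter over the profile output could look like this sketch. The karma and tenure thresholds are arbitrary examples, not anything the community defines.

```python
from datetime import datetime, timezone

def opinion_leaders(profiles, min_karma=10000, min_account_years=5):
    """Filter user profiles down to high-karma, long-tenured accounts."""
    now = datetime.now(timezone.utc)
    leaders = []
    for p in profiles:
        # createdAt is ISO 8601 with a trailing Z, per the schema above
        created = datetime.fromisoformat(p["createdAt"].replace("Z", "+00:00"))
        years = (now - created).days / 365.25
        if p["karma"] >= min_karma and years >= min_account_years:
            leaders.append(p)
    return leaders
```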


Scheduling Daily Collection

Set up automated daily scrapes for ongoing monitoring:

import json

schedule_resp = requests.post(
    "https://api.apify.com/v2/schedules",
    headers={
        "Authorization": f"Bearer {APIFY_TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "name": "hn-daily-top-stories",
        "cronExpression": "0 8 * * *",
        "timezone": "UTC",
        "actions": [{
            "type": "RUN_ACTOR",
            "actorId": "cryptosignals/hackernews-scraper",
            "runInput": {
                "body": json.dumps({
                    "category": "top",
                    "maxItems": 100,
                    "includeComments": True,
                    "maxCommentsPerStory": 50,
                }),
                "contentType": "application/json",
            },
        }],
    },
)

Each run creates a new dataset. Over time, you build a historical archive of what was trending on HN each day — invaluable for trend analysis and retrospective research.


Using Proxies for Scale

The official HN Firebase API has no documented rate limit, and the Apify actor handles concurrent requests with sensible batching. However, if you're combining HN data collection with scraping from other platforms in your pipeline, you'll want a proxy rotation layer.

ThorData provides residential and datacenter proxy pools with automatic rotation — useful for mixed scraping pipelines where you're hitting multiple sources at different rate limit thresholds.


Practical Analysis Examples

Trend Tracking Dashboard

import pandas as pd

# Assume `stories` is a list of story dicts from the scraper
df = pd.DataFrame(stories)
df["createdAt"] = pd.to_datetime(df["createdAt"])
df["date"] = df["createdAt"].dt.date

# Daily story volume and average score
daily = df.groupby("date").agg(
    story_count=("id", "count"),
    avg_score=("score", "mean"),
    total_comments=("commentCount", "sum"),
).reset_index()

print(daily.to_string(index=False))

Technology Mention Tracker

# Search for specific technologies and compare volume
technologies = ["Rust", "Go", "Python", "TypeScript", "Zig"]

for tech in technologies:
    resp = requests.post(
        f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs?waitForFinish=120",
        headers={
            "Authorization": f"Bearer {APIFY_TOKEN}",
            "Content-Type": "application/json",
        },
        json={
            "category": "search",
            "searchQuery": tech,
            "maxItems": 100,
        },
    )
    items = requests.get(
        f"https://api.apify.com/v2/datasets/{resp.json()['data']['defaultDatasetId']}/items",
        headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
    ).json()
    avg_score = sum(i.get("score", 0) for i in items) / max(len(items), 1)
    print(f"{tech}: {len(items)} posts, avg score {avg_score:.0f}")

Why Use an Apify Actor vs. Direct API

You might wonder: if the HN API is free and public, why not just call it directly?

Concern           Direct API                                        Apify Actor
Fetching stories  500 individual HTTP calls for 500 stories         Single API call, actor handles batching
Comment trees     Recursive fetching with parent/child resolution   Built-in with depth limits
Search            Algolia pagination loop                           Handled automatically
User profiles     Separate fetch per username                       Batch extraction with deduplication
Export            JSON only (build CSV yourself)                    JSON, CSV, Excel, XML
Scheduling        You manage cron, hosting, retries                 Built-in cron with monitoring
Error handling    Build retry logic                                 Automatic retries and alerts

For a quick one-off query, curl to the Firebase API is perfect. For production data pipelines — daily collection, comment extraction, multi-query analysis — the actor saves significant engineering time.


Getting Started

  1. Create a free Apify account
  2. Open the Hacker News Scraper
  3. Pick a category, set your limits, click Start
  4. Download results as JSON, CSV, or Excel

For the Python examples above, grab your API token from the Apify Console and replace YOUR_APIFY_API_TOKEN.

Hacker News has been the tech industry's town square for nearly 20 years. The data is public, structured, and free to access. Whether you're tracking trends, monitoring competitors, or studying how technical communities form opinions, the data is there — and now you know how to collect it at scale.


The Hacker News Scraper is available at apify.com/cryptosignals/hackernews-scraper. Built on the official HN Firebase API and Algolia HN Search.
