DEV Community

agenthustler
Best Hacker News Scrapers in 2026: Algolia API vs Apify Actors Compared

Hacker News is arguably the most influential tech community on the internet. With 10+ million monthly visitors, it's where startup founders, VCs, and senior engineers surface the links that shape the industry. If you're doing market research, competitive intelligence, or trend analysis — scraping HN data is a smart move.

But how should you do it? In this guide, I'll compare the main approaches: the official Algolia API, dedicated Apify actors, and when each makes sense. I built one of these actors myself, so I'll be upfront about that — full disclosure throughout.

Option 1: The Algolia HN Search API (Free, No Auth)

Before reaching for any scraping tool, know this: Hacker News has a free, public search API powered by Algolia. No API key required.

Searching stories

import requests

# Search for stories mentioning "LLM"
resp = requests.get("https://hn.algolia.com/api/v1/search", params={
    "query": "LLM",
    "tags": "story",
    "hitsPerPage": 10,
    "numericFilters": "created_at_i>1704067200"  # After Jan 2024
})

for hit in resp.json()["hits"]:
    print(f'{hit["points"]} pts | {hit["title"]}')
    print(f'  {hit.get("url", "N/A")}')

Fetching comments on a story

import requests

# Get all comments on a specific story by its HN item ID
resp = requests.get("https://hn.algolia.com/api/v1/items/38877423")
story = resp.json()

def walk_comments(children, depth=0):
    for c in children:
        if c.get("text"):
            print(f'{"  " * depth}{c["author"]}: {c["text"][:80]}...')
        walk_comments(c.get("children", []), depth + 1)

walk_comments(story.get("children", []))

The Algolia API covers full-text search, date filtering via unix timestamps, and individual item lookup. Rate limits are generous (10,000 requests/hour). For many use cases, this is all you need.
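Those unix timestamps are easy to generate. The `created_at_i>1704067200` filter in the first example, for instance, comes from converting a plain date to epoch seconds:

```python
from datetime import datetime, timezone

# numericFilters compares against created_at_i, a unix timestamp,
# so build the cutoff from a readable UTC date
cutoff = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp())
print(f"created_at_i>{cutoff}")  # created_at_i>1704067200
```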

When Algolia falls short: no bulk export, no scheduled monitoring, no domain extraction from story URLs, limited to 1,000 results per query, and no way to track front page rankings over time.
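The 1,000-result cap can be worked around by hand, though it takes effort. A common technique (my own sketch, not an official recipe) is to page backwards through the `search_by_date` endpoint, using the oldest `created_at_i` seen so far as a moving upper bound:

```python
import requests

API = "https://hn.algolia.com/api/v1/search_by_date"

def fetch_all(query, limit=5000):
    """Collect up to `limit` matching stories, newest first, by sliding
    the created_at_i upper bound below Algolia's per-query cap."""
    results, upper = [], None
    while len(results) < limit:
        params = {"query": query, "tags": "story", "hitsPerPage": 1000}
        if upper is not None:
            params["numericFilters"] = f"created_at_i<{upper}"
        hits = requests.get(API, params=params).json()["hits"]
        if not hits:
            break
        results.extend(hits)
        upper = hits[-1]["created_at_i"]  # oldest item in this batch
    return results[:limit]
```

Note the caveat: stories sharing the exact same timestamp at a window boundary can be skipped or duplicated, so this is fine for trend analysis but not for exhaustive archival.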

Option 2: Apify Actors — The Scraper Marketplace

When you need more than what the API offers — bulk data, scheduled runs, webhooks, or enriched output — Apify actors fill the gap. Here's the HN scraper landscape as of March 2026:

| Actor | Users | 30-Day Runs | Pricing | Standout Feature |
| --- | --- | --- | --- | --- |
| epctex/hackernews-scraper | 160 | 474 | $10/mo | Most established, 5★ (8 reviews) |
| lucen_data/hacker-news-data-scraper | 59 | 780 | $0.001/result | Activity monitoring + Slack alerts |
| red.cars/hackernews-scraper-pro | 18 | 30 | $19/mo | No proxy required |
| fearless_sharpener/hacker-news-top-sites-scraper | 17 | 29 | $5/mo | Top sites focused |
| shahidirfan/hacker-news-data-scraper | 16 | 80 | Free | Basic scraping, zero cost |
| cryptosignals/hackernews-scraper | 5 | | Free | Date filtering, sort modes, domain extraction |

The market leader, epctex, has been live since 2022 and has the most users and reviews by far. Lucen Data stands out for activity monitoring and Slack integration — useful if you want alerts when a specific topic starts trending. shahidirfan offers a solid free option for basic needs.

Spotlight: cryptosignals/hackernews-scraper

Full disclosure: I built this one. It's free and focuses on features the Algolia API doesn't provide natively:

  • Date range filtering — fetch stories from a specific time window with simple date strings, not unix timestamps
  • Multiple sort modes — by date, score, or number of comments
  • Domain extraction — automatically parses the source domain from each story URL
  • Story ID input — pass specific HN story IDs for targeted data retrieval
  • 5,000 result limit — go beyond Algolia's 1,000-per-query cap
  • Enhanced output — clean JSON with all metadata including descendants count and extracted domains
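For context, the domain-extraction step amounts to roughly this (my own illustration of the idea, not the actor's actual code):

```python
from urllib.parse import urlparse

def extract_domain(url):
    """Pull the bare source domain from a story URL, dropping a
    leading "www." so hits from the same site group together."""
    if not url:
        return None  # Ask HN / text posts have no external URL
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

print(extract_domain("https://www.example.com/post/1"))  # example.com
```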

Python example

from apify_client import ApifyClient

client = ApifyClient("your_apify_token")

run = client.actor("cryptosignals/hackernews-scraper").call(run_input={
    "search": "AI agents",
    "sort": "byPopularity",
    "dateFrom": "2026-01-01",
    "dateTo": "2026-03-20",
    "maxResults": 100
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f'{item["points"]} pts | {item["title"]}')
    print(f'  Domain: {item.get("domain", "N/A")}')
    print(f'  Comments: {item.get("num_comments", 0)}')

Is it the most popular? Not by a long shot — epctex has 30x more users. But if you need date filtering with human-readable dates and automatic domain extraction without a monthly subscription, it's worth a look.

When to Use What: Decision Guide

Use the Algolia API when:

  • You need real-time search with sub-second response times
  • Your queries fit within 1,000 results
  • You're building an app that queries HN on-demand
  • You don't need scheduled data collection
  • You're comfortable working with unix timestamps for dates

Use an Apify actor when:

  • You need bulk data export (thousands of stories at once)
  • You want scheduled scraping with webhooks or Slack alerts
  • You need enriched output (domain extraction, activity scores)
  • You're feeding data into a pipeline (Apify integrates with Google Sheets, Slack, Zapier, n8n, and more)
  • You want a managed solution without maintaining infrastructure

Build your own scraper when:

  • You need to track actual front page rankings over time (not available via any API)
  • You have very specific data transformation requirements
  • You're comfortable maintaining your own infrastructure and handling rate limits

The HN Official API (Firebase)

Worth mentioning: HN also has an official Firebase API that returns individual items by ID. It's real-time but has no search — you'd need to poll item IDs sequentially. It's best suited for building live HN clients, not for data extraction.
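In practice that polling looks like this: one request for the ranked ID list, then one request per item (endpoints per the official HackerNews/API documentation):

```python
import requests

BASE = "https://hacker-news.firebaseio.com/v0"

def front_page(n=5):
    """Fetch the current top-ranked stories: one request for the ID
    list, then one request per item -- there is no batch endpoint."""
    ids = requests.get(f"{BASE}/topstories.json").json()
    return [requests.get(f"{BASE}/item/{i}.json").json() for i in ids[:n]]

# Example (makes live requests):
# for story in front_page(3):
#     print(f'{story.get("score", 0)} pts | {story.get("title", "")}')
```

The one-request-per-item model is why this API is a poor fit for bulk extraction but a good fit for live clients.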

Conclusion

The HN data ecosystem is surprisingly well-served. The Algolia API handles 80% of use cases for free, with zero setup. For the remaining 20% — bulk export, scheduling, monitoring, enriched output — Apify actors provide turnkey solutions at various price points.

My recommendation: start with the Algolia API. If you hit its limits (1,000 results, no scheduling, no domain extraction), then look at the actors. The epctex actor is the most battle-tested choice. For a free alternative with date filtering and domain extraction, give mine a try.

The best scraper is the one that fits your workflow. Don't over-engineer it.
