agenthustler

Scraping Hacker News in 2026: Complete Guide (Algolia API + Date Filters)

Why Scrape Hacker News?

Hacker News gets 10M+ visits/month from developers, founders, and investors. Whether you're tracking trends, monitoring mentions of your product, or building a dataset for analysis — HN data is gold.

The good news: HN has an official search API powered by Algolia. Most devs don't know about it. Let me show you how to use it properly.

The Algolia Search API

Base URL: https://hn.algolia.com/api/v1/

Fetching Stories by Type

import requests

# Search stories
resp = requests.get("https://hn.algolia.com/api/v1/search", params={
    "query": "LLM",
    "tags": "story",
    "hitsPerPage": 20
})
stories = resp.json()["hits"]

for s in stories:
    # Use single quotes inside the f-string; escaped double quotes
    # are a SyntaxError before Python 3.12. Ask HN posts have url=None.
    print(f"{s['points']}pts - {s['title']} ({s.get('url')})")

The tags parameter controls what you get:

Tag          What it returns
story        All stories
ask_hn       Ask HN posts
show_hn      Show HN posts
comment      Comments only
front_page   Currently on front page
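
Tags also compose: the API treats comma-separated tags as AND and a parenthesized group as OR, and author_<username> is a tag too. A small helper can build these values (combine_tags is my own name, not part of the API):

```python
def combine_tags(*groups):
    """Build a tags value for the HN Search API.

    Plain strings are ANDed (comma-joined); a list/tuple becomes
    a parenthesized OR group.
    """
    parts = []
    for g in groups:
        if isinstance(g, (list, tuple)):
            parts.append("(" + ",".join(g) + ")")
        else:
            parts.append(g)
    return ",".join(parts)

# Ask HN OR Show HN posts
print(combine_tags(("ask_hn", "show_hn")))   # (ask_hn,show_hn)

# Stories by a specific user (AND of two tags)
print(combine_tags("story", "author_pg"))    # story,author_pg
```

Pass the result as the tags param in the requests above.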

Date Filtering with numericFilters

This is the killer feature most people miss. You can filter by Unix timestamp:

import time

# Stories from the last 24 hours
yesterday = int(time.time()) - 86400
resp = requests.get("https://hn.algolia.com/api/v1/search_by_date", params={
    "tags": "story",
    "numericFilters": f"created_at_i>{yesterday}",
    "hitsPerPage": 50
})

You can combine filters with commas: created_at_i>X,created_at_i<Y,points>100 gives you highly voted stories within a date range.
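
A sketch of composing that combined filter string and the final request URL (date_window_filter is a hypothetical helper, not an API call; the param names are the real ones from above):

```python
import time
from urllib.parse import urlencode

def date_window_filter(start_ts, end_ts, min_points=None):
    """Compose a numericFilters value: a date range plus an optional score floor."""
    parts = [f"created_at_i>{start_ts}", f"created_at_i<{end_ts}"]
    if min_points is not None:
        parts.append(f"points>{min_points}")
    return ",".join(parts)

now = int(time.time())
week_ago = now - 7 * 86400

params = {
    "tags": "story",
    "numericFilters": date_window_filter(week_ago, now, min_points=100),
    "hitsPerPage": 50,
}
url = "https://hn.algolia.com/api/v1/search_by_date?" + urlencode(params)
print(url)
```

urlencode percent-encodes the commas, which Algolia accepts just like the literal form.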

Sorting: Relevance vs Date

Two endpoints handle this:

  • /search — sorted by relevance (default)
  • /search_by_date — sorted by date (newest first)

For monitoring use cases (tracking mentions, watching trends), search_by_date is what you want.
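For example, a mention monitor can poll search_by_date with the last-seen timestamp so each run returns only new items. A minimal sketch (mention_params is my own helper name; the params follow the API described above):

```python
def mention_params(term, since_ts):
    """Params for stories OR comments mentioning `term`
    created after the Unix timestamp `since_ts`."""
    return {
        "query": term,
        "tags": "(story,comment)",              # parenthesized group = OR
        "numericFilters": f"created_at_i>{since_ts}",
        "hitsPerPage": 100,
    }

params = mention_params("yourproduct", 1_700_000_000)
print(params["numericFilters"])  # created_at_i>1700000000
```

On each poll, record the newest created_at_i among the hits and pass it as since_ts next time.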

Fetching Comment Trees

Each story has a nested comment tree. You can fetch it by item ID:

# Get the full item with its nested comment tree
# (story_id is the objectID of a hit from a previous search)
item = requests.get(
    f"https://hn.algolia.com/api/v1/items/{story_id}"
).json()

def walk_comments(children, depth=0):
    for c in children:
        # "text" is HTML and may be missing on deleted comments
        text = (c.get("text") or "")[:80]
        print("  " * depth + text)
        walk_comments(c.get("children", []), depth + 1)

walk_comments(item.get("children", []))

User Profiles

user = requests.get(
    "https://hn.algolia.com/api/v1/users/pg"
).json()
print(f"Karma: {user['karma']}, Account created: {user['created_at']}")

Domain Extraction

Want to find all HN submissions from a specific domain? The API supports this natively:

/search?query=techcrunch.com&tags=story&restrictSearchableAttributes=url

This is incredibly useful for competitive analysis — see how your competitors' content performs on HN.
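
The same query from Python, built with the stdlib so you can see the final URL (the params mirror the endpoint above; restricting the search to the url attribute avoids matching the domain name in titles):

```python
from urllib.parse import urlencode

params = {
    "query": "techcrunch.com",              # the domain you're tracking
    "tags": "story",
    "restrictSearchableAttributes": "url",  # match the query against URLs only
}
url = "https://hn.algolia.com/api/v1/search?" + urlencode(params)
print(url)
```

Feed this URL to requests.get (or any HTTP client) exactly like the earlier snippets.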

Rate Limits and Gotchas

The Algolia API is generous but has limits:

  • 10,000 requests/hour (IP-based)
  • Max 1,000 results per query (pagination tops out at page 50 × 20 hits)
  • No bulk export endpoint — you need to paginate through date windows

For anything beyond light usage, you'll hit pagination limits fast. If you need full historical data or continuous monitoring, that's where purpose-built tools come in.
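
One way to work around the 1,000-result cap is to slice your date range into windows small enough that each window stays under the cap, then query each window separately with numericFilters. A sketch of the window generator (date_windows is my own helper; 86,400 seconds = one day):

```python
def date_windows(start_ts, end_ts, step=86400):
    """Yield (lo, hi) Unix-timestamp pairs covering [start_ts, end_ts)."""
    lo = start_ts
    while lo < end_ts:
        hi = min(lo + step, end_ts)
        yield lo, hi
        lo = hi

# Each (lo, hi) pair becomes numericFilters=created_at_i>{lo},created_at_i<={hi}
windows = list(date_windows(0, 250_000))
print(len(windows))  # 3 windows: two full days plus a partial one
```

If a single window still returns nbHits above 1,000, halve the step and retry that window.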

Scaling Up: Automated HN Scraping

For production use cases — daily trend reports, mention monitoring, or building datasets — I built an Apify actor for Hacker News scraping that handles:

  • All story types (top, new, best, Ask HN, Show HN)
  • Date range filtering with automatic pagination
  • sortBy parameter (relevance, date, points)
  • Domain extraction from submitted URLs
  • Full comment trees with nested replies
  • Built-in proxy rotation and retry logic

It runs on Apify's infrastructure so you don't need to manage rate limits or pagination yourself.

Quick Reference

Use Case            Endpoint            Key Params
Search stories      /search             query, tags=story
Recent stories      /search_by_date     tags=story, numericFilters
Front page          /search             tags=front_page
Comments on story   /items/{id}         -
User profile        /users/{username}   -
Domain filter       /search             restrictSearchableAttributes=url

HN's API is one of the best-kept secrets in the scraping world. For quick scripts, it's all you need. For production pipelines, pair it with proper infrastructure and you're set.
