Hacker News API vs Web Scraping: Complete Data Extraction Guide 2026

Hacker News (HN) is the tech industry's watercooler — where startup founders, developers, and VCs share and discuss the latest in technology. Extracting HN data is valuable for trend analysis, competitive intelligence, and understanding what the developer community cares about.

In this guide, I'll compare using the official HN API versus web scraping, with practical Python examples for both.

Why Extract Hacker News Data?

  • Tech trend analysis: Spot emerging technologies and frameworks
  • Startup intelligence: Track which startups get attention
  • Content strategy: Find topics that resonate with developers
  • Hiring signals: Monitor "Who is Hiring?" threads
  • Sentiment tracking: Gauge developer opinion on tools and platforms

Method 1: The Official Hacker News API

HN provides a free Firebase-based API — no authentication required:

import requests
import time
from concurrent.futures import ThreadPoolExecutor

HN_API = "https://hacker-news.firebaseio.com/v0"

def get_top_stories(limit=30):
    """Fetch top stories from HN API."""
    response = requests.get(f"{HN_API}/topstories.json")
    story_ids = response.json()[:limit]

    stories = []
    for story_id in story_ids:
        item = requests.get(f"{HN_API}/item/{story_id}.json").json()
        if item:
            stories.append({
                "id": item.get("id"),
                "title": item.get("title", ""),
                "url": item.get("url", ""),
                "score": item.get("score", 0),
                "by": item.get("by", ""),
                "time": item.get("time", 0),
                "descendants": item.get("descendants", 0),
                "hn_url": f"https://news.ycombinator.com/item?id={item['id']}",
            })
    return stories


def get_item_with_comments(item_id, max_depth=3):
    """Fetch an item and its comment tree."""
    item = requests.get(f"{HN_API}/item/{item_id}.json").json()
    if not item:
        return None

    result = {
        "id": item["id"],
        "title": item.get("title", ""),
        "text": item.get("text", ""),
        "by": item.get("by", ""),
        "score": item.get("score", 0),
        "comments": [],
    }

    if max_depth > 0 and "kids" in item:
        for kid_id in item["kids"][:20]:  # Limit comments
            comment = get_item_with_comments(kid_id, max_depth - 1)
            if comment:
                result["comments"].append(comment)
            time.sleep(0.1)

    return result
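
A quick sanity check with the functions above, printing the current top five stories:

if __name__ == "__main__":
    for story in get_top_stories(limit=5):
        print(f"{story['score']:>4} pts  {story['title']}")
        print(f"         {story['hn_url']}")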

Faster API Fetching with Threading

The HN API requires one request per item, which is slow. Speed it up with threading:

def fetch_stories_parallel(story_ids, max_workers=10):
    """Fetch multiple stories in parallel."""
    def fetch_one(story_id):
        try:
            resp = requests.get(f"{HN_API}/item/{story_id}.json", timeout=10)
            return resp.json()
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(fetch_one, story_ids))

    return [r for r in results if r is not None]
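
For example, to pull the full top-30 list in parallel instead of one request at a time:

top_ids = requests.get(f"{HN_API}/topstories.json", timeout=10).json()[:30]
items = fetch_stories_parallel(top_ids)
print(f"Fetched {len(items)} items")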

Method 2: Web Scraping

For data not available through the API (like the rendered front page with rankings):

from bs4 import BeautifulSoup

def scrape_front_page():
    """Scrape the HN front page with rank positions."""
    response = requests.get("https://news.ycombinator.com/")
    soup = BeautifulSoup(response.text, "html.parser")

    stories = []
    rows = soup.select("tr.athing")

    for row in rows:
        rank_el = row.select_one(".rank")
        title_el = row.select_one(".titleline > a")
        site_el = row.select_one(".sitestr")

        # The subtext row (score, author, comments link) follows each story row
        subtext = row.find_next_sibling("tr")
        score_el = subtext.select_one(".score") if subtext else None
        author_el = subtext.select_one(".hnuser") if subtext else None
        # Job ads have no score or comments link, so guard the link lookup
        subtext_links = subtext.select("a") if subtext else []
        comments_el = subtext_links[-1] if subtext_links else None

        stories.append({
            "rank": int(rank_el.text.strip(".")) if rank_el else 0,
            "title": title_el.text if title_el else "",
            "url": title_el["href"] if title_el else "",
            "site": site_el.text if site_el else "",
            "score": int(score_el.text.split()[0]) if score_el else 0,
            "author": author_el.text if author_el else "",
            "comments": comments_el.text if comments_el else "",
            "hn_id": row.get("id", ""),
        })

    return stories
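
A quick run prints the live rankings:

for story in scrape_front_page()[:5]:
    print(f"{story['rank']:>2}. {story['title']} [{story['score']} pts]")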

API vs Web Scraping: When to Use Each

| Feature | API | Web Scraping |
| --- | --- | --- |
| Rate limits | ~30 req/s | Needs proxies at scale |
| Real-time data | Yes | Yes |
| Historical data | Item by ID | Not available |
| Rankings/positions | No | Yes |
| Auth required | No | No |
| Comment trees | Yes (slow) | Complex to parse |
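
The two methods also combine well: scrape the front page for rank positions, then enrich each story with clean structured fields from the API. A minimal sketch built from the functions defined above:

def front_page_with_api_data():
    """Merge scraped rank positions with structured API fields."""
    ranked = scrape_front_page()
    ids = [int(s["hn_id"]) for s in ranked if s["hn_id"]]
    api_items = {item["id"]: item for item in fetch_stories_parallel(ids)}

    for story in ranked:
        item = api_items.get(int(story["hn_id"])) if story["hn_id"] else None
        if item:
            # Prefer the API's structured fields over scraped text
            story["score"] = item.get("score", story["score"])
            story["descendants"] = item.get("descendants", 0)
    return ranked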

Monitoring Trends Over Time

import json
from datetime import datetime, timezone

def track_daily_trends(output_file="hn_trends.jsonl"):
    """Capture a daily snapshot of HN top stories."""
    stories = get_top_stories(limit=30)

    snapshot = {
        "timestamp": datetime.utcnow().isoformat(),
        "stories": stories,
    }

    with open(output_file, "a") as f:
        f.write(json.dumps(snapshot) + "\n")

    print(f"Saved {len(stories)} stories at {snapshot['timestamp']}")
    return stories
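
Once a few days of snapshots accumulate, simple analyses become possible. This sketch counts which domains hit the front page most often, reading the hn_trends.jsonl file written above:

from collections import Counter
from urllib.parse import urlparse

def top_domains(path="hn_trends.jsonl", n=10):
    """Count the most frequent story domains across all saved snapshots."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            snapshot = json.loads(line)
            for story in snapshot["stories"]:
                if story["url"]:
                    counts[urlparse(story["url"]).netloc] += 1
    return counts.most_common(n)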

Production Solutions

For production HN data extraction, pre-built scrapers are available on Apify.

For scaling your own HN scraper, ScraperAPI handles proxy rotation to avoid rate limiting when making thousands of API calls.
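
As a rough sketch of what that looks like (the endpoint and the api_key/url parameters follow ScraperAPI's documented request format; YOUR_API_KEY is a placeholder):

SCRAPERAPI_KEY = "YOUR_API_KEY"  # placeholder: substitute your own key

def fetch_via_scraperapi(url):
    """Route a request through ScraperAPI's rotating proxies."""
    resp = requests.get(
        "https://api.scraperapi.com/",
        params={"api_key": SCRAPERAPI_KEY, "url": url},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text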

Analyzing "Who is Hiring?" Threads

import re

def parse_hiring_thread(item_id):
    """Extract job listings from monthly Who is Hiring threads."""
    thread = get_item_with_comments(item_id, max_depth=1)
    if not thread:
        return []

    jobs = []

    for comment in thread.get("comments", []):
        text = comment.get("text", "")
        if not text:
            continue

        # Extract company name (usually first line)
        first_line = text.split("<p>")[0] if "<p>" in text else text[:100]

        # Look for common patterns
        remote = bool(re.search(r"\bremote\b", text, re.I))
        location = re.search(r"\|\s*([^|]+?)\s*\|", first_line)

        jobs.append({
            "company": first_line[:80],
            "remote": remote,
            "location": location.group(1) if location else "",
            "text": text[:500],
            "by": comment.get("by", ""),
        })

    return jobs
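
To find the thread ID in the first place, one option is the public Algolia-powered HN Search API. A sketch, assuming the thread keeps its usual "Ask HN: Who is hiring?" title:

def find_latest_hiring_thread():
    """Return the item ID of the most recent Who is Hiring thread."""
    resp = requests.get(
        "https://hn.algolia.com/api/v1/search_by_date",
        params={"query": "Ask HN: Who is hiring?", "tags": "story"},
        timeout=10,
    )
    for hit in resp.json().get("hits", []):
        if hit.get("title", "").lower().startswith("ask hn: who is hiring?"):
            return int(hit["objectID"])
    return None

# Usage: jobs = parse_hiring_thread(find_latest_hiring_thread())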

Best Practices

  1. Prefer the API for structured data — it's official and reliable
  2. Use web scraping only for ranking data or rendered content
  3. Thread your API calls — HN requires one request per item
  4. Cache aggressively — stories don't change much after the first hour (see the sketch after this list)
  5. Use ScraperAPI if you're making thousands of requests
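
For point 4, a minimal caching sketch, assuming single-process use (a dict keyed by item ID with a TTL; swap in Redis or similar if you need persistence):

_item_cache = {}  # item_id -> (fetched_at, item)
CACHE_TTL = 3600  # seconds; stories rarely change after the first hour

def get_item_cached(item_id):
    """Fetch an HN item, reusing a cached copy while it is still fresh."""
    now = time.time()
    cached = _item_cache.get(item_id)
    if cached and now - cached[0] < CACHE_TTL:
        return cached[1]
    item = requests.get(f"{HN_API}/item/{item_id}.json", timeout=10).json()
    _item_cache[item_id] = (now, item)
    return item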

Conclusion

Hacker News offers both a clean API and scrapable HTML, making it one of the easiest tech community data sources to work with. Whether you use the API, web scraping, or managed solutions like the HN Scraper on Apify, the data is incredibly valuable for tech trend analysis.

Happy hacking!
