Devil Scrapes

Posted on Jun 1

Hacker News Scraper: export top, new, best, ask, show & jobs to JSON

#webscraping #python #apify #data

Quick answer: Hacker News publishes a documented Firebase API — but it returns item IDs, not data. To get top stories with titles, scores, authors, comment counts, and timestamps you have to fan out one HTTP call per story across all six feeds. A Hacker News scraper automates that fan-out and returns typed dataset rows you can export to JSON, CSV, or a warehouse. The Apify Actor below does it for $0.002 per story (~$2.00 per 1,000), with the concurrency and pacing handled for you.

Hacker News has an outsized cultural footprint for a site that's never shown a single ad. The front page has launched companies like Stripe, Dropbox, and Notion, and Show HN posts routinely produce a product's first traction. For anyone building VC scouting tools, trend monitors, RAG corpora, or newsletter pipelines, the HN data stream is valuable — but extracting it at scale takes more plumbing than it looks.

What is Hacker News? 🎯

Hacker News is a social link aggregator and discussion forum operated by Y Combinator since 2007. It serves six named feeds: top (the current front page), new (chronological firehose), best (time-decayed quality ranking), ask (Ask HN self-posts), show (Show HN project launches), and jobs (YC-portfolio job listings). Per the official API documentation, the top, new, and best feeds expose up to 500 item IDs each, while ask, show, and jobs expose up to 200. There's no ad layer and no algorithmic feed beyond the community's own ranking: what's on the front page is what got voted up.

Does Hacker News have an API? 🔎

Yes, but it returns IDs, not data. Y Combinator operates a documented Firebase REST API that is free and needs no authentication. The feed endpoints (/v0/topstories.json, /v0/newstories.json, etc.) return an array of integer item IDs: up to 500 for top, new, and best, and up to 200 for ask, show, and jobs, per the API documentation. To get any actual story content (title, URL, score, comment count, author, timestamp) you must request each item individually at /v0/item/<id>.json. That's an N+1 problem by design: one list call plus one call per story, up to 501 requests for a full feed export. Managing the concurrency and the fan-out cleanly is the job the Actor does for you.
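To make the N+1 shape concrete, here is a minimal, sequential sketch against the raw API using only the standard library. The function names and the injectable `fetch` parameter are illustrative assumptions, not part of the Actor; the two endpoint paths are the ones named above.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url):
    """Fetch one URL and decode its JSON body (blocking, stdlib only)."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def export_feed(feed="topstories", limit=5, fetch=fetch_json):
    """Return the first `limit` items of a feed: 1 list call + `limit` item calls.

    `fetch` is injectable (an assumption of this sketch) so the fan-out can be
    exercised without touching the network.
    """
    # One request for the ordered list of IDs.
    ids = fetch(f"{BASE_URL}/{feed}.json")[:limit]
    # One more request per story, in feed order.
    return [fetch(f"{BASE_URL}/item/{item_id}.json") for item_id in ids]
```

Calling `export_feed("showstories", limit=10)` would issue 11 requests in sequence. That is fine for a handful of stories, and it is exactly the part that gets slow and fragile at full feed size.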

What the data looks like

Every story becomes one flat, typed row. Here is a real output record:

{
  "id": 39000000,
  "type": "story",
  "title": "Show HN: Devil Scrapes — public-data Apify Actors with honest pricing",
  "url": "https://apify.com/DevilScrapes",
  "permalink": "https://news.ycombinator.com/item?id=39000000",
  "by": "devilscrapes",
  "score": 142,
  "descendants": 33,
  "text": null,
  "time": 1778875200,
  "posted_at": "2026-05-15T20:00:00+00:00",
  "scraped_at": "2026-05-15T20:05:00+00:00",
  "rank": 1
}

Thirteen fields, same shape every time, Pydantic-validated before it lands in the dataset. The rank field, the story's position in the feed at the moment of the scrape, is the key piece the raw Firebase API doesn't expose.
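As a rough illustration of how a raw Firebase item could be turned into that thirteen-field row, here is a sketch that uses a stdlib dataclass as a stand-in for the Pydantic model mentioned above. The class name, the `to_row` helper, and which fields are optional are all assumptions made for this example, not the Actor's actual code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class StoryRow:
    """Hypothetical row type mirroring the thirteen output fields."""
    id: int
    type: str
    title: Optional[str]
    url: Optional[str]
    permalink: str
    by: Optional[str]
    score: Optional[int]
    descendants: Optional[int]
    text: Optional[str]
    time: int
    posted_at: str
    scraped_at: str
    rank: int


def to_row(item: dict, rank: int, scraped_at: datetime) -> StoryRow:
    """Flatten one raw item into a row, deriving the fields Firebase lacks."""
    return StoryRow(
        id=item["id"],
        type=item["type"],
        title=item.get("title"),
        url=item.get("url"),  # absent on self-posts
        permalink=f"https://news.ycombinator.com/item?id={item['id']}",
        by=item.get("by"),
        score=item.get("score"),
        descendants=item.get("descendants"),
        text=item.get("text"),
        time=item["time"],
        # Unix epoch seconds -> ISO-8601 string in UTC.
        posted_at=datetime.fromtimestamp(item["time"], tz=timezone.utc).isoformat(),
        scraped_at=scraped_at.isoformat(),
        rank=rank,
    )
```

Passing a timezone-aware `scraped_at` keeps both timestamp columns in the same `+00:00` format as the sample record.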

The naive approach (and why it falls apart) ⚙️

The first cut looks easy: fetch the feed endpoint, iterate the IDs, asyncio.gather the item calls. Three things break it in production:

1. Concurrency against a rate-limited upstream. The Firebase endpoint is tolerant, but hammering it with 500 concurrent calls will get your IP paced or blocked. We gate the fan-out behind an asyncio.Semaphore at a configurable limit (default 8 simultaneous fetches, tunable 1–32) so we stay polite on the upstream instead of stampeding it.

2. TLS fingerprinting. Edge infrastructure can inspect the TLS ClientHello even on a public API, and Python's stdlib ssl module emits a fingerprint that's identifiably non-browser, which some CDNs use to challenge or throttle. We send requests through curl-cffi impersonating a real Chrome, Firefox, or Safari TLS + HTTP/2 fingerprint, so the handshake matches what that browser would send. Each run picks a profile from a rotating pool so consecutive runs don't share an identical fingerprint.

3. Fan-out bookkeeping. When you're halfway through a 500-story export and the upstream 503s on item 312, a naive implementation lets one bad fetch poison the whole batch. We isolate every item fetch: a network error or non-200 on one story is logged and skipped, and the other 499 still land in your dataset. Dead and deleted items (which Firebase flags "dead" / "deleted") are filtered the same way — so you never get a silent crash and never get junk rows masquerading as data.
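Points 1 and 3 can be sketched together in plain asyncio: a semaphore bounds the number of in-flight fetches, and each fetch is wrapped so one failure cannot sink the batch. This is a minimal sketch, not the Actor's source. `fetch_item` is a hypothetical caller-supplied coroutine that returns a decoded item dict or raises, and the HTTP layer (including the TLS impersonation from point 2) is deliberately left out.

```python
import asyncio
import logging

logger = logging.getLogger("hn_fanout")


async def fetch_all(ids, fetch_item, concurrency=8):
    """Fetch every ID with at most `concurrency` requests in flight.

    Failed, missing, dead, and deleted items are skipped; surviving rows
    carry their 1-indexed position in `ids` as `rank`.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def fetch_one(rank, item_id):
        async with semaphore:
            try:
                item = await fetch_item(item_id)
            except Exception as exc:
                # Isolate the failure to this one story and keep going.
                logger.warning("skipping item %s: %s", item_id, exc)
                return None
        # The API returns null for unknown IDs and flags dead/deleted items.
        if not item or item.get("dead") or item.get("deleted"):
            return None
        return {**item, "rank": rank}

    results = await asyncio.gather(
        *(fetch_one(rank, item_id) for rank, item_id in enumerate(ids, start=1))
    )
    return [row for row in results if row is not None]
```

Threading `rank` through `enumerate` at submission time means the position survives regardless of the order in which responses arrive.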

None of this is hard conceptually, but all of it takes time to get right — and getting it wrong means blank rows on the day you need clean data.

The Actor

The result is an Apify Actor: Hacker News Scraper. Run it from the Apify Console by picking a feed and clicking Start, or call it from Python with the Apify API client (the apify-client package):

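from apify_client import ApifyClient

# Authenticate with your personal Apify API token.
client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("DevilScrapes/hacker-news-scraper").call(
    run_input={
        "feed": "show",        # which of the six feeds to export
        "maxResults": 200,     # cap on stories returned
        "includeText": True,   # include self-post HTML bodies
        "concurrency": 8,      # parallel item fetches
    }
)

# Iterate over the rows the run wrote to its default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["rank"], item["score"], item["title"])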

The input parameters:

Field	Default	Notes
`feed`	`"top"`	One of: `top`, `new`, `best`, `ask`, `show`, `jobs`
`maxResults`	`100`	0–500; set to 0 for the feed's full length
`includeText`	`true`	Fetches self-post HTML body for Ask HN / Show HN entries
`concurrency`	`8`	Parallel item fetches; 1–32
`proxyConfiguration`	off	Apify Proxy toggle — enable for compliance routing

Output streams into the run's dataset in real time — export as JSON, CSV, or Excel from the Console, or pull it via the dataset API.

What you would actually use this for 💡

Five concrete scenarios:

VC / startup scouting. Schedule a show feed run every morning and diff the new IDs against yesterday's export. When a Show HN post scores more than 50 points within four hours, that's a signal worth investigating — and you're watching it before TechCrunch does.

Newsletter curation. Feed the top 10 stories from the best feed into a weekly digest. The rank and score fields let you filter automatically — consistent criteria, reproducible every week.

Comment-volume alerts. Pipe rows into a Slack webhook. When descendants crosses 100 for a story in your interest category, trigger the alert and read the thread while it's hot.

ML training data. Historical top-story metadata (titles, scores, comment counts, posted timestamps) is a solid benchmark for headline quality, click-through prediction, or tech-discourse topic modelling. The scraped_at and rank columns let you reconstruct the front page at any moment you captured with a scheduled run.

Lead gen for developer tools. Watch the show feed for Show HN posts that mention your stack. The title field is full-text; filter for your keywords and reach out while the post is live.
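The scouting scenario, for instance, reduces to a small filter over two exports. The sketch below assumes rows shaped like the sample record (with `posted_at` and `scraped_at` as ISO-8601 strings); the function name and default thresholds are illustrative and simply mirror the "50 points within four hours" rule of thumb.

```python
from datetime import datetime


def fresh_breakouts(today_rows, yesterday_ids, min_score=50, max_age_hours=4):
    """Return rows that are new since yesterday and scored quickly.

    `yesterday_ids` is a set of story IDs from the previous export.
    """
    hits = []
    for row in today_rows:
        posted = datetime.fromisoformat(row["posted_at"])
        scraped = datetime.fromisoformat(row["scraped_at"])
        age_hours = (scraped - posted).total_seconds() / 3600
        is_new = row["id"] not in yesterday_ids
        score = row.get("score") or 0  # treat a missing score as zero
        if is_new and score > min_score and age_hours <= max_age_hours:
            hits.append(row)
    return hits
```

Feeding it the dataset items from a scheduled show run, plus the set of IDs saved from the previous day, yields the shortlist worth a closer look.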

Pricing — exact numbers 💰

Pay-per-event. You pay for the stories you get, nothing for the ones that don't land.

Event	Price
Actor start (one-off warm-up)	$0.005
Per story written to dataset	$0.002

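Each run adds one $0.005 start charge to $0.002 per story. A run returns at most 500 stories, so the larger pulls below are split across several runs, each with its own start charge.

Pull	Cost
100 stories (1 run)	$0.205
500 stories (full feed, 1 run)	$1.005
1,000 stories (2 runs)	$2.010
10,000 stories (20 runs)	$20.100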

Apify's $5 free trial credit covers roughly 2,490 stories before you need to add a card: four full 500-story feed exports plus most of a fifth, enough to validate any use case.
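The arithmetic is easy to check yourself. The sketch below works in mills (tenths of a cent) to avoid floating-point rounding; the two prices come from the table above, while the function names and the 500-story-per-run cap as a constant are assumptions of this example.

```python
START_FEE_MILLS = 5        # $0.005 charged once per run
STORY_FEE_MILLS = 2        # $0.002 per story written to the dataset
MAX_STORIES_PER_RUN = 500  # a run never returns more than this


def run_cost_mills(stories):
    """Cost of a single run that returns `stories` rows, in mills."""
    if not 0 <= stories <= MAX_STORIES_PER_RUN:
        raise ValueError("a single run returns between 0 and 500 stories")
    return START_FEE_MILLS + STORY_FEE_MILLS * stories


def stories_within_budget(budget_mills):
    """Most stories a budget buys when every run is filled before the next starts."""
    total = 0
    while budget_mills > START_FEE_MILLS:
        affordable = min(
            MAX_STORIES_PER_RUN,
            (budget_mills - START_FEE_MILLS) // STORY_FEE_MILLS,
        )
        if affordable == 0:
            break
        total += affordable
        budget_mills -= run_cost_mills(affordable)
    return total
```

With these numbers, `run_cost_mills(500)` is 1005 mills ($1.005), and `stories_within_budget(5000)` comes to 2,487 stories for the $5 credit.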

The part that makes this more than a thin Firebase wrapper

The HN Firebase API is real, documented, and free. What it doesn't give you is rank. The API returns an ordered list of IDs, but once you fan out and collect all 500 records concurrently they come back out of order — you lose the original position unless you thread it through explicitly. This Actor preserves it: each row carries its 1-indexed rank in the feed at scrape time. That one field turns the dataset from a bag of stories into a time-series of front-page positions — useful for scoring velocity (how fast did a story climb?) and reconstructing any past front page from a scheduled archive.

The posted_at field is a second addition: Firebase returns time as Unix epoch seconds, so the Actor derives a consistent ISO-8601 UTC string on every row — letting you ORDER BY posted_at in SQL without a conversion step.
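As one way to use the rank column, here is a sketch of the "how fast did a story climb?" calculation between two scheduled snapshots of the same feed. The function name is hypothetical, and it assumes each snapshot is a list of rows with `id` and `rank` as in the sample record.

```python
def rank_changes(earlier, later):
    """Map story id -> places climbed between two snapshots.

    Positive means the story moved up the feed; stories missing from the
    earlier snapshot are left out because they have no baseline.
    """
    earlier_rank = {row["id"]: row["rank"] for row in earlier}
    return {
        row["id"]: earlier_rank[row["id"]] - row["rank"]
        for row in later
        if row["id"] in earlier_rank
    }
```

Dividing each change by the gap between the two `scraped_at` values would turn it into places per hour.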

Limitations 🚧

Comment threads are not expanded. You get descendants (the count), not the comment tree. A sibling hacker-news-comments-scraper is on the roadmap if there's demand.
Dead and deleted stories are skipped. Firebase marks them "dead": true or "deleted": true; we filter them out. Need them? Open an issue.
text for self-posts is HTML, not Markdown. Ask HN and Show HN bodies come back as HTML from Firebase — run them through a sanitiser before display.
Up to 500 items per feed per run. Firebase caps each feed endpoint (500 IDs for top, new, and best; 200 for ask, show, and jobs, per the API documentation), and we don't go beyond that because the endpoint doesn't either.
Not a historical archive. This Actor scrapes the live feeds. For full historical HN data (38 million items), use the BigQuery public dataset or the Algolia HN Search API.
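For the HTML text limitation, if all you need is plain text (for search, embeddings, or a digest), the standard library is enough. The sketch below is a text extractor, not a sanitiser: it drops all markup rather than making HTML safe to render, and the class and function names are illustrative. If you intend to render the HTML itself, a maintained sanitiser library is the safer route.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, turning <p> boundaries into blank lines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.parts.append("\n\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the plain-text content of a self-post body (empty string for None)."""
    if not html:
        return ""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

Link targets are discarded along with the tags, so keep the original `text` column if you need the URLs later.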

FAQ ❓

Is scraping Hacker News legal?
The data comes from Y Combinator's own documented public Firebase API, which is free and requires no authentication. We read only what the API exposes publicly, at a paced rate. Check your own jurisdiction and use case before building a commercial product on top of the data.

Is there an official Hacker News API?
Yes — the HN Firebase API (hacker-news.firebaseio.com) is documented and free. This Actor wraps it: you get the fan-out, concurrency control, ISO timestamps, and the rank column without writing the plumbing. If you only need five stories and want to write the code yourself, the raw API is the right choice.

Can I export to Google Sheets, S3, or a warehouse?
Yes — export CSV/Excel/JSON from the Apify Console, set a webhook on ACTOR.RUN.SUCCEEDED to push to Make, Zapier, or n8n, or query the dataset via the Apify API.

Can I scrape the comment threads too?
Not in this Actor — descendants gives you the comment count, but the tree itself isn't fetched. A sibling hacker-news-comments-scraper is on the roadmap if there's demand.

Try it

The Actor is on the Apify Store: apify.com/DevilScrapes/hacker-news-scraper.

Free $5 trial credit, no credit card. Set feed to show and maxResults to 100, and you'll have the top 100 Show HN posts with scores, comment counts, and ranks in your dataset in under a minute. Found a use case I missed, or a field you need added? Drop it in the comments. I ship based on what people actually use.

Built by Devil Scrapes — Apify Actors that do the dirty work so your dataset stays clean. Pay-per-event, honest pricing, no junk fields. 😈

DEV Community