
Can Yılmaz

Originally published at apify.com

Comparing approaches to extracting Hacker News Who Is Hiring data

There is more than one way to get Hacker News Who Is Hiring into a structured dataset, and the right answer depends a lot on how often you need fresh data, how much volume you are after, and how much engineering time you want to spend on the plumbing. Here is the trade-off matrix I worked through before settling on an approach.

What the data looks like (regardless of approach)

The goal is to scrape structured job listings, with salary and tech-stack fields, from the monthly Hacker News "Ask HN: Who is hiring?" threads. The end-state schema is more or less fixed by the source (a typed sketch follows the field list):

  • commentId -- HN ID of the job comment
  • threadId -- HN ID of the monthly thread
  • threadTitle -- e.g. "Ask HN: Who is hiring? (May 2026)"
  • threadMonth -- month the thread covers
  • author -- HN username of the poster
  • company -- company name parsed from the comment
  • role -- job title parsed from the comment
  • location -- office location(s)
  • remote -- remote / hybrid / onsite status
  • salary -- salary or range, when stated
  • techStack -- technologies mentioned in the post
  • visa -- visa sponsorship details, when stated
  • applyUrl -- application link, when present
  • email -- contact email, when present
  • fullText -- the raw comment text
  • postedAt -- when the comment was posted
  • hnUrl -- permalink to the comment on HN
  • scrapedAt -- when the record was extracted
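
To make that concrete, here is a minimal typed sketch of a record in Python. The exact types and the class name are my assumptions; the samples below show salary as null, so every parsed field is modeled as nullable:

```python
# A typed sketch of the output record, mirroring the field list above.
# Nullable fields are the ones a parser cannot always recover from free text.
from typing import Optional, TypedDict

class WhoIsHiringRecord(TypedDict):
    commentId: str
    threadId: str
    threadTitle: str
    threadMonth: str
    author: str
    company: Optional[str]
    role: Optional[str]
    location: Optional[str]
    remote: Optional[str]
    salary: Optional[str]
    techStack: Optional[str]
    visa: Optional[str]
    applyUrl: Optional[str]
    email: Optional[str]
    fullText: str
    postedAt: str
    hnUrl: str
    scrapedAt: str
```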

The differences between approaches are not really about schema -- they are about reliability, maintenance burden, and total cost of ownership.

Approach 1: Roll your own scraper

The DIY path. Pros: total control, no third-party dependency, very cheap on small volumes. Cons: you own the proxy rotation, the rate-limit handling, the retry logic, the schema-drift detection, the scheduling, the monitoring, and the bug pager.

If you have one engineer who has done this kind of work before and you only need one source, this is fine. If you need ten sources, the maintenance burden compounds faster than you would expect.
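One thing that lowers the DIY floor for this particular source: HN has an official Firebase API, so you are fetching JSON rather than parsing HTML pages. A minimal sketch (the thread ID is taken from the sample records further down; the "Company | Role | Location" first-line convention is a heuristic, not a guarantee):

```python
# DIY sketch against the official HN Firebase API -- no HTML, no proxies.
import html
import re
import requests

API = "https://hacker-news.firebaseio.com/v0/item/{}.json"
THREAD_ID = 47975571  # thread ID from the sample records below

def fetch(item_id: int) -> dict:
    return requests.get(API.format(item_id), timeout=10).json() or {}

thread = fetch(THREAD_ID)
for kid in thread.get("kids", [])[:5]:  # top-level comments are the job posts
    comment = fetch(kid)
    if comment.get("deleted") or comment.get("dead"):
        continue
    # Comment text is HTML; the first <p>-separated block is conventionally
    # "Company | Role | Location | ...".
    header_html = comment.get("text", "").split("<p>")[0]
    header = html.unescape(re.sub(r"<[^>]+>", "", header_html))
    print(comment.get("id"), comment.get("by"), header[:80])
```

Everything after that print -- field parsing, retries, scheduling, drift alerts -- is the part you own.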

Approach 2: Generic crawl framework + custom selectors

The middle path. Use Scrapy or Playwright with your own parsing logic. Pros: less boilerplate, decent observability for free. Cons: you still own the proxy and rate-limit story, plus you are now coupled to a framework that has its own learning curve.

This is a sensible choice for multi-source projects where you want one mental model across all the scrapers.
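For a sense of what this path looks like, here is a bare-bones Scrapy sketch against the thread page. The selectors are my reading of HN's markup at the time of writing (tr.athing.comtr rows, an indent attribute marking nesting depth) and are exactly the kind of thing you own when the markup drifts:

```python
# Scrapy sketch: one spider, custom selectors, your parsing logic.
import scrapy

class WhoIsHiringSpider(scrapy.Spider):
    name = "who_is_hiring"
    # Thread URL built from the ID in the sample records below.
    start_urls = ["https://news.ycombinator.com/item?id=47975571"]

    def parse(self, response):
        for row in response.css("tr.athing.comtr"):
            # indent == "0" marks a top-level comment, i.e. a job post.
            if row.css("td.ind::attr(indent)").get() != "0":
                continue
            yield {
                "commentId": row.attrib.get("id"),
                "author": row.css("a.hnuser::text").get(),
                "fullText": " ".join(row.css("span.commtext ::text").getall()),
            }
```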

Approach 3: Managed scraping infrastructure

Use a hosted runner that handles proxies, scheduling and storage. Pros: minimal engineering time, predictable cost, very fast to get a first run out the door. Cons: cost scales with volume, less control over edge cases.

For one-off explorations and steady-state recurring pipelines under a few million records per month, this is what I keep ending up on.
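As a sketch of what "minimal engineering time" means here, this is roughly the whole integration using the Apify Python client and the actor named at the end of this post. The run-input keys are hypothetical; check the actor's input schema before copying this:

```python
# Managed path: trigger a hosted run, then read the structured records back.
from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")

# Start the actor and block until the run finishes.
run = client.actor("logiover/hacker-news-who-is-hiring-scraper").call(
    run_input={"month": "latest"},  # hypothetical input key
)

# Page through the dataset the run produced.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["company"], item["role"], item["location"])
```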

Two sample records (for context)

What the eventual output looks like, regardless of how you got there:

```json
{
  "commentId": "47975574",
  "threadId": "47975571",
  "threadTitle": "Ask HN: Who is hiring? (May 2026)",
  "threadMonth": "May 2026",
  "author": "chrisposhka",
  "company": "Pathos AI",
  "role": "Senior Software",
  "location": "NYC",
  "remote": "Hybrid",
  "salary": null
}
```

```json
{
  "commentId": "47975581",
  "threadId": "47975571",
  "threadTitle": "Ask HN: Who is hiring? (May 2026)",
  "threadMonth": "May 2026",
  "author": "verobytes",
  "company": "NetBird",
  "role": "Berlin, Germany",
  "location": "Berlin, Remote, remote",
  "remote": "Remote",
  "salary": null
}
```

How I would pick

A rough decision tree:

  • One-off exploration: managed approach. The setup cost of DIY is not worth it for a single run.
  • Steady recurring feed, single source, modest volume: managed approach, unless cost becomes prohibitive.
  • Multiple sources, large volume, dedicated team: framework + custom selectors. The unit economics flip.
  • Adversarial source with active anti-bot defences: probably a specialist provider, or a custom build with a serious proxy budget.

Verdict

For Hacker News Who Is Hiring specifically, the volume and update-frequency profile is moderate, and a managed runner is the most defensible default. The dataset shape above is the same either way.


For live, customizable extractions of this data, the actor that produced the dataset shown above is published on the Apify Store: logiover/hacker-news-who-is-hiring-scraper. It supports JSON, CSV and Excel exports and runs on a schedule.
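
If you would rather skip the client library, the same dataset is addressable over plain HTTP, which is also how the CSV and Excel exports work. A minimal sketch (the dataset ID is whatever a run reports as its defaultDatasetId):

```python
# Pull a finished run's dataset as CSV over the Apify HTTP API.
import requests

DATASET_ID = "<defaultDatasetId from a run>"
resp = requests.get(
    f"https://api.apify.com/v2/datasets/{DATASET_ID}/items",
    params={"format": "csv", "token": "<APIFY_API_TOKEN>"},
    timeout=30,
)
resp.raise_for_status()
with open("who_is_hiring.csv", "wb") as f:
    f.write(resp.content)
```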
