Damien Alleyne

Originally published at blog.alleyne.dev

I Built 2 Job Scrapers in One Weekend to Avoid Paying for Data

I run GlobalRemote, a curated job board that shows interview processes and hiring transparency upfront. To keep it relevant, I needed to update it 2x per week with fresh jobs from Greenhouse and Ashby boards.

The problem? The scraper I was using fetched every job from each company — Sales, HR, Support, everything — and stored it all in my Apify dataset. With 6-8 companies, that's 300-400 jobs per scrape, but only 5-10 were actually relevant.

I was burning through my Apify free tier ($5/month, ~2000 dataset operations) on irrelevant data. Two scrapes per week would blow past my quota. I wasn't ready to pay for a higher tier just to subsidize wasteful scraping.

So my options were:

  1. Update infrequently (once every 2-3 weeks) and let the board go stale

  2. Pay for a higher Apify tier to subsidize wasteful scraping

  3. Build my own scrapers with department filtering

I chose #3.

The scrapers are now live on Apify Store, open-source, and I'm dogfooding them on GlobalRemote right now.

The Problem: I Couldn't Update Frequently Enough

The scraper I was using worked like this:

  1. Fetch all jobs from a company's job board

  2. Store everything in an Apify dataset

  3. I filter locally for the jobs I actually want

This makes sense if you want all the jobs. But for a curated board like GlobalRemote, I only wanted:

  • Engineering roles (not Sales, Marketing, HR)

  • From specific departments (e.g., "Code Wrangling" at Automattic, "Engineering" at GitLab)

  • Recent postings (not 6-month-old listings)

With 300-400 jobs stored per scrape and only 5-10 relevant, I was wasting my dataset quota. Two scrapes per week would exceed my free tier limit. The choice was: pay for a higher tier or update less frequently. Neither was ideal.

The Solution: Per-URL Department Filtering

I built two Apify actors: a Greenhouse job board scraper and an Ashby job board scraper.

Both support per-URL configuration, meaning each company can have different filters:

```json
{
  "urls": [
    {
      "url": "https://job-boards.greenhouse.io/automatticcareers",
      "departments": [307170],
      "maxJobs": 50,
      "daysBack": 7
    },
    {
      "url": "https://job-boards.greenhouse.io/gitlab",
      "departments": [4011044002],
      "maxJobs": 20
    }
  ]
}
```
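
The numeric IDs in `departments` come from each board's own department list. One way to look them up is to query the public Greenhouse Job Board API directly; here's a minimal sketch (assumes Node 18+ for the built-in `fetch` and the documented `/departments` response shape):

```typescript
// List department IDs for a Greenhouse board so you can fill in the config above.
// Run with: npx tsx list-departments.ts automatticcareers
const boardToken = process.argv[2] ?? "automatticcareers";

interface Department {
  id: number;
  name: string;
  jobs?: unknown[];
}

async function listDepartments(token: string): Promise<void> {
  const res = await fetch(
    `https://boards-api.greenhouse.io/v1/boards/${token}/departments`
  );
  if (!res.ok) throw new Error(`Greenhouse API returned ${res.status}`);
  const { departments } = (await res.json()) as { departments: Department[] };

  for (const dept of departments) {
    // Prints e.g. "307170  Code Wrangling (12 open jobs)"
    console.log(`${dept.id}  ${dept.name} (${dept.jobs?.length ?? 0} open jobs)`);
  }
}

listDepartments(boardToken).catch(console.error);
```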

The scraper (a condensed sketch follows this list):

  1. Fetches department metadata

  2. Filters jobs by department ID before storing them

  3. Only stores jobs that match your criteria

  4. You only pay for the jobs you actually get (not the ones filtered out)
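
That sketch, assuming the public Greenhouse board API's `/jobs?content=true` response shape and the Apify SDK; the published actor handles more edge cases (null departments, missing dates) than this:

```typescript
import { Actor } from "apify";

interface BoardConfig {
  url: string;            // e.g. https://job-boards.greenhouse.io/automatticcareers
  departments: number[];  // Greenhouse department IDs to keep
  maxJobs?: number;
  daysBack?: number;
}

interface GreenhouseJob {
  id: number;
  title: string;
  absolute_url: string;
  updated_at: string;
  departments?: { id: number; name: string }[];
}

await Actor.init();
const input = await Actor.getInput<{ urls: BoardConfig[] }>();

for (const board of input?.urls ?? []) {
  // The board token is the last path segment of the job-board URL.
  const token = new URL(board.url).pathname.split("/").filter(Boolean).pop();

  // Steps 1-2: fetch jobs with their department metadata, then filter in memory.
  const res = await fetch(
    `https://boards-api.greenhouse.io/v1/boards/${token}/jobs?content=true`
  );
  const { jobs } = (await res.json()) as { jobs: GreenhouseJob[] };

  const cutoff = board.daysBack
    ? Date.now() - board.daysBack * 24 * 60 * 60 * 1000
    : 0;

  const matches = jobs
    .filter((job) => job.departments?.some((d) => board.departments.includes(d.id)))
    .filter((job) => new Date(job.updated_at).getTime() >= cutoff)
    .slice(0, board.maxJobs ?? Infinity);

  // Step 3: only matching jobs are pushed, so only they count against the dataset quota.
  await Actor.pushData(
    matches.map((job) => ({
      id: job.id,
      title: job.title,
      url: job.absolute_url,
      updatedAt: job.updated_at,
    }))
  );
}

await Actor.exit();
```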

Result: I went from storing 300-400 jobs per scrape to 30-50 jobs — an 80% reduction in dataset usage.

How I Built It

Tech Stack

  • Apify platform — handles hosting, scheduling, dataset storage

  • Greenhouse + Ashby APIs — public APIs for job boards

  • AI (Claude) — for rapid development

How the APIs Work

Both platforms expose public APIs for their job boards. This meant I could:

  • Fetch departments/teams programmatically

  • Filter by department/team ID before fetching job details

  • Only pull full job data for matches

  • Skip browser automation and HTML scraping entirely

This is key: the filtering happens before fetching details, not after. Most scrapers fetch everything and leave you to filter locally; mine filters first and only fetches what you need.
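
For Ashby, the same idea applies through its public Job Posting API. The sketch below assumes that endpoint and its `team`/`department`/`publishedAt` field names, filters by team name (which may differ from how the published actor matches teams), and uses "buffer" purely as an illustrative board name; verify all of that against the boards you target:

```typescript
// Minimal Ashby example: fetch a public job board and keep only matching teams.
// The endpoint and field names below are assumptions based on Ashby's public
// Job Posting API; verify them against the boards you actually scrape.
interface AshbyJob {
  title: string;
  team?: string;
  department?: string;
  location?: string;
  publishedAt?: string;
  jobUrl?: string;
}

async function fetchAshbyJobs(boardName: string, teams: string[]): Promise<AshbyJob[]> {
  const res = await fetch(
    `https://api.ashbyhq.com/posting-api/job-board/${boardName}?includeCompensation=true`
  );
  if (!res.ok) throw new Error(`Ashby API returned ${res.status}`);
  const { jobs } = (await res.json()) as { jobs: AshbyJob[] };

  // Filter on team/department name before anything gets stored or fetched in detail.
  return jobs.filter((job) =>
    teams.some(
      (t) => (job.team ?? job.department ?? "").toLowerCase() === t.toLowerCase()
    )
  );
}

// Example usage: "buffer" is an illustrative board name, not necessarily the real one.
const engineering = await fetchAshbyJobs("buffer", ["Engineering"]);
console.log(`${engineering.length} matching Engineering jobs`);
```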

Development Process

I built both scrapers over one weekend using AI (Claude).

Saturday (Jan 31): Greenhouse scraper

  • Prompt: "Build an Apify actor that scrapes Greenhouse job boards with department filtering"

  • AI figured out the API structure

  • I tested on Automattic and GitLab job boards

Sunday (Feb 1): Ashby scraper

  • Prompt: "Build an Apify actor for Ashby job boards with department filtering (similar structure to the existing Greenhouse scraper)"

  • AI figured out Ashby's API

  • Tested on Buffer, Zapier, RevenueCat

What AI handled:

  • Reading API documentation (Greenhouse, Ashby, Apify actor structure)

  • Writing the scraper logic and Apify boilerplate

  • Handling edge cases (null departments, missing dates)

  • Generating input/output schemas

What I did:

  • Product decisions (per-URL config vs global config)

  • Testing on real job boards

  • Iterating when things didn't work

  • Catching issues (e.g., updating Node 20 → 22 in the Dockerfile)

I never opened the Greenhouse API docs, the Ashby API docs, or the Apify actor docs myself.

Total development time: One weekend.

AI is a co-pilot, not an autopilot, but it handled all the research and boilerplate so I could focus on testing and product decisions.

Dogfooding on GlobalRemote

I'm using both scrapers to populate GlobalRemote right now.

When I need fresh data, I trigger both scrapers. They return 30-50 relevant jobs instead of 300-400, keeping me well within my Apify free tier.
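
If you'd rather trigger runs and pull results programmatically than click around the Apify console, the `apify-client` package covers it. A sketch with a placeholder actor ID and the same input shape as above (not the actual GlobalRemote setup):

```typescript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Placeholder actor ID; substitute whichever Greenhouse/Ashby actor you use.
const run = await client.actor("username/greenhouse-job-scraper").call({
  urls: [
    {
      url: "https://job-boards.greenhouse.io/automatticcareers",
      departments: [307170],
      maxJobs: 50,
      daysBack: 7,
    },
  ],
});

// Only the filtered jobs ever land in the dataset, so this list stays small.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Fetched ${items.length} relevant jobs`);
```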

What I've learned from dogfooding:

  • Department filtering reduced dataset usage by ~80%

  • I can now update regularly without exceeding my quota

If the scrapers break, GlobalRemote breaks. That's a strong incentive to keep them working.

What I Learned

1. Filter before storing, not after

For curated job boards, filtering before storage is way more cost-effective. The scraper I was using didn't do this.

2. Per-URL config beats global config

My first version had global department filters (same filter for all companies). That was a mistake. Different companies organize departments differently. Per-URL config gives users way more flexibility.

3. Real examples > Fake examples

In my README, I used real companies (Automattic, GitLab) and real department IDs (307170 = "Code Wrangling" at Automattic). Fake examples would've been useless for someone trying to replicate this.

4. AI accelerates weekend projects into production tools

I shipped two working scrapers in one weekend without reading a single API doc. AI handled research and implementation; I handled product decisions and testing. That's the real power of AI in 2026.

5. Open-sourcing on Apify was easy

Publishing to Apify Store took ~10 minutes:

  • Add README

  • Set pricing

  • Add input/output schemas

  • Add Banking information (they prefer PayPal)

  • Click "Publish"

What's Next

Both scrapers are live and stable. I will be using them on GlobalRemote twice a week, well within my free tier.

Potential improvements:

  • Add automated tests (right now it's just manual verification)

  • Add salary parsing to Ashby scraper (Greenhouse already extracts salary ranges)

  • Build a Lever scraper (if there's demand)

But honestly? I built these to solve my own problem. If other people find them useful, great. If not, I'm still updating GlobalRemote 2x/week without blowing my budget.


Links

If you're building a job board or need ATS data, feel free to use them. And if you have feedback or find bugs, I'm on LinkedIn or reachable via Apify.
