Damien Alleyne

Originally published at blog.alleyne.dev

I Built 2 Job Scrapers in One Weekend to Avoid Paying for Data

I run GlobalRemote, a curated job board that shows interview processes and hiring transparency upfront. To keep it relevant, I needed to update it 2x per week with fresh jobs from Greenhouse and Ashby boards.

The problem? The scraper I was using fetched every job from each company — Sales, HR, Support, everything — and stored it all in my Apify dataset. With 6-8 companies, that's 300-400 jobs per scrape, but only 5-10 were actually relevant.

I was burning through my Apify free tier ($5/month, ~2000 dataset operations) on irrelevant data. Two scrapes per week would blow past my quota. I wasn't ready to pay for a higher tier just to subsidize wasteful scraping.

So my options were:

  1. Update infrequently (once every 2-3 weeks) and let the board go stale

  2. Pay for a higher Apify tier to subsidize wasteful scraping

  3. Build my own scrapers with department filtering

I chose #3.

The scrapers are now live on Apify Store, open-source, and I'm dogfooding them on GlobalRemote right now.

The Problem: I Couldn't Update Frequently Enough

The scraper I was using worked like this:

  1. Fetch all jobs from a company's job board

  2. Store everything in an Apify dataset

  3. I filter locally for the jobs I actually want

This makes sense if you want all the jobs. But for a curated board like GlobalRemote, I only wanted:

  • Engineering roles (not Sales, Marketing, HR)

  • From specific departments (e.g., "Code Wrangling" at Automattic, "Engineering" at GitLab)

  • Recent postings (not 6-month-old listings)

With 300-400 jobs stored per scrape and only 5-10 relevant, I was wasting my dataset quota. Two scrapes per week would exceed my free tier limit. The choice was: pay for a higher tier or update less frequently. Neither was ideal.

The Solution: Per-URL Department Filtering

I built two Apify actors: a Greenhouse job board scraper and an Ashby job board scraper.

Both support per-URL configuration, meaning each company can have different filters:

```json
{
  "urls": [
    {
      "url": "https://job-boards.greenhouse.io/automatticcareers",
      "departments": [307170],
      "maxJobs": 50,
      "daysBack": 7
    },
    {
      "url": "https://job-boards.greenhouse.io/gitlab",
      "departments": [4011044002],
      "maxJobs": 20
    }
  ]
}
```
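
The numeric IDs in `departments` come from each board's own department list. One way to look them up is to query the public Greenhouse Job Board API directly; here's a minimal sketch (assumes Node 18+ for the built-in `fetch` and the documented `/departments` response shape):

```typescript
// List department IDs for a Greenhouse board so you can fill in the config above.
// Run with: npx tsx list-departments.ts automatticcareers
const boardToken = process.argv[2] ?? "automatticcareers";

interface Department {
  id: number;
  name: string;
  jobs?: unknown[];
}

async function listDepartments(token: string): Promise<void> {
  const res = await fetch(
    `https://boards-api.greenhouse.io/v1/boards/${token}/departments`
  );
  if (!res.ok) throw new Error(`Greenhouse API returned ${res.status}`);
  const { departments } = (await res.json()) as { departments: Department[] };

  for (const dept of departments) {
    // Prints e.g. "307170  Code Wrangling (12 open jobs)"
    console.log(`${dept.id}  ${dept.name} (${dept.jobs?.length ?? 0} open jobs)`);
  }
}

listDepartments(boardToken).catch(console.error);
```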

The scraper (a condensed sketch follows this list):

  1. Fetches department metadata

  2. Filters jobs by department ID before storing them

  3. Only stores jobs that match your criteria

  4. You only pay for the jobs you actually get (not the ones filtered out)
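
That sketch, assuming the public Greenhouse board API's `/jobs?content=true` response shape and the Apify SDK; the published actor handles more edge cases (null departments, missing dates) than this:

```typescript
import { Actor } from "apify";

interface BoardConfig {
  url: string;            // e.g. https://job-boards.greenhouse.io/automatticcareers
  departments: number[];  // Greenhouse department IDs to keep
  maxJobs?: number;
  daysBack?: number;
}

interface GreenhouseJob {
  id: number;
  title: string;
  absolute_url: string;
  updated_at: string;
  departments?: { id: number; name: string }[];
}

await Actor.init();
const input = await Actor.getInput<{ urls: BoardConfig[] }>();

for (const board of input?.urls ?? []) {
  // The board token is the last path segment of the job-board URL.
  const token = new URL(board.url).pathname.split("/").filter(Boolean).pop();

  // Steps 1-2: fetch jobs with their department metadata, then filter in memory.
  const res = await fetch(
    `https://boards-api.greenhouse.io/v1/boards/${token}/jobs?content=true`
  );
  const { jobs } = (await res.json()) as { jobs: GreenhouseJob[] };

  const cutoff = board.daysBack
    ? Date.now() - board.daysBack * 24 * 60 * 60 * 1000
    : 0;

  const matches = jobs
    .filter((job) => job.departments?.some((d) => board.departments.includes(d.id)))
    .filter((job) => new Date(job.updated_at).getTime() >= cutoff)
    .slice(0, board.maxJobs ?? Infinity);

  // Step 3: only matching jobs are pushed, so only they count against the dataset quota.
  await Actor.pushData(
    matches.map((job) => ({
      id: job.id,
      title: job.title,
      url: job.absolute_url,
      updatedAt: job.updated_at,
    }))
  );
}

await Actor.exit();
```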

Result: I went from storing 300-400 jobs per scrape to 30-50 jobs — an 80% reduction in dataset usage.

How I Built It

Tech Stack

  • Apify platform — handles hosting, scheduling, dataset storage

  • Greenhouse + Ashby APIs — public APIs for job boards

  • AI (Claude) — for rapid development

How the APIs Work

Both platforms expose public APIs for their job boards. This meant I could:

  • Fetch departments/teams programmatically

  • Filter by department/team ID before fetching job details

  • Only pull full job data for matches

  • Skip browser automation and HTML scraping entirely

This is key: the filtering happens before fetching details, not after. Most scrapers fetch everything and leave you to filter locally; mine filters first and only fetches what you need.
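
For Ashby, the same idea applies through its public Job Posting API. The sketch below assumes that endpoint and its `team`/`department`/`publishedAt` field names, filters by team name (which may differ from how the published actor matches teams), and uses "buffer" purely as an illustrative board name; verify all of that against the boards you target:

```typescript
// Minimal Ashby example: fetch a public job board and keep only matching teams.
// The endpoint and field names below are assumptions based on Ashby's public
// Job Posting API; verify them against the boards you actually scrape.
interface AshbyJob {
  title: string;
  team?: string;
  department?: string;
  location?: string;
  publishedAt?: string;
  jobUrl?: string;
}

async function fetchAshbyJobs(boardName: string, teams: string[]): Promise<AshbyJob[]> {
  const res = await fetch(
    `https://api.ashbyhq.com/posting-api/job-board/${boardName}?includeCompensation=true`
  );
  if (!res.ok) throw new Error(`Ashby API returned ${res.status}`);
  const { jobs } = (await res.json()) as { jobs: AshbyJob[] };

  // Filter on team/department name before anything gets stored or fetched in detail.
  return jobs.filter((job) =>
    teams.some(
      (t) => (job.team ?? job.department ?? "").toLowerCase() === t.toLowerCase()
    )
  );
}

// Example usage: "buffer" is an illustrative board name, not necessarily the real one.
const engineering = await fetchAshbyJobs("buffer", ["Engineering"]);
console.log(`${engineering.length} matching Engineering jobs`);
```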

Development Process

I built both scrapers over one weekend using AI (Claude).

Saturday (Jan 31): Greenhouse scraper

  • Prompt: "Build an Apify actor that scrapes Greenhouse job boards with department filtering"

  • AI figured out the API structure

  • I tested on Automattic and GitLab job boards

Sunday (Feb 1): Ashby scraper

  • Prompt: "Build an Apify actor for Ashby job boards with department filtering (similar structure to the existing Greenhouse scraper)"

  • AI figured out Ashby's API

  • Tested on Buffer, Zapier, RevenueCat

What AI handled:

  • Reading API documentation (Greenhouse, Ashby, Apify actor structure)

  • Writing the scraper logic and Apify boilerplate

  • Handling edge cases (null departments, missing dates)

  • Generating input/output schemas

What I did:

  • Product decisions (per-URL config vs global config)

  • Testing on real job boards

  • Iterating when things didn't work

  • Catching issues (e.g., updating Node 20 → 22 in the Dockerfile)

I never opened the Greenhouse API docs, the Ashby API docs, or the Apify actor docs myself.

Total development time: One weekend.

AI is a co-pilot, not an autopilot, but it handled all the research and boilerplate so I could focus on testing and product decisions.

Dogfooding on GlobalRemote

I'm using both scrapers to populate GlobalRemote right now.

When I need fresh data, I trigger both scrapers. They return 30-50 relevant jobs instead of 300-400, keeping me well within my Apify free tier.
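
If you'd rather trigger runs and pull results programmatically than click around the Apify console, the `apify-client` package covers it. A sketch with a placeholder actor ID and the same input shape as above (not the actual GlobalRemote setup):

```typescript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Placeholder actor ID; substitute whichever Greenhouse/Ashby actor you use.
const run = await client.actor("username/greenhouse-job-scraper").call({
  urls: [
    {
      url: "https://job-boards.greenhouse.io/automatticcareers",
      departments: [307170],
      maxJobs: 50,
      daysBack: 7,
    },
  ],
});

// Only the filtered jobs ever land in the dataset, so this list stays small.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Fetched ${items.length} relevant jobs`);
```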

What I've learned from dogfooding:

  • Department filtering reduced dataset usage by ~80%

  • I can now update regularly without exceeding my quota

If the scrapers break, GlobalRemote breaks. That's a strong incentive to keep them working.

What I Learned

1. Filter before storing, not after

For curated job boards, filtering before storage is way more cost-effective. The scraper I was using didn't do this.

2. Per-URL config beats global config

My first version had global department filters (same filter for all companies). That was a mistake. Different companies organize departments differently. Per-URL config gives users way more flexibility.

3. Real examples > Fake examples

In my README, I used real companies (Automattic, GitLab) and real department IDs (307170 = "Code Wrangling" at Automattic). Fake examples would've been useless for someone trying to replicate this.

4. AI accelerates weekend projects into production tools

I shipped two working scrapers in one weekend without reading a single API doc. AI handled research and implementation; I handled product decisions and testing. That's the real power of AI in 2026.

5. Open-sourcing on Apify was easy

Publishing to Apify Store took ~10 minutes:

  • Add README

  • Set pricing

  • Add input/output schemas

  • Add Banking information (they prefer PayPal)

  • Click "Publish"

What's Next

Both scrapers are live and stable. I will be using them on GlobalRemote twice a week, well within my free tier.

Potential improvements:

  • Add automated tests (right now it's just manual verification)

  • Add salary parsing to Ashby scraper (Greenhouse already extracts salary ranges)

  • Build a Lever scraper (if there's demand)

But honestly? I built these to solve my own problem. If other people find them useful, great. If not, I'm still updating GlobalRemote 2x/week without blowing my budget.


Links

If you're building a job board or need ATS data, feel free to use them. And if you have feedback or find bugs, I'm on LinkedIn or reachable via Apify.
