Freshactors

Posted on Jun 1

How to scrape Greenhouse & Lever ATS jobs data with Python (no API key needed)

#webscraping #api #node #tutorial

If you need live job-posting data (for sales intelligence, a niche job board, or labor-market research), two ATS platforms cover a large slice of the market: Greenhouse and Lever. The good news: both expose public JSON job boards, so you don't need an ATS API key, an ATS account, or a headless browser to read them.

The annoying part: Greenhouse and Lever return completely different JSON shapes. Scrape both yourself and you end up maintaining two parsers that break independently. In this tutorial we'll skip that tax by calling a ready-made Apify actor that normalizes both ATS into one schema, and you'll run it from Python in a few lines. The only credential involved is your own Apify token.

The problem with rolling your own

Let's be concrete. A raw Greenhouse board lives at:

https://boards-api.greenhouse.io/v1/boards/{token}/jobs

A Lever board lives at:

https://api.lever.co/v0/postings/{token}?mode=json

Different hosts, different field names, different location/department conventions, and different ways of expressing "remote". You can write adapters for both — but then you own them forever, including the day one of them quietly changes a field name and your pipeline silently goes empty.
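
To make the mismatch concrete, here is a minimal sketch of the two adapters you would end up owning. The field names (jobs, absolute_url and location.name for Greenhouse; text, categories and hostedUrl for Lever) reflect the public board responses as I understand them and should be treated as assumptions to verify against a live board. The function names and output keys are hypothetical.

```python
from __future__ import annotations

from typing import Any


def from_greenhouse(job: dict[str, Any], company: str) -> dict[str, Any]:
    """Map one Greenhouse job (assumed shape) to a small common record."""
    location = job.get("location") or {}  # assumed: {"name": "..."}
    return {
        "source": "greenhouse",
        "company": company,
        "title": job.get("title"),
        "location": location.get("name"),
        "url": job.get("absolute_url"),
    }


def from_lever(posting: dict[str, Any], company: str) -> dict[str, Any]:
    """Map one Lever posting (assumed shape) to the same common record."""
    categories = posting.get("categories") or {}  # assumed: location, team, ...
    return {
        "source": "lever",
        "company": company,
        "title": posting.get("text"),
        "location": categories.get("location"),
        "url": posting.get("hostedUrl"),
    }


def parse_board(payload: Any, company: str) -> list[dict[str, Any]]:
    """Dispatch on payload shape.

    Assumption: Greenhouse wraps postings in {"jobs": [...]},
    while Lever returns a bare JSON list.
    """
    if isinstance(payload, dict):
        return [from_greenhouse(job, company) for job in payload.get("jobs", [])]
    return [from_lever(posting, company) for posting in payload]
```

Even this toy version has two places where a renamed field would quietly turn every title or location into None, which is exactly the failure mode described above.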

A cleaner path: hand a list of company tokens to an actor that already maps both into a single record, and just consume the output. Here's how with the Greenhouse & Lever Jobs Scraper.

Step 1 — Install the Apify client

pip install apify-client

Grab your Apify API token from Settings → Integrations in the Apify Console. We'll read it from an environment variable so it never lands in source control:

export APIFY_TOKEN="apify_api_xxx"

Step 2 — Run the actor with a list of companies

The actor takes a companies array of bare tokens (gitlab) or full board URLs (https://jobs.lever.co/spotify). With ats: "auto", bare tokens are tried against Greenhouse first, then Lever; URLs are detected automatically.

import os
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])

run_input = {
    "companies": [
        "gitlab",                              # bare token (auto-detected)
        "https://jobs.lever.co/spotify",       # Lever URL
        "https://boards.greenhouse.io/airbnb", # Greenhouse URL
    ],
    "ats": "auto",
    "includeDescription": True,
    "maxJobsPerCompany": 500,
}

# Blocks until the run finishes, then returns run metadata.
run = client.actor("freshactors/greenhouse-lever-jobs-scraper").call(run_input=run_input)

print("Run status:", run["status"])
print("Dataset id:", run["defaultDatasetId"])

.call() is synchronous — it waits for the run to complete and hands you the run object, including the defaultDatasetId where results land.
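
Because .call() returns once the run has ended rather than once it has succeeded, it can be worth checking the outcome before reading the dataset. A minimal sketch, assuming the run object reports SUCCEEDED on success and some other status (for example FAILED, TIMED-OUT or ABORTED) otherwise; require_success is a hypothetical helper, not part of the Apify client:

```python
from typing import Optional


def require_success(run: Optional[dict]) -> str:
    """Return the default dataset id, or raise if the run did not succeed."""
    if run is None:
        raise RuntimeError("Actor run returned no run object")
    status = run.get("status")
    if status != "SUCCEEDED":
        raise RuntimeError(f"Actor run ended with status {status!r}")
    return run["defaultDatasetId"]
```

You would call it as dataset_id = require_success(run) right after .call(), so a failed run stops the script instead of producing an empty result.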

Step 3 — Read the normalized output

Every record — Greenhouse or Lever — comes back in the same shape. ATS-specific gaps are null, never missing keys, so your downstream code can rely on the schema:

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(
        f'[{item["_source"]:<10}] '
        f'{item["company"]:<10} '
        f'{item["title"]}  '
        f'({item.get("workplaceType") or "n/a"}, {item.get("location") or "n/a"})'
    )

A single record looks like this:

{
  "_type": "job",
  "_schemaVersion": "1.0",
  "_source": "lever",
  "company": "spotify",
  "jobId": "1ff4a4e3-...",
  "title": "Account Executive - Backstage",
  "department": "Operations and Business Support",
  "team": "Platform",
  "location": "Toronto",
  "allLocations": ["Toronto"],
  "workplaceType": "hybrid",
  "commitment": "Permanent",
  "country": "CA",
  "url": "https://jobs.lever.co/spotify/1ff4a4e3-...",
  "applyUrl": "https://jobs.lever.co/spotify/1ff4a4e3-.../apply",
  "postedAt": "2026-03-12T17:10:21.350Z",
  "updatedAt": null,
  "descriptionText": "About the role...",
  "_scrapedAt": "2026-06-01T09:14:02.118Z"
}

Because both ATS share this schema, you never branch on _source to read a field — you only read it if you want to know where the record came from.
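
One practical consequence of the shared schema is that a flat export needs no per-source handling. A small sketch that writes selected fields to CSV; the column selection and the write_jobs_csv helper are illustrative choices, not something the actor provides:

```python
import csv
from typing import Iterable, TextIO

# Columns taken from the sample record above; extend as needed.
COLUMNS = [
    "_source", "company", "title", "department",
    "location", "workplaceType", "postedAt", "url",
]


def write_jobs_csv(items: Iterable[dict], out: TextIO) -> int:
    """Write one CSV row per job record and return the number of rows written."""
    # extrasaction="ignore" drops fields not listed in COLUMNS;
    # null (None) values are written as empty cells.
    writer = csv.DictWriter(out, fieldnames=COLUMNS, extrasaction="ignore", restval="")
    writer.writeheader()
    count = 0
    for item in items:
        writer.writerow(item)
        count += 1
    return count
```

To use it, open a file with open("jobs.csv", "w", newline="", encoding="utf-8") and pass client.dataset(run["defaultDatasetId"]).iterate_items() as the items argument.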

Step 4 — A practical filter (remote roles, posted recently)

Say you only care about remote roles posted in the last 14 days. With one schema, the filter is short:

from datetime import datetime, timedelta, timezone

cutoff = datetime.now(timezone.utc) - timedelta(days=14)
remote_recent = []

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item.get("workplaceType") != "remote":
        continue
    posted = item.get("postedAt")
    if posted and datetime.fromisoformat(posted.replace("Z", "+00:00")) >= cutoff:
        remote_recent.append(item)

print(f"{len(remote_recent)} remote roles posted in the last 14 days")

No Greenhouse-vs-Lever special-casing — workplaceType and postedAt mean the same thing in every record.
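
If you also want to narrow the result to a function such as engineering, the same record gives you department, team, and title to match against. A rough sketch: the keyword list and the is_engineering helper are illustrative assumptions, and department names are whatever each employer entered, so expect to tune the keywords for your targets.

```python
# Hypothetical keyword list; adjust to the employers you track.
ENGINEERING_KEYWORDS = ("engineer", "developer", "software")


def is_engineering(item: dict) -> bool:
    """True if department, team, or title mentions an engineering keyword."""
    haystack = " ".join(
        str(item.get(field) or "") for field in ("department", "team", "title")
    ).lower()
    return any(keyword in haystack for keyword in ENGINEERING_KEYWORDS)
```

Applied to the list built above, that would be remote_engineering = [job for job in remote_recent if is_engineering(job)].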

Step 5 — Lighter, faster runs

Two knobs control cost and speed:

includeDescription: false skips fetching full descriptionText — much faster when you only need titles, departments, and locations for, say, a hiring-signal dashboard.
maxJobsPerCompany caps postings per company (1–5000) so a 3,000-role employer doesn't dominate your run.

run_input = {
    "companies": ["gitlab", "spotify", "netflix", "airbnb"],
    "ats": "auto",
    "includeDescription": False,   # metadata only
    "maxJobsPerCompany": 200,
}

Prefer Node.js?

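Same actor, same input, using the JavaScript client. The snippet uses import and top-level await, so run it as an ES module (for example, a .mjs file):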

npm install apify-client

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor('freshactors/greenhouse-lever-jobs-scraper').call({
    companies: ['gitlab', 'https://jobs.lever.co/spotify'],
    ats: 'auto',
    includeDescription: true,
    maxJobsPerCompany: 500,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const job of items) {
    console.log(`[${job._source}] ${job.company} — ${job.title}`);
}

What about cost?

It's pay-per-event: $0.02 per company board fetched and $0.0005 per job posting returned. So 5 companies returning 100 postings total is 5 × $0.02 + 100 × $0.0005 = $0.15. No subscription — you pay for what you pull.
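
The same arithmetic as a small estimator, useful for budgeting a scheduled run. It assumes the two prices quoted above stay as stated; estimate_cost is a hypothetical helper, and Decimal is used so the cents add up exactly.

```python
from decimal import Decimal

# Prices as quoted in this article; check the actor page before relying on them.
PRICE_PER_COMPANY = Decimal("0.02")   # per company board fetched
PRICE_PER_JOB = Decimal("0.0005")     # per job posting returned


def estimate_cost(companies: int, jobs: int) -> Decimal:
    """Estimated run cost in USD for the given event counts."""
    return companies * PRICE_PER_COMPANY + jobs * PRICE_PER_JOB


print(estimate_cost(5, 100))  # 0.1500
```

By the same formula, four companies capped at 200 postings each would cost at most 4 × $0.02 + 800 × $0.0005 = $0.48.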

Why use the actor instead of hitting the boards directly?

You can curl those public endpoints yourself. The reason to use the actor is maintenance: it normalizes both ATS into one schema, auto-detects Greenhouse vs Lever, isolates per-company failures, and is monitored by a daily canary so a silent ATS field change doesn't quietly empty your pipeline. That operational reliability is the whole point.

If you want to skip the two-parser tax, the actor is here: Greenhouse & Lever Jobs Scraper on Apify. Run it on a schedule, point it at your target companies, and consume one clean JSON feed.

Happy scraping.

DEV Community