Every company posts jobs differently. That one fact will break your scraper within the first week if you are not ready for it.
I built https://www.jobscroller.net to crawl 1,100 company career pages daily and serve the results with direct apply links. No aggregator chains, no stale listings. What started as a simple Puppeteer script turned into a system with 20 separate adapters, each handling a different hiring platform. Here is what I actually learned.
The ATS problem is bigger than you think
An ATS is the software companies use to manage job postings. Greenhouse, Lever, Ashby, Workday, Rippling, BambooHR, iCIMS, Workable, SmartRecruiters. The list goes on. Each one works differently. Some expose clean JSON APIs. Some render everything in JavaScript and give you nothing without a real browser. Some do both but make the API hard to find.
When I started I wrote one generic scraper. It worked on maybe 30% of sites. The other 70% returned empty results, threw errors, or silently failed.
The fix was to stop trying to write one solution and accept that each platform needs its own adapter. Now the crawler detects the platform first, then routes to the right adapter.
if (company.job_domain?.includes('.myworkdayjobs.com')) {
  jobs = await fetchWorkdayJobs(company.job_domain)
} else if (company.lever_slug) {
  jobs = await fetchLeverJobs(company.lever_slug)
} else if (company.job_domain?.includes('ashbyhq.com')) {
  jobs = await fetchAshbyJobs(company.job_domain)
} else {
  jobs = await fetchWithPuppeteer(company.job_domain)
}
Detection comes from a job_domain field I maintain per company. When I add a new company I identify which platform they use and store that URL. This takes 30 seconds per company and saves hours of debugging later.
Workday will test your patience
Workday is the most common enterprise ATS and the most unpredictable to work with.
Workday exposes a JSON API at a predictable path. For a company at nvidia.wd5.myworkdayjobs.com the jobs endpoint sits at https://nvidia.wd5.myworkdayjobs.com/wday/cxs/nvidia/careers/jobs. Straightforward.
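A minimal sketch of that request. The POST body shape here (appliedFacets, limit, offset) is what public Workday career sites commonly accept, and the careers path segment is the site slug, which varies per company, so treat both as assumptions to verify against your target tenant:

```javascript
// Build the cxs jobs endpoint from the stored job_domain.
// e.g. nvidia.wd5.myworkdayjobs.com → tenant "nvidia"
function workdayJobsUrl(jobDomain, site = 'careers') {
  const tenant = jobDomain.split('.')[0]
  return `https://${jobDomain}/wday/cxs/${tenant}/${site}/jobs`
}

// Fetch one page of listings. Responses I have seen look like
// { total, jobPostings: [{ title, externalPath, locationsText, ... }] }.
async function fetchWorkdayPage(jobDomain, offset, limit = 20) {
  const res = await fetch(workdayJobsUrl(jobDomain), {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ appliedFacets: {}, limit, offset, searchText: '' })
  })
  if (!res.ok) throw new Error(`Workday responded ${res.status}`)
  return res.json()
}
```

From there pagination is just incrementing offset until you have total listings.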
Except some Workday instances, the ones on wd5 rather than wd3, reject the sortBy query parameter with a 422 error. Not a 400, not a 500. A 422. The first time I saw this I spent two hours checking my request format before realizing the parameter itself was the problem on those instances.
The fix detects this on the first request and retries cleanly.
if (res.status === 422 && useSortBy && offset === 0) {
  console.warn('Workday 422 with sortBy, retrying without it')
  useSortBy = false
  continue
}
The second Workday problem is scale. Nvidia has over 2,000 job listings. Fetching the detail page for each one to get the full description would take 20+ minutes for one company and likely trigger rate limiting. For companies above 250 jobs I skip detail fetches entirely and store the title and location from the listing page only.
const SKIP_DETAILS_THRESHOLD = 250
const skipDetails = listings.length >= SKIP_DETAILS_THRESHOLD
You lose some data. You gain a crawl that actually finishes.
When there is no API, use Puppeteer carefully
Some platforms give you nothing without a real browser. Rippling, BambooHR, Breezy and a few others render their job listings entirely in JavaScript. You need Chromium to load the page before you can read anything.
Puppeteer handles this but it comes with real costs. It is slow. It uses a lot of memory. And it is fragile in ways that API calls are not. A DOM change on the company's side silently breaks your extractor and you only notice when you check the logs the next morning.
I run browser scrapes with a timeout and treat any failure as a signal to skip that company for the day rather than let it block the whole crawl.
const browser = await puppeteer.launch({ headless: true })
const page = await browser.newPage()
await page.goto(careersUrl, {
  waitUntil: 'networkidle2',
  timeout: 30000
})
The networkidle2 wait strategy works better than load for SPAs because it waits until network activity settles, which usually means the job listings have rendered.
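The skip-for-the-day policy can be sketched as a generic timeout wrapper. The adapter function and the 45-second budget are placeholders, not my production values:

```javascript
// Race a promise against a timeout so a hung page load cannot stall the crawl.
function withTimeout(promise, ms) {
  let timer
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
  })
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer))
}

// Any failure, timeout or crash, means "skip this company today".
async function scrapeOrSkip(company, scrapeFn, ms = 45000) {
  try {
    return await withTimeout(scrapeFn(company), ms)
  } catch (err) {
    console.warn(`Skipping ${company.name} today: ${err.message}`)
    return null // the sync layer treats null as "no update", not "zero jobs"
  }
}
```

Returning null rather than an empty array matters: an empty array means "the careers page really has no jobs", which triggers deactivations downstream.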
On Windows specifically, Puppeteer leaves temp profile directories behind after each run. If you do not clean those up, you fill your disk within a week. I added cleanup logic that handles file locks gracefully because Windows holds onto those files longer than you expect.
Syncing without duplicates
After each crawl you have a list of jobs from the company's current careers page. Your database has the jobs you stored from previous crawls. You need to figure out what is new, what disappeared, and what came back.
The naive approach is delete everything and reinsert. Do not do this. You lose history, you lose user data tied to job IDs, and you generate thousands of unnecessary database writes.
The right approach is a three way diff.
for (const job of freshJobs) {
  if (existingUrls.has(job.url)) {
    // seen before, mark active in case it was deactivated
    updates.push({ id: existingUrls.get(job.url), is_active: true })
  } else {
    // new job, insert it
    inserts.push(job)
  }
}

for (const [url, id] of existingUrls.entries()) {
  if (!newUrls.has(url)) {
    // disappeared from careers page, deactivate
    updates.push({ id, is_active: false })
  }
}
The url column has a UNIQUE constraint in the database. This is your safety net. Even if a bug in your code tries to insert the same job twice, the database rejects the second insert. Build this constraint in from day one.
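Belt and braces: you can also dedupe by url in application code before the insert ever reaches the database, so the constraint only fires on genuine bugs. A minimal sketch:

```javascript
// Keep the first occurrence of each url; drop later duplicates.
// Adapters occasionally return the same posting twice across pages.
function dedupeByUrl(jobs) {
  const seen = new Set()
  return jobs.filter(job => {
    if (seen.has(job.url)) return false
    seen.add(job.url)
    return true
  })
}
```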
One edge case that bit me early: some companies run partial crawl failures where the API returns fewer results than normal because of a timeout or rate limit. If you deactivate all "missing" jobs after a partial fetch, you incorrectly mark hundreds of real active jobs as inactive. Now I return an empty array from any adapter that encounters an error mid-pagination, which tells the sync layer to skip deactivation for that company entirely.
if (fetchFailed && listings.length > 0) {
  console.warn('Partial fetch detected, skipping to avoid false deactivations')
  return []
}
The 404 mistake that hurt SEO
I originally deleted inactive jobs from the database after 5 days. The reasoning seemed sound. Old data takes up space. Jobs that have been closed for a week are not useful to anyone.
What I did not account for is Google's crawl schedule. Google indexed a job page, then came back to recrawl it two weeks later. By that point the row was gone from the database. The page returned a 404. I ended up with 2,247 URLs in Google Search Console flagged as not found.
Each 404 on a previously indexed page is a signal that your site has low quality content. At scale this hurts your domain's overall standing with Google.
The fix is simple. Never delete job rows. Set is_active = false and keep the row forever. The page renders a "this position has been filled" view with similar open roles instead of a 404. Google comes back, finds a real page, updates its index. Problem solved.
// removed this block entirely
// const { error } = await supabase
//   .from('Jobs')
//   .delete()
//   .eq('is_active', false)
//   .lt('updated_at', cutoff)
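The routing rule reduces to a small pure function. The view names here are placeholders for whatever your renderer uses; the point is that an inactive job still maps to a 200, never a 404:

```javascript
// Decide status and view for a job page request.
// `job` is the database row, or null if the id never existed.
function jobPageResponse(job) {
  if (!job) return { status: 404, view: 'not-found' }       // genuinely unknown id
  if (!job.is_active) return { status: 200, view: 'job-filled' } // closed but kept
  return { status: 200, view: 'job' }                        // live posting
}
```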
What actually takes the most time
Building the initial adapters took about two weeks. Maintaining them is an ongoing job.
Companies rebrand and change their careers URLs. ATS platforms push updates that break your selectors. A company that used Greenhouse six months ago switches to Ashby. You only find out when the crawl logs show zero jobs for three days in a row.
I check the crawl summary every morning. Any company showing zero jobs two days in a row gets investigated. Usually it is a URL change. Sometimes it is a platform switch. Occasionally it is a company that paused hiring.
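That morning check is easy to automate. A sketch, assuming a history map of recent per-company job counts with the newest count last (the shape is my invention, not my production schema):

```javascript
// Flag companies whose last `days` crawls all returned zero jobs.
function companiesToInvestigate(history, days = 2) {
  return Object.entries(history)
    .filter(([, counts]) =>
      counts.length >= days && counts.slice(-days).every(c => c === 0))
    .map(([name]) => name)
}
```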
The engineering is the easy part. Staying on top of 1,100 moving targets is the actual work.
If you are building something similar I am happy to answer questions in the comments. The site is live at https://www.jobscroller.net and the API is open if you want to query the data directly.