Devil Scrapes

Posted on May 31

ATS Tech Stack Detector: pull company back-end stacks from jobs for $5.05/1K

#webscraping #python #apify #data

Quick answer: Greenhouse, Lever, and Ashby each publish a public job-board API that any job aggregator can hit, no auth required. An ATS tech stack detector calls those APIs, strips the HTML from each job description, then runs a curated vocabulary of ~110 canonical tech names against the text to produce a deduplicated detected_techs list per job row. The Apify Actor below does it for $0.005 per row ($5.05 per 1,000 jobs in a single run, including the $0.05 per-run start fee), with exponential backoff, per-company fault isolation, and Pydantic-validated output ready to drop into any CRM pipeline.

Every B2B sales motion eventually hits the same wall: you need to know what your prospect actually runs in production, and front-end sniffers like BuiltWith and Wappalyzer can't tell you. They read Cloudflare headers, tag-manager IDs, and JS bundle manifests — which surfaces HubSpot and Segment beautifully but says nothing about whether the company runs Postgres or Snowflake, Kafka or RabbitMQ, Kubernetes or Nomad. Engineering teams declare their real stack in job descriptions. A "Senior Backend Engineer" posting that lists Django, Postgres, Kafka, and Kubernetes tells you more about their infrastructure than any front-end scan ever will.

The hard part is extracting that signal at scale: hit the APIs, unpick the HTML, normalize "node.js" vs "Node.js" vs "NodeJS", deduplicate, then keep the vocabulary current as new tools ship. Or call one Actor and get a typed dataset back. Here's how the second option works.

What is an ATS? 🔎

An Applicant Tracking System (ATS) is the software companies use to post jobs, accept applications, and manage candidates. The three platforms this Actor covers — Greenhouse, Lever, and Ashby — collectively handle a large share of Series A+ tech-company hiring, and all three publish their job-board data via public, unauthenticated REST APIs so that aggregators and "Powered by Greenhouse" listings can pull postings without a login.

That public-API design is the foundation this Actor builds on. We call the same endpoint a job-board aggregator would, then add what they don't: tech-stack extraction from the description text, normalization to a canonical vocabulary, and a flat, validated output schema.

Does Greenhouse have an API for job postings?

Yes. Greenhouse's Job Board API (boards-api.greenhouse.io/v1/boards/{token}/jobs?content=true) returns every active posting for any company with a public board, including the full description HTML. Lever exposes api.lever.co/v0/postings/{token} on the same token-in-path pattern, and Ashby uses api.ashbyhq.com/posting-api/job-board/{token}, where the token is case-sensitive (Ramp works; ramp returns zero jobs). None of the three requires an API key or OAuth.
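A minimal sketch of how those three endpoints could be assembled from a token. The URL paths are the ones named above; the https:// scheme, the ENDPOINTS dict, and the job_board_url name are illustrative assumptions, not the Actor's internals.

```python
from urllib.parse import quote

# URL templates taken from the endpoints named in the post.
# The dict and function names are hypothetical, for illustration only.
ENDPOINTS = {
    "greenhouse": "https://boards-api.greenhouse.io/v1/boards/{token}/jobs?content=true",
    "lever": "https://api.lever.co/v0/postings/{token}",
    "ashby": "https://api.ashbyhq.com/posting-api/job-board/{token}",
}


def job_board_url(ats: str, token: str) -> str:
    """Return the public job-board URL for one company token."""
    if ats not in ENDPOINTS:
        raise ValueError(f"unsupported ATS: {ats!r}")
    # The token's casing is left untouched: Ashby treats "Ramp" and "ramp" differently.
    return ENDPOINTS[ats].format(token=quote(token, safe=""))


print(job_board_url("ashby", "Ramp"))
```

Keeping the templates in one table means a fourth ATS would be a one-line addition, and the unsupported case fails immediately instead of producing a malformed URL.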

What the data looks like

Each job posting produces one flat, typed row. Every field comes directly from the ATS response or from our detection pass — nothing is inferred or synthesised:

{
  "ats": "greenhouse",
  "company_token": "airtable",
  "job_id": "4567890",
  "title": "Senior Backend Engineer",
  "location": "Remote, US",
  "department": "Engineering",
  "url": "https://boards.greenhouse.io/airtable/jobs/4567890",
  "description_text": "We're looking for a backend engineer with deep experience in Python, Django, Postgres and Redis. Our data pipeline runs on AWS with Kubernetes...",
  "detected_techs": ["AWS", "Django", "Kubernetes", "Postgres", "Python", "Redis"],
  "posted_at": "2026-05-01T09:00:00+00:00",
  "scraped_at": "2026-05-31T14:22:11+00:00"
}

Eleven fields, the same shape every time. detected_techs is sorted (case-insensitive) and deduplicated. Every row is Pydantic-validated before it is written, so you never receive a missing required field or an ats value outside ["greenhouse", "lever", "ashby"].
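To make the validation step concrete, here is a rough stand-in written with a stdlib dataclass instead of Pydantic, so it runs without extra dependencies. The post does not show the real model, so which fields are optional and how errors are worded are assumptions on my part.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ALLOWED_ATS = ("greenhouse", "lever", "ashby")


@dataclass
class JobRow:
    # Field names follow the sample row; optionality is an assumption.
    ats: str
    company_token: str
    job_id: str
    title: str
    url: str
    description_text: str
    scraped_at: str
    location: Optional[str] = None
    department: Optional[str] = None
    posted_at: Optional[str] = None
    detected_techs: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.ats not in ALLOWED_ATS:
            raise ValueError(f"ats must be one of {ALLOWED_ATS}, got {self.ats!r}")
        for name in ("company_token", "job_id", "title", "url"):
            if not getattr(self, name):
                raise ValueError(f"missing required field: {name}")
        # Deduplicate, then sort case-insensitively (ties broken by exact string).
        self.detected_techs = sorted(
            set(self.detected_techs), key=lambda t: (t.casefold(), t)
        )


row = JobRow(
    ats="greenhouse", company_token="airtable", job_id="4567890",
    title="Senior Backend Engineer",
    url="https://boards.greenhouse.io/airtable/jobs/4567890",
    description_text="...", scraped_at="2026-05-31T14:22:11+00:00",
    detected_techs=["Redis", "AWS", "Python", "AWS"],
)
print(row.detected_techs)
```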

The naive approach (and why it falls apart)

The first attempt looks straightforward: hit the API, grab the description, split on whitespace, look for known tech names. It breaks almost immediately, and the breakage is instructive.

1. HTML encoding layers. Greenhouse double-encodes its content field: the HTML you receive contains &lt;div&gt; where the original had <div>. A single html.unescape() leaves literal <div> tags in the text that your regex then trips over. The parser unescapes twice, then strips tags, before the vocabulary scan runs.

2. Lever's split description. Lever's descriptionPlain often omits everything inside "Requirements" bullets — that content lives in lists[].content chunks in a separate array. Read only descriptionPlain and you silently miss half the tech signals in engineering roles. We concatenate every lists[].content chunk before scanning.

3. Case-sensitivity on Ashby tokens. Write ramp instead of Ramp and Ashby's API returns an empty list — not a 404, just zero jobs. Without a guard, you produce an empty dataset and think you scraped it. The Actor fails loud when every company returns zero rows rather than reporting a hollow success.

4. Rate-limiting. All three APIs throttle undeclared bursts. We retry on 429 and 503 with exponential backoff (2 seconds, doubling, capped at 30 seconds, up to 5 attempts), and we honour Retry-After when the server sends it. On multi-company runs, one company hitting a rate limit doesn't abort the rest: each company's fetch loop is isolated, so partial success is reported as partial success, not total failure.

5. Vocabulary normalization. Job descriptions use "Node.js", "NodeJS", "Node JS", and "Node" to mean the same runtime, and "postgres", "Postgres", and "PostgreSQL" for the same database. A naive substring match either misses casing variants or fires on the wrong word ("Java" inside "JavaScript"). The vocabulary uses case-insensitive word-boundary regex (\bPostgres\b), longest-match-first, so the canonical name is emitted regardless of how the author capitalised it and substrings never false-fire.

Underneath, every request goes out through curl-cffi impersonating a real Chrome 131 TLS + HTTP/2 fingerprint, we thread Apify residential proxies when you flip the useProxy flag on, and we hand back Pydantic-validated typed rows. No data means no charge.
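Points 1 and 2 can be sketched in a few lines of stdlib Python. This is an approximation, not the Actor's parser: the regex-based tag stripper is a simplification, and the helper names clean_html and lever_text are hypothetical. The only Lever fields used are the two named above, descriptionPlain and lists[].content.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def clean_html(raw: str) -> str:
    """Unescape twice, strip tags, then collapse whitespace."""
    text = html.unescape(html.unescape(raw))
    text = TAG_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


def lever_text(posting: dict) -> str:
    """Join descriptionPlain with every lists[].content chunk."""
    parts = [posting.get("descriptionPlain") or ""]
    for block in posting.get("lists") or []:
        parts.append(clean_html(block.get("content") or ""))
    return WS_RE.sub(" ", " ".join(parts)).strip()


print(clean_html("&lt;div&gt;Python &amp;amp; Django&lt;/div&gt;"))
print(lever_text({
    "descriptionPlain": "We build APIs.",
    "lists": [{"content": "<li>Rust</li><li>Postgres</li>"}],
}))
```

One trade-off worth knowing: unescaping twice also decodes text that was meant to stay escaped, such as a description that deliberately shows an entity as an example. For tech-name detection that rarely matters, but it is why description_text should be treated as cleaned text and not a faithful copy.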

None of that is complicated in hindsight. All of it is the difference between a script that worked on three companies in a notebook and a pipeline that survives 500 companies overnight.
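The retry and isolation behaviour from points 3 and 4 might look roughly like this. It is a sketch under stated assumptions: send and fetch are hypothetical callables standing in for the real HTTP layer, Retry-After is assumed to arrive as a number of seconds, and "up to 5 attempts" is read as five requests with four waits between them.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Tuple

RETRYABLE = {429, 503}
MAX_ATTEMPTS = 5


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 2.0, cap: float = 30.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        return float(retry_after)  # assumption: the server's hint is used as-is
    return min(base * 2 ** (attempt - 1), cap)


def fetch_with_retry(send: Callable[[], Tuple[int, Optional[float], object]],
                     sleep: Callable[[float], None] = time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, retry_after, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt < MAX_ATTEMPTS:
            sleep(backoff_delay(attempt, retry_after))
    raise RuntimeError(f"still throttled after {MAX_ATTEMPTS} attempts")


def scrape_all(companies: Iterable[str],
               fetch: Callable[[str], List[dict]]) -> Tuple[List[dict], Dict[str, str]]:
    """Fetch each company independently; fail loud only if nothing came back."""
    rows: List[dict] = []
    errors: Dict[str, str] = {}
    for company in companies:
        try:
            rows.extend(fetch(company))
        except Exception as exc:  # one failing company must not abort the rest
            errors[company] = str(exc)
    if not rows:
        raise RuntimeError(f"no rows from any company; check tokens. errors={errors}")
    return rows, errors
```

Passing sleep in as a parameter keeps the retry loop testable without real waiting, and returning the errors dict alongside the rows is what lets a caller report partial success honestly.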

The Actor ⚙️

I packaged this as an Apify Actor: ATS Tech Stack Scraper.

You can run it from the Apify Console by filling in the input form, or call it programmatically:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/ats-tech-stack-scraper").call(
    run_input={
        "companies": [
            {"companyToken": "airtable", "atsType": "greenhouse"},
            {"companyToken": "palantir", "atsType": "lever"},
            {"companyToken": "Ramp", "atsType": "ashby"},
        ],
        "maxJobsPerCompany": 100,
        "minTechsDetected": 2,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["company_token"], item["title"], item["detected_techs"])

minTechsDetected: 2 is a useful floor: it drops every row with fewer than two detected technologies, which removes most sales, marketing, and ops roles, so you keep only postings where the vocabulary fired at least twice. maxJobsPerCompany caps spend on large companies with hundreds of open roles.

Finding a company's ATS token — it's the board slug in the careers URL: boards.greenhouse.io/{token} (e.g. airtable), jobs.lever.co/{token} (e.g. palantir), or jobs.ashbyhq.com/{Token} (e.g. Ramp — case-sensitive; lowercase returns zero jobs).
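If you already have a list of careers URLs, a small helper can turn them into input entries. This is a sketch that only recognises the three hosts named above; custom career domains are not handled, and the function name is hypothetical.

```python
from urllib.parse import urlparse

# Hosts as named in the post; anything else is rejected.
HOST_TO_ATS = {
    "boards.greenhouse.io": "greenhouse",
    "jobs.lever.co": "lever",
    "jobs.ashbyhq.com": "ashby",
}


def parse_careers_url(url: str) -> dict:
    """Return {"companyToken": ..., "atsType": ...} for a careers URL."""
    parsed = urlparse(url if "://" in url else "https://" + url)
    ats = HOST_TO_ATS.get((parsed.hostname or "").lower())
    segments = [s for s in parsed.path.split("/") if s]
    if ats is None or not segments:
        raise ValueError(f"not a recognised job-board URL: {url!r}")
    # Path casing is preserved, which matters for Ashby tokens.
    return {"companyToken": segments[0], "atsType": ats}


print(parse_careers_url("jobs.ashbyhq.com/Ramp"))
```

The returned dict uses the same keys as the companies entries in the run input above, so the results can be passed straight through.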

Use cases 💡

Three buyer types, five concrete scenarios:

1. B2B sales qualification. You sell a Postgres performance tool. Filter detected_techs for "Postgres" and sort by posted_at recency — any company actively hiring "Senior Backend Engineer" with Postgres in the JD is pre-qualified, and you know they're staffing that stack now, not three years ago.

2. Recruiter sourcing. Filling a Rust role? Pull every engineering posting from your target-company list, filter detected_techs for "Rust", and you have a live shortlist of companies already running it in production. Pair with minTechsDetected: 3 to drop thin postings that name Rust and little else; the floor counts distinct technologies, so treat it as a rough proxy for how engineering-heavy a posting is.

3. Competitive intelligence. Track which competitors are posting Kubernetes, Snowflake, or dbt roles. A cluster of new infra-heavy postings often signals a platform migration before it becomes a press release.

4. CRM enrichment. Feed the output into Clay, HubSpot Operations Hub, or Attio as a custom enrichment step — replacing a BuiltWith seat for the subset of back-end and data-platform signals engineering-buying teams actually need.

5. Investment research. Map a pipeline target's tech footprint across quarters. A company that moves from "MySQL" to "Postgres" and "Snowflake" postings in eighteen months is scaling its data stack — a signal front-end sniffers miss entirely.
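Most of these scenarios reduce to the same two operations on the dataset: filter on detected_techs, then sort by posted_at. A small sketch, assuming every row carries an ISO-8601 posted_at like the sample row (the helper name and the two example rows are made up):

```python
from datetime import datetime
from typing import Iterable, List


def hiring_for(rows: Iterable[dict], tech: str) -> List[dict]:
    """Rows whose detected_techs include `tech`, newest posting first."""
    wanted = tech.casefold()
    hits = [
        row for row in rows
        if any(t.casefold() == wanted for t in row.get("detected_techs", []))
    ]
    return sorted(
        hits, key=lambda r: datetime.fromisoformat(r["posted_at"]), reverse=True
    )


# Hypothetical rows in the output shape shown earlier.
rows = [
    {"company_token": "a", "detected_techs": ["Postgres", "Python"],
     "posted_at": "2026-04-01T09:00:00+00:00"},
    {"company_token": "b", "detected_techs": ["Kafka"],
     "posted_at": "2026-05-10T09:00:00+00:00"},
    {"company_token": "c", "detected_techs": ["Django", "Postgres"],
     "posted_at": "2026-05-20T09:00:00+00:00"},
]
print([r["company_token"] for r in hiring_for(rows, "postgres")])
```

Parsing the timestamps before sorting avoids relying on string order, which only works while every row uses the same UTC offset.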

Pricing — exact numbers 💰

Pay-per-event. You pay for rows you get; rows that don't land cost nothing.

Event	Price	When
`actor-start`	$0.05	Once per run
`result-row`	$0.005	Per job row emitted

Rows pulled	Cost
50 rows	$0.30
200 rows	$1.05
1,000 rows	$5.05
10,000 rows	$50.05

For a 200-company qualification pass (roughly 2,000–4,000 rows), this Actor costs $10–$20 — no subscription, no seat limit, no contract. Apify's $5 free trial credit covers your first ~990 rows with no credit card.
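The arithmetic behind those figures, as a small sketch. The two prices come from the table above; the function names are mine, and Decimal is used so that half-cent prices do not pick up floating-point noise.

```python
from decimal import Decimal

START_FEE = Decimal("0.05")   # actor-start, once per run
ROW_PRICE = Decimal("0.005")  # result-row, per job row emitted


def run_cost(rows: int) -> Decimal:
    """Total cost in USD of a single run that emits `rows` rows."""
    return START_FEE + ROW_PRICE * rows


def rows_within(budget: Decimal) -> int:
    """How many rows a single run can emit without exceeding `budget`."""
    if budget < START_FEE:
        return 0
    return int((budget - START_FEE) // ROW_PRICE)


for n in (50, 200, 1000, 10000):
    print(n, run_cost(n))
print(rows_within(Decimal("5")))
```

Because the start fee is charged per run, splitting the same workload across many small runs costs slightly more than one large run.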

The technically interesting bit

The vocabulary is the product. ~110 canonical tech names sounds small, but precision matters more than recall here: a false positive ("Java" matching "JavaScript" without word boundaries) poisons the very signal buyers are paying for. Every entry uses case-insensitive word-boundary matching (\bPostgres\b), and the alternation is sorted longest-first so "PostgreSQL" wins over "Postgres" on a shared prefix while "Javanese" and "Djangocon" never fire. Buyers in this category have been burned by LLM-based detection inventing plausible-sounding stacks; a deterministic, auditable regex is the antidote. Missing a tool? Open a feedback ticket and it ships in the next version, no model fine-tuning required.
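Here is a toy version of that matcher with a nine-entry vocabulary. The alias table is a hypothetical slice I made up for the example, not the Actor's real list; the technique (case-insensitive \b matching over a longest-first alternation, mapped back to a canonical name) is the one described above.

```python
import re
from typing import List

# Lowercase alias -> canonical name. Hypothetical sample entries only.
ALIASES = {
    "postgresql": "Postgres",
    "postgres": "Postgres",
    "node.js": "Node.js",
    "nodejs": "Node.js",
    "node js": "Node.js",
    "javascript": "JavaScript",
    "java": "Java",
    "django": "Django",
    "kubernetes": "Kubernetes",
}

# Longest aliases first, each escaped so "." in "node.js" is literal.
_alternation = "|".join(
    re.escape(alias) for alias in sorted(ALIASES, key=len, reverse=True)
)
TECH_RE = re.compile(rf"\b(?:{_alternation})\b", re.IGNORECASE)


def detect_techs(text: str) -> List[str]:
    """Return canonical tech names found in text, deduplicated and sorted."""
    found = {ALIASES[m.group(0).lower()] for m in TECH_RE.finditer(text)}
    return sorted(found, key=str.casefold)


sample = "We use PostgreSQL, NodeJS and JavaScript. Javanese speakers and Djangocon fans welcome."
print(detect_techs(sample))
```

A plain \b only works for names that start and end with a letter, digit, or underscore. Entries such as "C++" or ".NET" would need a different boundary rule, which is one reason a curated list needs hand-tuning per entry.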

Limitations 🚧

Regex detection only. Tools outside the curated ~110-name vocabulary won't appear in detected_techs. Missing one? Submit a feedback request on the Store page and it ships in the next release.
You supply the token. The Actor requires the ATS board token — it does not discover which ATS a company uses or look it up by company name. No LinkedIn or Workday support.
Greenhouse double-encoding edge cases. The parser handles the common double-encode; exotic Greenhouse configs may leave extra HTML fragments in description_text.
Ashby case-sensitivity is on you. Pass ramp when the token is Ramp and you get zero rows — find the correct casing in the Ashby job-board URL.
7-day default storage. On the FREE plan, default datasets expire after 7 days. For persistent pipelines, open a named dataset: Actor.open_dataset(name="my-stack-data").

FAQ

Is it legal to scrape job boards?
These endpoints are public APIs published intentionally by Greenhouse, Lever, and Ashby so that job aggregators can distribute postings. The Actor requests only publicly visible job post data at a rate the APIs can handle, collects only company-aggregated tech signals (no personal applicant data), and bypasses no authentication. As always, review your own jurisdiction and intended use case before running in production.

Can I export the results to a spreadsheet or data warehouse?
Yes. The Apify Console exports CSV, Excel, and JSON directly from the dataset view. You can also webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n, or pull it programmatically via the Apify API.

Is there an official API for this?
The Actor itself is the programmatic interface — call it via the Apify Python client or REST API. The underlying Greenhouse, Lever, and Ashby job-board APIs are official and publicly documented.

How do I find a company's ATS if I don't know it?
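Look for the "Powered by Greenhouse / Lever / Ashby" badge on their careers page, or inspect the job-listing URL: the host tells you which ATS it is, and the token is the first path segment after the domain (for example, airtable in boards.greenhouse.io/airtable/jobs/4567890).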

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/ats-tech-stack-scraper.

Free $5 trial credit, no credit card. Drop in Airtable on Greenhouse and Ramp on Ashby, set minTechsDetected: 2, and you'll have a cross-company tech-stack dataset in under a minute. Building a Clay recipe or a qualification workflow and want a specific output field? Leave a comment — I build based on what people actually need.

Built by Devil Scrapes — Apify Actors that do the dirty work so your dataset stays clean. 😈

DEV Community