For a simpler approach using CSS selectors and BeautifulSoup, check out our beginner's guide to job scraping. This article covers advanced techniques for production-scale job data extraction.
Effective job scraping pipelines prioritize hidden API endpoints over brittle DOM parsing. Most tech companies rely on Applicant Tracking Systems like Greenhouse or Lever that expose structured JSON data. Directly consuming these endpoints reduces ban rates and ensures type-safe data extraction.
We reserve resource-intensive headless browsers for complex Single Page Applications like Workday or protected aggregators. The following reference implementation demonstrates this logic by automatically detecting hidden APIs for major hiring platforms.
Table of Contents
- The Universal AI Job Scraper
- Job Data Pipeline Architecture
- Direct ATS Integration Patterns
- Handling Complex Enterprise ATS
- Bypassing Aggregator Defenses
- Job Data Normalization
- Scaling and Infrastructure
- Conclusion
The Universal AI Job Scraper
If you need a resilient solution that parses job postings from any domain (Indeed, Greenhouse, company careers pages) without writing custom selectors, use AI extraction. The following script uses HasData's scraping API to handle residential proxy rotation and JavaScript rendering, plus a Large Language Model (LLM) to normalize unstructured HTML into a strict JSON schema.
```python
import requests
import json
from typing import Dict, Any

API_KEY = "YOUR_HASDATA_API_KEY"

def scrape_job_posting(url: str, api_key: str) -> Dict[str, Any]:
    """
    Scrapes any job posting URL using AI-powered extraction.
    Bypasses anti-bot protections and normalizes data to a strict schema.
    """
    payload = {
        "url": url,
        "proxyType": "residential",  # Essential for bypassing Cloudflare on Indeed/Glassdoor
        "proxyCountry": "US",
        "jsRendering": True,  # Renders React/Angular based ATS pages (Workday)
        "aiExtractRules": {
            "jobs": {
                "type": "list",
                "output": {
                    "title": {"type": "string", "description": "Job title"},
                    "company": {"type": "string", "description": "Company name"},
                    "location": {"type": "string", "description": "Job location (city, state, remote)"},
                    "salary_range": {"type": "string", "description": "Salary range or hourly rate if mentioned"},
                    "job_type": {"type": "string", "description": "Full-time, Part-time, Contract, etc."},
                    "requirements": {"type": "string", "description": "List of technical requirements and skills"},
                    "posted_date": {"type": "string", "description": "Date posted in ISO format"},
                    "application_url": {"type": "string", "description": "Direct URL to apply"}
                }
            }
        }
    }
    try:
        response = requests.post(
            "https://api.hasdata.com/scrape/web",
            headers={
                "x-api-key": api_key,
                "Content-Type": "application/json"
            },
            json=payload,
            timeout=60
        )
        response.raise_for_status()
        # Return the AI extracted content directly
        return response.json().get('scrapingResult', {}).get('extract', {})
    except requests.exceptions.RequestException as e:
        print(f"Extraction failed: {e}")
        return {}

# Usage Example
data = scrape_job_posting("https://www.indeed.com/q-usa-jobs.html?vjk=a81504eef3366054", API_KEY)
print(json.dumps(data, indent=2))
```
Extraction example:
```json
{
  "jobs": [
    {
      "title": "Voice recording (Spanish Accent, Latin America)",
      "company": "Mercor",
      "location": "New York, NY",
      "salary_range": "$25 - $35 an hour",
      "job_type": "Full-time",
      "requirements": "We're collecting ~3 hours of phone audio...",
      "application_url": "/rc/clk?jk=ccaa0338f396988e..."
    },
    {
      "title": "WAREHOUSE ATTENDANT (PART TIME)",
      "company": "Restaurant Associates",
      "location": "Chicago, IL 60607",
      "salary_range": "$32.43 an hour",
      "job_type": "Part-time",
      "requirements": "Previous food service experience is preferred.",
      "application_url": "/rc/clk?jk=32c48f4daa02f665..."
    }
  ]
}
```
The jsRendering parameter ensures that Single Page Applications are fully loaded before the AI attempts extraction, while the residential proxy setting handles the IP reputation checks common on aggregators.
Job Data Pipeline Architecture
Efficient scraping at scale requires a routing layer. Treating every URL as a generic HTML page wastes resources and increases ban rates. You must identify the underlying Applicant Tracking System (ATS) to select the optimal extraction strategy.
The Platform Specific Strategy
High-volume scraping requires cost efficiency. Hitting a public JSON endpoint costs a fraction of a cent in compute time while rendering a full browser session for Workday costs significantly more in RAM and proxy bandwidth.
Your pipeline should implement a waterfall logic flow:
1. Check for known APIs. If the URL matches a known ATS like Greenhouse, route to a lightweight `requests` handler.
2. Check for enterprise SPAs. If the target is Workday or Taleo, route to a Playwright instance with session management.
3. Fall back to universal extraction. For unknown domains or protected aggregators like Indeed, route to the AI-powered extraction API demonstrated in the previous section.
This decision tree ensures you only pay the "rendering tax" when absolutely necessary.
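The waterfall above can be sketched as a small router. The strategy names and host lists here are illustrative, not part of any API; extend them to match your targets.

```python
# Hypothetical routing sketch: map a job URL to the cheapest viable strategy.
KNOWN_API_HOSTS = ('greenhouse.io', 'lever.co')
ENTERPRISE_SPA_HOSTS = ('myworkdayjobs.com', 'taleo.net', 'icims.com')

def route_job_url(url: str) -> str:
    """Returns the cheapest viable strategy for a given job URL."""
    url_lower = url.lower()
    if any(host in url_lower for host in KNOWN_API_HOSTS):
        return 'json_api'          # Step 1: lightweight requests handler
    if any(host in url_lower for host in ENTERPRISE_SPA_HOSTS):
        return 'headless_browser'  # Step 2: Playwright with session management
    return 'ai_extraction'         # Step 3: universal AI fallback
```

Ordering matters: the cheap checks run first, so a browser is only spawned when no structured endpoint is available.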
Identifying the Underlying ATS
Before initiating a scrape, detect the ATS provider to select the correct parser. Most career pages function as a white-label frontend for a major provider. Analyze the URL structure or window objects to identify the backend.
```python
def detect_ats_provider(url: str) -> str:
    """
    Routes job URLs to specific scrapers based on domain signatures.
    """
    ats_patterns = {
        'greenhouse': ['greenhouse.io', 'boards.greenhouse.io'],
        'lever': ['lever.co', 'jobs.lever.co'],
        'workday': ['myworkdayjobs.com', 'workday.com'],
        'ashby': ['ashbyhq.com', 'jobs.ashbyhq.com'],
        'bamboohr': ['bamboohr.com'],
        'icims': ['icims.com'],
        'smartrecruiters': ['smartrecruiters.com'],
        'jobvite': ['jobvite.com'],
        'taleo': ['taleo.net'],
        'linkedin': ['linkedin.com/jobs'],
        'indeed': ['indeed.com']
    }
    url_lower = url.lower()
    for ats, patterns in ats_patterns.items():
        if any(pattern in url_lower for pattern in patterns):
            return ats
    return 'unknown'
```
Use a lightweight check to route traffic before spawning a heavy browser instance.
Direct ATS Integration Patterns
Once your routing layer identifies the provider, use these specific extraction patterns. Platforms like Greenhouse and Lever expose public JSON endpoints used to power their frontend UIs. For more complex platforms like Ashby, we use a hybrid approach.
Greenhouse JSON Endpoints
Greenhouse is the most developer-friendly ATS. Every job board has a corresponding API endpoint that returns structured data without authentication. You only need the board token, which is usually the company name found in the URL.
```python
import requests
from typing import List, Dict

def fetch_greenhouse_jobs(board_token: str) -> List[Dict]:
    """
    Fetches raw job data directly from Greenhouse's public API.
    No proxies or rendering required.
    """
    url = f"https://boards-api.greenhouse.io/v1/boards/{board_token}/jobs?content=true"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.json().get('jobs', [])
    except requests.RequestException as e:
        print(f"Greenhouse API Error: {e}")
        return []

# Usage
# jobs = fetch_greenhouse_jobs("hasdata")
```
This method returns clean JSON with fields like absolute_url, updated_at, and metadata. Unlike HTML parsing, this endpoint remains stable even if the company completely redesigns their career page CSS.
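As a sketch, a thin flattening step can pull out just the fields most pipelines store. The field names below follow the Greenhouse job object described above (`absolute_url`, `updated_at`, `location.name`); verify them against a live response for your board.

```python
def flatten_greenhouse_job(job: dict) -> dict:
    """Reduces a Greenhouse job object to the fields most pipelines store."""
    return {
        'id': job.get('id'),
        'title': job.get('title'),
        'location': (job.get('location') or {}).get('name'),
        'url': job.get('absolute_url'),
        'updated_at': job.get('updated_at'),
    }

# Synthetic example record mirroring the API's shape
sample = {
    'id': 123,
    'title': 'Backend Engineer',
    'location': {'name': 'Remote'},
    'absolute_url': 'https://boards.greenhouse.io/acme/jobs/123',
    'updated_at': '2024-01-15T10:00:00-05:00',
}
# flatten_greenhouse_job(sample)['location'] -> 'Remote'
```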
Lever Hidden API Endpoints
Lever follows a similar pattern but uses a different endpoint structure. While they do not document this publicly for scrapers, the API is open to allow for client-side rendering of job lists.
```python
def fetch_lever_jobs(company_name: str) -> List[Dict]:
    """
    Targets Lever's internal posting API.
    """
    url = f"https://api.lever.co/v0/postings/{company_name}?mode=json"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        print(f"Lever API Error: {e}")
        return []
```
The response includes rich data often hidden from the HTML view, including internal team categorizations, commitment levels (Full-time/Contract), and hiring manager details if exposed.
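A small flattener makes that nested structure queryable. This assumes the standard Lever posting shape (`text` for the title, a `categories` object, `hostedUrl`); confirm the keys against an actual response before relying on them.

```python
def flatten_lever_posting(posting: dict) -> dict:
    """Pulls the commonly used fields out of a Lever posting object."""
    categories = posting.get('categories') or {}
    return {
        'title': posting.get('text'),
        'team': categories.get('team'),
        'commitment': categories.get('commitment'),  # e.g. "Full-time"
        'location': categories.get('location'),
        'apply_url': posting.get('hostedUrl'),
    }

# Synthetic example mirroring the assumed response shape
sample = {
    'text': 'Data Engineer',
    'hostedUrl': 'https://jobs.lever.co/acme/123',
    'categories': {'team': 'Data', 'commitment': 'Full-time', 'location': 'Remote'},
}
# flatten_lever_posting(sample)['commitment'] -> 'Full-time'
```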
Handling Ashby and BambooHR
Newer platforms like Ashby and legacy systems like BambooHR are less consistent. Ashby heavily relies on React hydration, making direct API calls difficult due to complex query parameters. For these targets, we revert to the AI extraction method to handle the JavaScript rendering and DOM parsing.
```python
def scrape_ashby_or_bamboo(url: str, api_key: str) -> Dict:
    """
    Fallback method for JS-heavy ATS platforms using HasData AI.
    """
    payload = {
        "url": url,
        "proxyType": "residential",
        "jsRendering": True,  # Essential for Ashby's React hydration
        "aiExtractRules": {
            "jobs": {
                "type": "list",
                "output": {
                    "title": {"type": "string", "description": "Job title"},
                    "company": {"type": "string", "description": "Company name"},
                    "location": {"type": "string", "description": "Job location (city, state, remote)"},
                    "salary_range": {"type": "string", "description": "Salary range or hourly rate if mentioned"},
                    "job_type": {"type": "string", "description": "Full-time, Part-time, Contract, etc."},
                    "requirements": {"type": "string", "description": "List of technical requirements and skills"},
                    "posted_date": {"type": "string", "description": "Date posted in ISO format"}
                }
            }
        }
    }
    response = requests.post(
        "https://api.hasdata.com/scrape/web",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        json=payload,
        timeout=60  # Rendering-heavy targets can take a while
    )
    return response.json().get("aiResponse", {}).get("jobs", [])
```
Using this hybrid approach allows you to scrape 80% of your targets (Greenhouse/Lever) with zero infrastructure cost, reserving your paid scraping credits for the 20% of difficult targets (Ashby/BambooHR).
Handling Complex Enterprise ATS
Enterprise platforms like Workday, Taleo, and iCIMS represent the hardest tier of job scraping. These Single Page Applications (SPAs) rely on stateful sessions, dynamic CSRF tokens, and heavy JavaScript execution. Sending a simple GET request will return a 200 OK status with a loading spinner but no data. To scrape these, you must move from HTTP clients to browser orchestration.
Playwright Network Interception
Parsing the DOM of a Workday site is inefficient because the HTML classes are often obfuscated and nested within multiple div layers. However, the data is loaded via internal JSON APIs after the page renders. Instead of scraping the visual elements, configure Playwright to intercept the background network traffic.
```python
from playwright.sync_api import sync_playwright

def scrape_workday_via_interception(url: str):
    """
    Loads Workday in a headless browser but extracts data
    by intercepting the internal XHR response.
    """
    jobs_data = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Define the interception logic before navigation
        def handle_response(response):
            # Workday typically serves job data from endpoints containing 'wday/cxs'
            if "wday/cxs" in response.url and response.request.resource_type in ["fetch", "xhr"]:
                try:
                    data = response.json()
                    # Append valid job listings to our list
                    if "jobPostings" in data:
                        jobs_data.extend(data["jobPostings"])
                except Exception:
                    pass

        # Attach the event listener
        page.on("response", handle_response)
        # Trigger the page load, which fires the XHR requests
        page.goto(url, wait_until="networkidle")
        browser.close()
    return jobs_data
```
This approach allows you to extract clean, structured JSON directly from the server response, bypassing the need to maintain complex CSS selectors for the HTML interface.
Handling Workday CSRF and Sessions
Workday is aggressive about session management. It generates a wday_vps_cookie and a corresponding CSRF token upon the first page load. If you attempt to use Python requests without these, the server rejects the connection.
If you need to scale scraping without keeping a browser open, use a hybrid approach:
- Launch Playwright to load the initial page
- Extract the Cookie header and X-CSRF-Token from the network logs
- Inject these credentials into a high-performance HTTP client (like requests or httpx) to iterate through pagination pages
This method reduces the RAM overhead by 80% compared to clicking "Next" in a browser instance for every page.
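A sketch of that hybrid flow follows. The exact cookie and CSRF header names vary by tenant, so treat the capture logic as an assumption to verify against your target's network logs; only the header-serialization helper is general-purpose.

```python
def cookies_to_header(cookies: list) -> str:
    """Serializes browser cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)

def harvest_workday_session(url: str) -> dict:
    """Loads the page once in a headless browser, returns reusable credentials."""
    # Imported lazily so the pure helper above works without Playwright installed
    from playwright.sync_api import sync_playwright

    token = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        def capture_token(request):
            # Assumption: Workday attaches the CSRF token to its own XHR calls
            for name, value in request.headers.items():
                if name.lower() == "x-csrf-token":
                    token["value"] = value

        page.on("request", capture_token)
        page.goto(url, wait_until="networkidle")
        cookies = page.context.cookies()
        browser.close()

    return {
        "Cookie": cookies_to_header(cookies),
        "X-CSRF-Token": token.get("value", ""),
    }

# The returned headers can then be injected into a plain HTTP client
# (requests/httpx) to iterate through paginated job-list endpoints.
```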
Simulating Human Inputs for Bot Defense
Enterprise firewalls often fingerprint TLS handshakes and analyze interaction patterns. To avoid detection when using headless browsers, you must actively mask automation signals. This starts by scrubbing the navigator.webdriver property to ensure it remains undefined.
Beyond static properties, you must simulate organic behavior by replacing fixed time.sleep() calls with randomized intervals and implementing non-linear mouse movements, as sophisticated platforms like Taleo analyze cursor entropy to identify bots.
```python
import random
import time

def humanize_interaction(page):
    """
    Injects random mouse movements and variable delays to mimic human behavior.
    """
    # 1. Randomize mouse movement
    for _ in range(3):
        x = random.randint(100, 800)
        y = random.randint(100, 600)
        page.mouse.move(x, y)
        time.sleep(random.uniform(0.1, 0.3))
    # 2. Variable scroll to trigger lazy loading
    page.mouse.wheel(0, random.randint(300, 700))
    time.sleep(random.uniform(0.5, 1.5))
```
Implementing these heuristics is mandatory when scraping enterprise portals to prevent your IP from being flagged as a bot.
Bypassing Aggregator Defenses
Aggregators like LinkedIn, Indeed, and Glassdoor employ the most sophisticated anti-bot systems, including TLS fingerprinting and behavioral analysis. Scraping their search result pages directly is resource-intensive and leads to rapid IP exhaustion. A more scalable architecture involves delegating the "Discovery" phase to Google and reserving direct scraping only for the final extraction.
Google SERP Discovery Strategy
Instead of fighting Indeed's pagination and rate limits, treat Google as your discovery layer. Google has already indexed the job postings. You can extract direct URLs to the job listings by scraping Google SERP using advanced search operators.
Use the site: operator to filter for direct listing pages:
- LinkedIn: `site:linkedin.com/jobs/view "python developer"`
- Indeed: `site:indeed.com/viewjob "data engineer"`
- Greenhouse: `site:boards.greenhouse.io "backend"`
The following script uses the HasData SERP API to fetch targeted job URLs. This bypasses the need to manage captchas or IP rotation for Google searches.
```python
import requests

def discover_jobs_via_serp(query: str, api_key: str):
    """
    Uses Google SERP to find direct job URLs, bypassing aggregator search restrictions.
    Query example: 'site:linkedin.com/jobs/view "python developer" "remote"'
    """
    url = "https://api.hasdata.com/scrape/google/serp"
    params = {
        "q": query,
        "location": "United States",
        "deviceType": "desktop",
    }
    headers = {
        "x-api-key": api_key,
        "Content-Type": "application/json"
    }
    try:
        response = requests.get(url, params=params, headers=headers, timeout=30)
        response.raise_for_status()
        data = response.json()
        # Extract organic links targeting job postings
        job_links = []
        if "organicResults" in data:
            for item in data["organicResults"]:
                if "linkedin.com/jobs/view" in item.get("link", ""):
                    job_links.append(item["link"])
        return job_links
    except requests.RequestException as e:
        print(f"SERP Discovery Failed: {e}")
        return []

# Usage
# links = discover_jobs_via_serp('site:linkedin.com/jobs/view "data engineer"', "YOUR_HASDATA_API_KEY")
```
This method yields clean URLs pointing directly to the job description pages, which have lighter protection than the search interfaces.
Unwinding Redirect Chains
Aggregators often mask the true source of a job with an "Apply" link that passes through a tracking redirect (e.g., indeed.com/rc/clk?jk=...). To build a high-quality dataset, you should resolve these redirects to find the canonical URL on the employer's ATS.
Once you resolve the final URL, pass it back to the Universal ATS Scraper defined in the first section. This ensures you scrape the data from the source of truth (Greenhouse/Lever) rather than the potentially stale aggregator listing.
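A minimal resolver for that step is sketched below. The tracking-path hints are assumptions drawn from the Indeed-style links shown earlier; extend the list as you encounter other aggregators.

```python
import requests

# Assumed hints for aggregator tracking links (e.g. Indeed's /rc/clk)
TRACKING_PATH_HINTS = ('/rc/clk', '/pagead/clk', 'redirect')

def looks_like_tracking_link(url: str) -> bool:
    """Cheap pre-check before spending a request on redirect resolution."""
    return any(hint in url for hint in TRACKING_PATH_HINTS)

def resolve_canonical_url(url: str) -> str:
    """Follows the redirect chain and returns the final (canonical) URL."""
    try:
        resp = requests.get(url, timeout=10, allow_redirects=True)
        return resp.url
    except requests.RequestException:
        return url  # fall back to the aggregator link on failure
```

Run `resolve_canonical_url` only on links that pass `looks_like_tracking_link`, then feed the resolved ATS URL back into your routing layer.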
LinkedIn and Indeed DOM Strategies
Sometimes you cannot bypass the aggregator. If you need to scrape specific metadata that only exists on the aggregator (such as "Easy Apply" status or "Number of Applicants"), you must interact with their DOM directly.
This requires advanced browser fingerprint management and residential proxies. We have created specialized guides and tools for these specific platforms. For production workloads, we recommend using the dedicated scrape/web endpoint with proxyType: residential enabled, as datacenter IPs are blocked instantly by these platforms.
Job Data Normalization
Raw job data is noisy and unstructured. Salaries appear as mixed strings like "$120k-150k" or "$60/hr" and tech stacks are buried in free text. To build a queryable dataset, you must normalize these fields into standardized integers and canonical entities.
Standardizing Salary Ranges via Regex
Salary normalization requires handling three variables: currency detection, period conversion (hourly to yearly), and abbreviation expansion ('k' notation). The following function unifies these into a single annual salary range.
```python
import re
from typing import Dict, Any

def normalize_salary(text: str) -> Dict[str, Any]:
    """
    Parses salary strings into standardized annual integer ranges.
    Converts hourly rates to yearly (based on 2080 hours).
    """
    if not text:
        return {'min': None, 'max': None, 'currency': 'USD'}

    # 1. Detect currency and period, then derive the multiplier
    currency = 'GBP' if '£' in text else 'EUR' if '€' in text else 'USD'
    is_hourly = any(x in text.lower() for x in ['hour', 'hr', '/hr'])
    multiplier = 2080 if is_hourly else 1000 if 'k' in text.lower() else 1
    text_clean = re.sub(r'[$,£€]', '', text)

    # 2. Try range patterns like "120k-150k" or "100,000 to 150,000"
    number = r'(\d+(?:\.\d+)?)'
    range_pattern = rf'{number}\s*k?\s*(?:[-–]|to)\s*{number}\s*k?'
    match = re.search(range_pattern, text_clean, re.IGNORECASE)
    if match:
        return {
            'min': int(float(match.group(1)) * multiplier),
            'max': int(float(match.group(2)) * multiplier),
            'currency': currency,
            'period': 'yearly'
        }

    # 3. Fall back to single values like "$50/hr" or "€60,000"
    match = re.search(number, text_clean)
    if match:
        val = int(float(match.group(1)) * multiplier)
        return {'min': val, 'max': val, 'currency': currency, 'period': 'yearly'}

    return {'min': None, 'max': None, 'currency': currency}
```
This logic ensures that a wage of "$50/hr" and a salary of "$104,000" are treated as equivalent values in your database for accurate filtering and analysis.
Extracting Tech Stacks with FlashText
Regex is inefficient for extracting thousands of keywords from long descriptions. The Aho-Corasick algorithm (implemented via the flashtext library) performs replacement and extraction in linear time. It maps multiple variations of a term to a single canonical ID.
```python
from flashtext import KeywordProcessor
from typing import List

def extract_tech_stack(description: str) -> List[str]:
    """
    Extracts canonical tech tags using the O(N) FlashText algorithm.
    Maps variations like 'React.js' -> 'React'.
    """
    processor = KeywordProcessor(case_sensitive=False)
    # Map variations to canonical names
    tech_map = {
        'Python': ['python', 'py', 'python3'],
        'JavaScript': ['javascript', 'js', 'node.js'],
        'React': ['react', 'reactjs', 'react.js'],
        'AWS': ['aws', 'amazon web services', 'ec2'],
        'Docker': ['docker', 'containerization'],
        'Kubernetes': ['kubernetes', 'k8s'],
        'PostgreSQL': ['postgresql', 'postgres', 'psql'],
        'Go': ['golang', 'go lang']
    }
    processor.add_keywords_from_dict(tech_map)
    found_keywords = processor.extract_keywords(description)
    return sorted(set(found_keywords))
```
Using FlashText reduces processing time by orders of magnitude compared to looping through regex patterns for every keyword. This is critical when processing millions of job descriptions.
Detecting Expired Listings
Job postings have a short lifecycle. Aggregators often display cached pages for jobs that are closed on the source ATS. Scraping "zombie jobs" pollutes your database and wastes resources.
Implement a "Death Check" logic during the scrape:
1. Status code. If the ATS returns 404 or 410, the job is dead.
2. Redirects. Many ATS platforms (Greenhouse, Lever) redirect expired job URLs to their main career page. If `response.url` does not match the requested URL, mark the job as expired.
3. Schema validation. Check the `validThrough` property in the JSON-LD schema if present.
```python
import requests

def check_job_health(url: str) -> str:
    """
    Determines if a job is ACTIVE, EXPIRED, or REDIRECTED_TO_INDEX.
    """
    try:
        # Follow redirects so we can inspect the final URL in the chain
        resp = requests.get(url, timeout=10, allow_redirects=True)
        # 1. Check hard 404/410
        if resp.status_code in [404, 410]:
            return "EXPIRED_HTTP_CODE"
        # 2. Check soft 404 (redirect to the career home page)
        # Heuristic: the final URL is significantly shorter than the original
        if len(resp.url) < len(url) - 10 and "jobs" not in resp.url.split("/")[-1]:
            return "EXPIRED_REDIRECT_TO_INDEX"
        # 3. Check JSON-LD validity
        # (Simplified logic: in production, parse the full HTML)
        if '"validThrough"' in resp.text:
            # Extract the date string and compare it to datetime.now()
            pass
        return "ACTIVE"
    except requests.RequestException:
        return "UNKNOWN"
```
Tracking these signals allows you to automatically purge stale listings from your database and maintain high data quality for your users.
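A purge pass over stored records might look like this. The record shape (a per-listing `health` field holding the last check result) is hypothetical; adapt it to your own schema.

```python
# Status strings mirror the health-check categories described above
EXPIRED_STATUSES = {"EXPIRED_HTTP_CODE", "EXPIRED_REDIRECT_TO_INDEX"}

def purge_stale_listings(listings: list) -> list:
    """Keeps only listings whose last health check did not flag expiry."""
    return [job for job in listings if job.get("health") not in EXPIRED_STATUSES]

# Example with a hypothetical record shape
listings = [
    {"id": 1, "health": "ACTIVE"},
    {"id": 2, "health": "EXPIRED_HTTP_CODE"},
    {"id": 3, "health": "UNKNOWN"},  # kept, so it can be retried later
]
# purge_stale_listings(listings) keeps records 1 and 3
```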
Scaling and Infrastructure
Scaling a job scraper involves more than increasing concurrent threads. Aggregators and Enterprise ATS platforms employ sophisticated defenses that detect non-human traffic patterns. To scrape at scale you must manage IP reputation and cryptographic signatures.
Residential Proxy Rotation Logic
Datacenter IPs (AWS, GCP, Azure) are flagged by default on platforms like Indeed and LinkedIn. Using them results in immediate CAPTCHAs or 403 errors. You must use residential proxies which route traffic through legitimate ISP connections.
The following function demonstrates how to use rotating residential proxies for "hard targets" with HasData API.
```python
import requests

def make_resilient_request(url: str, api_key: str):
    """
    Configures proxy rotation based on target requirements.
    """
    payload = {
        "url": url,
        "proxyType": "residential",  # Mandatory for Indeed/LinkedIn
        "proxyCountry": "US",
    }
    return requests.post(
        "https://api.hasdata.com/scrape/web",
        headers={
            "x-api-key": api_key,
            "Content-Type": "application/json"
        },
        json=payload,
        timeout=60
    )
```
Using residential proxies increases the cost per request but is often the only way to access data from protected domains. Route internal ATS API calls (Greenhouse/Lever) through cheaper datacenter proxies and reserve residential bandwidth for the aggregators.
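That split can be captured in a small helper next to your ATS detector. The domain list is illustrative; grow it as you add hard targets.

```python
# Hypothetical tiering rule: hostile aggregators get residential IPs,
# everything else (open ATS APIs) gets the cheap datacenter tier.
RESIDENTIAL_TARGETS = ('indeed.com', 'linkedin.com', 'glassdoor.com')

def choose_proxy_type(url: str) -> str:
    """Reserves expensive residential bandwidth for protected aggregators."""
    url_lower = url.lower()
    if any(domain in url_lower for domain in RESIDENTIAL_TARGETS):
        return 'residential'
    return 'datacenter'  # Greenhouse/Lever endpoints tolerate datacenter IPs
```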
TLS Fingerprint Management
Standard Python libraries like requests or urllib leak their identity during the TLS handshake. They use OpenSSL cipher suites that differ significantly from Chrome or Firefox. Security providers like Cloudflare and Akamai analyze this JA3 Fingerprint and block connections before they even send HTTP headers.
If you see 403 Forbidden errors despite using high-quality residential proxies, your TLS fingerprint is likely the cause. You have two engineering solutions:
1. Client-side impersonation. Use libraries like `curl_cffi` or `tls-client` in Python, which emulate browser-specific TLS Hello packets.
2. Server-side offloading. Use a scraping API that manages the handshake. The `scrape/web` endpoint demonstrated throughout this guide automatically rotates TLS fingerprints.
By addressing both IP reputation and TLS signatures you ensure your pipeline remains resilient against modern bot mitigation strategies.
Conclusion
Building a production-grade job scraper requires architectural discipline rather than simple script writing. We have demonstrated that stability comes from decoupling discovery from extraction and prioritizing hidden APIs over visual DOM elements. Your pipeline must route traffic intelligently. Send Greenhouse links to lightweight JSON parsers and reserve heavy residential proxies for hostile aggregators like Indeed.
Data quality is equally critical. You must implement strict normalization layers for salaries and tech stacks to turn raw text into queryable insights. Without these hygiene checks, your dataset will degrade due to expired listings and inconsistent formatting.
Start by implementing the Universal ATS Scraper function provided in the first section of this guide. It offers the highest return on investment by covering the majority of tech listings with zero maintenance. As you scale to aggressive aggregators, integrate the residential proxy rotation and TLS management logic to ensure long-term resilience.