Glassdoor remains one of the richest sources of employment data on the web — job listings, company reviews, salary ranges, interview experiences, and benefits information. For data engineers building HR tech platforms, recruiters creating competitive intelligence tools, or researchers analyzing labor market trends, programmatic access to this data is essential.
This guide covers the practical techniques for extracting Glassdoor data in 2026, including the challenges you'll face and production-ready code to get you started.
## Understanding Glassdoor's Data Structure
Before writing any code, it helps to understand what Glassdoor exposes and how it's organized.
**Job listings** are the most straightforward. Each listing includes title, company, location, salary estimate (when available), posting date, and a detailed description. Jobs are organized by search queries and filters — location, salary range, company size, and job type.

**Company reviews** are structured with an overall rating (1–5), sub-ratings (culture, work-life balance, compensation, management, career opportunities), pros/cons text, employment status, the reviewer's job title, and review date. Reviews are paginated — typically 10 per page.

**Salary data** includes job title, company, base pay range (low/median/high), total compensation, years of experience, and location. This is arguably the most valuable dataset Glassdoor offers.
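Before scraping, it helps to model these records as typed containers so every extraction step produces the same shape. A minimal sketch — the field names here are our own, not Glassdoor's schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Review:
    """One company review, fields optional because pages vary."""
    rating: Optional[float] = None
    title: Optional[str] = None
    pros: Optional[str] = None
    cons: Optional[str] = None
    date: Optional[str] = None

@dataclass
class SalaryRecord:
    """One salary row, keyed to the employer it came from."""
    job_title: str
    employer_id: str
    pay_low: Optional[int] = None
    pay_high: Optional[int] = None

# asdict() gives a plain dict, ready for JSON serialization
review = Review(rating=4.0, title="Great team", pros="Smart colleagues")
print(asdict(review)["rating"])  # 4.0
```

Dataclasses keep downstream code honest: a missing field is an explicit `None`, not a silent `KeyError`.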
Glassdoor URLs follow predictable patterns:
```text
# Job listings
https://www.glassdoor.com/Job/san-francisco-python-developer-jobs-SRCH_IL.0,13_IC1147401_KO14,30.htm

# Company reviews
https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm

# Salary data
https://www.glassdoor.com/Salary/Google-Salaries-E9079.htm
```
The employer ID (e.g., E9079 for Google) is the key linking entity across all data types.
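Because the employer ID is the common key, a small helper can derive the review and salary URLs from one ID. This assumes the simple `<Name>-<Type>-<EmployerID>.htm` pattern shown above; if the slug is slightly off, Glassdoor redirects to the canonical URL:

```python
BASE = "https://www.glassdoor.com"

def employer_urls(name: str, employer_id: str) -> dict:
    """Build reviews and salaries URLs for one employer ID.

    Slug handling is deliberately naive (spaces -> hyphens);
    Glassdoor canonicalizes via redirect if it doesn't match exactly.
    """
    slug = name.strip().replace(" ", "-")
    return {
        "reviews": f"{BASE}/Reviews/{slug}-Reviews-{employer_id}.htm",
        "salaries": f"{BASE}/Salary/{slug}-Salaries-{employer_id}.htm",
    }

urls = employer_urls("Google", "E9079")
print(urls["reviews"])  # https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm
```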
## Setting Up Your Scraping Environment
Glassdoor is a JavaScript-heavy application, so you'll need a browser automation approach for most data types. Here's the recommended stack:
```text
# requirements.txt
playwright==1.44.0
selectolax==0.3.21
httpx==0.27.0
```
Install and set up:
```bash
pip install playwright selectolax httpx
playwright install chromium
```
Here's the base scraper class:
```python
import asyncio
import random

from playwright.async_api import async_playwright


class GlassdoorScraper:
    def __init__(self, headless=True):
        self.headless = headless
        self.base_url = "https://www.glassdoor.com"

    async def init_browser(self):
        self.pw = await async_playwright().start()
        self.browser = await self.pw.chromium.launch(
            headless=self.headless,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ],
        )
        self.context = await self.browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
        )
        # Remove the navigator.webdriver flag that betrays automation
        await self.context.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
        )
        self.page = await self.context.new_page()

    async def random_delay(self, min_sec=1.5, max_sec=4.0):
        """Sleep a random interval to avoid a machine-like request cadence."""
        await asyncio.sleep(random.uniform(min_sec, max_sec))

    async def close(self):
        await self.browser.close()
        await self.pw.stop()
```
## Scraping Job Listings
Job listings are the easiest entry point. Glassdoor loads job data both in the HTML and via XHR/GraphQL requests. Intercepting those API responses is often more robust than parsing the DOM, so the method below captures them as a fallback while extracting the visible fields from the rendered job cards:
```python
    # Method of the GlassdoorScraper class above
    async def scrape_jobs(self, query, location, max_pages=5):
        jobs = []
        api_responses = []

        # Capture Glassdoor's GraphQL responses as they stream in --
        # useful as a fallback if the DOM selectors break
        async def handle_response(response):
            if "api-cloud" in response.url and response.status == 200:
                try:
                    api_responses.append(await response.json())
                except Exception:
                    pass

        self.page.on("response", handle_response)

        # Real search URLs also encode location IDs and keyword offsets;
        # this simplified form relies on Glassdoor redirecting to the
        # canonical URL. query/location should be pre-slugified (hyphens).
        search_url = (
            f"{self.base_url}/Job/{location}-{query}-jobs-"
            f"SRCH_KO0,{len(query)}.htm"
        )
        await self.page.goto(search_url, wait_until="networkidle")
        await self.random_delay()

        for page_num in range(max_pages):
            # Extract the fields we need from each rendered job card
            job_cards = await self.page.query_selector_all(
                '[data-test="jobListing"]'
            )
            for card in job_cards:
                title_el = await card.query_selector('[data-test="job-title"]')
                company_el = await card.query_selector('[data-test="emp-name"]')
                location_el = await card.query_selector('[data-test="emp-location"]')
                salary_el = await card.query_selector('[data-test="detailSalary"]')
                jobs.append({
                    "title": await title_el.inner_text() if title_el else None,
                    "company": await company_el.inner_text() if company_el else None,
                    "location": await location_el.inner_text() if location_el else None,
                    "salary": await salary_el.inner_text() if salary_el else None,
                })

            # Navigate to the next page, stopping at the last one
            next_btn = await self.page.query_selector(
                'button[data-test="pagination-next"]'
            )
            if not next_btn or not await next_btn.is_enabled():
                break
            await next_btn.click()
            await self.page.wait_for_load_state("networkidle")
            await self.random_delay(2.0, 5.0)

        return jobs
```
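Once `scrape_jobs` returns, persisting results as JSON Lines makes incremental runs easy to append, dedupe, and re-load. A small sketch (file path and record shape are illustrative):

```python
import json
import tempfile
from pathlib import Path

def save_jsonl(records, path):
    """Append dict records to a JSON Lines file, one object per line."""
    with Path(path).open("a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def load_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

path = Path(tempfile.mkdtemp()) / "jobs.jsonl"
save_jsonl([{"title": "Python Developer", "company": "Acme"}], path)
print(load_jsonl(path)[0]["title"])  # Python Developer
```

Append-only JSONL also means a crashed run loses at most one partially written line, not the whole dataset.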
## Extracting Company Reviews
Reviews require more careful handling because Glassdoor actively protects this data. You'll often need to dismiss login modals and handle lazy-loaded content:
```python
    # Method of the GlassdoorScraper class above
    async def scrape_reviews(self, employer_id, max_pages=10):
        reviews = []
        # Generic slug; Glassdoor redirects to the employer's canonical URL
        url = f"{self.base_url}/Reviews/Company-Reviews-{employer_id}.htm"
        await self.page.goto(url, wait_until="networkidle")
        await self.random_delay()

        # Dismiss any modal overlays before touching the page
        try:
            close_btn = await self.page.wait_for_selector(
                '[data-test="close-modal"], .modal_closeIcon',
                timeout=3000,
            )
            if close_btn:
                await close_btn.click()
        except Exception:
            pass

        for page_num in range(max_pages):
            review_elements = await self.page.query_selector_all(
                '[data-test="employerReview"]'
            )
            for el in review_elements:
                rating_el = await el.query_selector('[class*="ratingNumber"]')
                title_el = await el.query_selector('[data-test="review-details-title"]')
                pros_el = await el.query_selector('[data-test="review-text-pros"]')
                cons_el = await el.query_selector('[data-test="review-text-cons"]')
                date_el = await el.query_selector('[data-test="review-details-date"]')
                reviews.append({
                    "rating": await rating_el.inner_text() if rating_el else None,
                    "title": await title_el.inner_text() if title_el else None,
                    "pros": await pros_el.inner_text() if pros_el else None,
                    "cons": await cons_el.inner_text() if cons_el else None,
                    "date": await date_el.inner_text() if date_el else None,
                })

            # Paginate until the next button disappears or is disabled
            next_btn = await self.page.query_selector(
                'button[data-test="pagination-next"]'
            )
            if not next_btn or not await next_btn.is_enabled():
                break
            await next_btn.click()
            await self.page.wait_for_load_state("networkidle")
            await self.random_delay(3.0, 6.0)

        return reviews
```
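The `rating` field comes back as display text (e.g. `"4.0"`), so normalize it before analysis. A hedged sketch that tolerates missing or unparseable values:

```python
def summarize_ratings(reviews):
    """Convert rating strings to floats and compute an average.

    Skips records whose rating is missing or not a number, so partially
    scraped pages don't poison the aggregate.
    """
    values = []
    for r in reviews:
        try:
            values.append(float(str(r.get("rating", "")).strip()))
        except ValueError:
            continue
    return {
        "count": len(values),
        "average": round(sum(values) / len(values), 2) if values else None,
    }

sample = [{"rating": "4.0"}, {"rating": "3.5"}, {"rating": None}]
print(summarize_ratings(sample))  # {'count': 2, 'average': 3.75}
```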
## Collecting Salary Data
Salary data is the most commercially valuable dataset on Glassdoor. The data is partially rendered server-side, which can simplify extraction:
```python
    # Method of the GlassdoorScraper class above
    async def scrape_salaries(self, employer_id, max_pages=5):
        salaries = []
        # Generic slug; Glassdoor redirects to the employer's canonical URL
        url = f"{self.base_url}/Salary/Company-Salaries-{employer_id}.htm"
        await self.page.goto(url, wait_until="networkidle")
        await self.random_delay()

        for page_num in range(max_pages):
            salary_rows = await self.page.query_selector_all(
                '[data-test="salaries-list-item"]'
            )
            for row in salary_rows:
                title_el = await row.query_selector(
                    '[data-test="salaries-list-item-job-title"]'
                )
                pay_el = await row.query_selector(
                    '[data-test="salaries-list-item-salary-info"]'
                )
                salaries.append({
                    "job_title": await title_el.inner_text() if title_el else None,
                    "pay_range": await pay_el.inner_text() if pay_el else None,
                    "employer_id": employer_id,
                })

            next_btn = await self.page.query_selector(
                'button[data-test="pagination-next"]'
            )
            if not next_btn or not await next_btn.is_enabled():
                break
            await next_btn.click()
            await self.page.wait_for_load_state("networkidle")
            await self.random_delay(2.0, 5.0)

        return salaries
```
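The `pay_range` field is display text like `"$120K - $150K"`. Glassdoor's formats vary (hourly pay, single figures, other currencies), so any parser should fail soft. A sketch covering only the common `$<n>K` pattern:

```python
import re

def parse_pay_range(text):
    """Parse strings like '$120K - $150K' into integer dollar bounds.

    Returns (low, high); (None, None) for formats this sketch
    doesn't recognize, e.g. hourly rates or non-USD currencies.
    """
    if not text:
        return (None, None)
    matches = re.findall(r"\$(\d+(?:\.\d+)?)\s*K", text, flags=re.IGNORECASE)
    if not matches:
        return (None, None)
    values = [int(float(m) * 1000) for m in matches]
    if len(values) == 1:
        return (values[0], values[0])
    return (min(values), max(values))

print(parse_pay_range("$120K - $150K"))  # (120000, 150000)
```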
## Handling Anti-Bot Protection
Glassdoor uses several layers of protection. Here's what you'll encounter and how to handle each:
### 1. Rate Limiting
Glassdoor will throttle or block IPs that make too many requests. Space your requests and rotate proxies:
```python
import itertools

class ProxyRotator:
    """Cycle through a fixed proxy pool, one proxy per call."""

    def __init__(self, proxies):
        self.cycle = itertools.cycle(proxies)

    def next(self):
        return next(self.cycle)

# Usage with Playwright: give each new context a different proxy
proxy_rotator = ProxyRotator([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
])

context = await browser.new_context(
    proxy={"server": proxy_rotator.next()}
)
```
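Proxy rotation handles IP diversity; pacing handles volume. A minimal sliding-window limiter you could call before each request — the thresholds and jitter range are illustrative, not tuned to Glassdoor's actual limits:

```python
import asyncio
import random
import time

class RateLimiter:
    """Allow at most max_calls requests per period seconds, with jitter."""

    def __init__(self, max_calls=10, period=60.0):
        self.max_calls = max_calls
        self.period = period
        self.timestamps = []

    async def wait(self):
        now = time.monotonic()
        # Drop timestamps that have left the sliding window
        self.timestamps = [t for t in self.timestamps if now - t < self.period]
        if len(self.timestamps) >= self.max_calls:
            # Sleep until the oldest call ages out, plus random jitter
            sleep_for = self.period - (now - self.timestamps[0])
            await asyncio.sleep(sleep_for + random.uniform(0.1, 0.5))
        self.timestamps.append(time.monotonic())

limiter = RateLimiter(max_calls=5, period=10.0)

async def fetch_all(urls):
    for url in urls:
        await limiter.wait()  # throttle before each request
        # ... issue the request for `url` here ...
```

Combining the two — a fresh proxy per context plus a shared limiter — keeps per-IP request rates well under whatever threshold triggers blocking.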
### 2. Login Walls
Glassdoor prompts for login after viewing a few pages. You can dismiss the modal or, for larger scrapes, authenticate with a session:
```python
    # Method of the GlassdoorScraper class above
    async def dismiss_login_modal(self):
        """Dismiss Glassdoor login prompts, trying known close-button selectors."""
        selectors = [
            '[data-test="close-modal"]',
            ".modal_closeIcon",
            "button[aria-label='Close']",
        ]
        for selector in selectors:
            try:
                btn = await self.page.wait_for_selector(selector, timeout=2000)
                if btn:
                    await btn.click()
                    return True
            except Exception:
                continue
        return False
```
### 3. Fingerprinting
Glassdoor checks browser fingerprints. The `--disable-blink-features=AutomationControlled` flag and the `navigator.webdriver` override in our base class handle the basics. For production workloads, consider a stealth plugin or an undetected-chromedriver equivalent.
### 4. CAPTCHA Challenges
For high-volume scraping, you'll eventually hit CAPTCHAs. At that point, it's worth considering a managed solution rather than building CAPTCHA-solving infrastructure yourself.
## Production-Ready Alternative
Building and maintaining a Glassdoor scraper is significant ongoing work — selectors change, anti-bot measures evolve, and edge cases multiply. If you need reliable, production-grade data extraction, consider using a managed scraping platform.
Our Glassdoor Scraper on Apify handles all the complexity — proxy rotation, anti-bot evasion, automatic retries, and structured JSON output. It's ready to integrate into your pipeline with a simple API call:
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("cryptosignals/glassdoor-scraper").call(
    run_input={
        "searchQuery": "python developer",
        "location": "San Francisco",
        "maxResults": 100,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```
This is especially valuable when you need to focus on building your application logic rather than maintaining scraping infrastructure.
## Practical Use Cases
- **Salary Benchmarking Tools:** Build internal compensation-analysis tools that compare your company's pay ranges against market data. HR teams use this to stay competitive in hiring without overpaying.
- **Job Market Analysis:** Track hiring trends across industries, locations, and seniority levels. Identify which roles are growing, which are contracting, and where talent shortages exist.
- **Recruiting Intelligence:** Build tools that surface companies with low employee satisfaction scores — these companies likely have higher turnover and more receptive candidates.
- **Competitive Analysis Dashboards:** Monitor competitor reviews over time to identify cultural shifts, management changes, or emerging problems that might affect their talent pipeline.
- **Academic Research:** Labor economists and organizational behavior researchers use Glassdoor data to study wage transparency, review sentiment, and labor market dynamics at scale.
## Conclusion
Glassdoor scraping in 2026 requires browser automation, proxy rotation, and careful rate limiting. The data is valuable enough to justify the engineering investment — salary data alone powers an entire category of HR tech products.
Start with the code examples above for prototyping, and scale to a managed solution when you need reliability. Whatever you build, respect rate limits, cache aggressively, and focus on the data that actually drives your use case.
The employment data market is growing fast. The teams that can reliably access, structure, and analyze this data have a real competitive advantage in HR tech, recruiting, and workforce analytics.