Glassdoor remains one of the richest sources of employment data on the web — job listings, company reviews, salary ranges, interview experiences, and benefits information. For data engineers building HR tech platforms, recruiters creating competitive intelligence tools, or researchers analyzing labor market trends, programmatic access to this data is essential.
This guide covers the practical techniques for extracting Glassdoor data in 2026, including the challenges you'll face and production-ready code to get you started.
## Understanding Glassdoor's Data Structure
Before writing any code, it helps to understand what Glassdoor exposes and how it's organized.
**Job listings** are the most straightforward. Each listing includes title, company, location, salary estimate (when available), posting date, and a detailed description. Jobs are organized by search queries and filters — location, salary range, company size, and job type.

**Company reviews** are structured with an overall rating (1–5), sub-ratings (culture, work-life balance, compensation, management, career opportunities), pros/cons text, employment status, the reviewer's job title, and review date. Reviews are paginated — typically 10 per page.

**Salary data** includes job title, company, base pay range (low/median/high), total compensation, years of experience, and location. This is arguably the most valuable dataset Glassdoor offers.
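Before scraping, it helps to model these records as typed containers so every extraction step produces the same shape. A minimal sketch — the field names here are our own, not Glassdoor's schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Review:
    """One company review, fields optional because pages vary."""
    rating: Optional[float] = None
    title: Optional[str] = None
    pros: Optional[str] = None
    cons: Optional[str] = None
    date: Optional[str] = None

@dataclass
class SalaryRecord:
    """One salary row, keyed to the employer it came from."""
    job_title: str
    employer_id: str
    pay_low: Optional[int] = None
    pay_high: Optional[int] = None

# asdict() gives a plain dict, ready for JSON serialization
review = Review(rating=4.0, title="Great team", pros="Smart colleagues")
print(asdict(review)["rating"])  # 4.0
```

Dataclasses keep downstream code honest: a missing field is an explicit `None`, not a silent `KeyError`.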
Glassdoor URLs follow predictable patterns:
```text
# Job listings
https://www.glassdoor.com/Job/san-francisco-python-developer-jobs-SRCH_IL.0,13_IC1147401_KO14,30.htm

# Company reviews
https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm

# Salary data
https://www.glassdoor.com/Salary/Google-Salaries-E9079.htm
```
The employer ID (e.g., E9079 for Google) is the key linking entity across all data types.
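Because the employer ID is the common key, a small helper can derive the review and salary URLs from one ID. This assumes the simple `<Name>-<Type>-<EmployerID>.htm` pattern shown above; if the slug is slightly off, Glassdoor redirects to the canonical URL:

```python
BASE = "https://www.glassdoor.com"

def employer_urls(name: str, employer_id: str) -> dict:
    """Build reviews and salaries URLs for one employer ID.

    Slug handling is deliberately naive (spaces -> hyphens);
    Glassdoor canonicalizes via redirect if it doesn't match exactly.
    """
    slug = name.strip().replace(" ", "-")
    return {
        "reviews": f"{BASE}/Reviews/{slug}-Reviews-{employer_id}.htm",
        "salaries": f"{BASE}/Salary/{slug}-Salaries-{employer_id}.htm",
    }

urls = employer_urls("Google", "E9079")
print(urls["reviews"])  # https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm
```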
## Setting Up Your Scraping Environment
Glassdoor is a JavaScript-heavy application, so you'll need a browser automation approach for most data types. Here's the recommended stack:
```text
# requirements.txt
playwright==1.44.0
selectolax==0.3.21
httpx==0.27.0
```
Install and set up:
```bash
pip install playwright selectolax httpx
playwright install chromium
```
Here's the base scraper class:
```python
import asyncio
import random

from playwright.async_api import async_playwright


class GlassdoorScraper:
    def __init__(self, headless=True):
        self.headless = headless
        self.base_url = "https://www.glassdoor.com"

    async def init_browser(self):
        self.pw = await async_playwright().start()
        self.browser = await self.pw.chromium.launch(
            headless=self.headless,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ],
        )
        self.context = await self.browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
        )
        # Remove the navigator.webdriver flag that betrays automation
        await self.context.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
        )
        self.page = await self.context.new_page()

    async def random_delay(self, min_sec=1.5, max_sec=4.0):
        """Sleep a random interval to avoid a machine-like request cadence."""
        await asyncio.sleep(random.uniform(min_sec, max_sec))

    async def close(self):
        await self.browser.close()
        await self.pw.stop()
```
## Scraping Job Listings
Job listings are the easiest entry point. Glassdoor loads job data both in the HTML and via XHR/GraphQL requests. Intercepting those API responses is often more robust than parsing the DOM, so the method below captures them as a fallback while extracting the visible fields from the rendered job cards:
```python
    # Method of the GlassdoorScraper class above
    async def scrape_jobs(self, query, location, max_pages=5):
        jobs = []
        api_responses = []

        # Capture Glassdoor's GraphQL responses as they stream in --
        # useful as a fallback if the DOM selectors break
        async def handle_response(response):
            if "api-cloud" in response.url and response.status == 200:
                try:
                    api_responses.append(await response.json())
                except Exception:
                    pass

        self.page.on("response", handle_response)

        # Real search URLs also encode location IDs and keyword offsets;
        # this simplified form relies on Glassdoor redirecting to the
        # canonical URL. query/location should be pre-slugified (hyphens).
        search_url = (
            f"{self.base_url}/Job/{location}-{query}-jobs-"
            f"SRCH_KO0,{len(query)}.htm"
        )
        await self.page.goto(search_url, wait_until="networkidle")
        await self.random_delay()

        for page_num in range(max_pages):
            # Extract the fields we need from each rendered job card
            job_cards = await self.page.query_selector_all(
                '[data-test="jobListing"]'
            )
            for card in job_cards:
                title_el = await card.query_selector('[data-test="job-title"]')
                company_el = await card.query_selector('[data-test="emp-name"]')
                location_el = await card.query_selector('[data-test="emp-location"]')
                salary_el = await card.query_selector('[data-test="detailSalary"]')
                jobs.append({
                    "title": await title_el.inner_text() if title_el else None,
                    "company": await company_el.inner_text() if company_el else None,
                    "location": await location_el.inner_text() if location_el else None,
                    "salary": await salary_el.inner_text() if salary_el else None,
                })

            # Navigate to the next page, stopping at the last one
            next_btn = await self.page.query_selector(
                'button[data-test="pagination-next"]'
            )
            if not next_btn or not await next_btn.is_enabled():
                break
            await next_btn.click()
            await self.page.wait_for_load_state("networkidle")
            await self.random_delay(2.0, 5.0)

        return jobs
```
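Once `scrape_jobs` returns, persisting results as JSON Lines makes incremental runs easy to append, dedupe, and re-load. A small sketch (file path and record shape are illustrative):

```python
import json
import tempfile
from pathlib import Path

def save_jsonl(records, path):
    """Append dict records to a JSON Lines file, one object per line."""
    with Path(path).open("a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def load_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

path = Path(tempfile.mkdtemp()) / "jobs.jsonl"
save_jsonl([{"title": "Python Developer", "company": "Acme"}], path)
print(load_jsonl(path)[0]["title"])  # Python Developer
```

Append-only JSONL also means a crashed run loses at most one partially written line, not the whole dataset.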
## Extracting Company Reviews
Reviews require more careful handling because Glassdoor actively protects this data. You'll often need to dismiss login modals and handle lazy-loaded content:
```python
    # Method of the GlassdoorScraper class above
    async def scrape_reviews(self, employer_id, max_pages=10):
        reviews = []
        # Generic slug; Glassdoor redirects to the employer's canonical URL
        url = f"{self.base_url}/Reviews/Company-Reviews-{employer_id}.htm"
        await self.page.goto(url, wait_until="networkidle")
        await self.random_delay()

        # Dismiss any modal overlays before touching the page
        try:
            close_btn = await self.page.wait_for_selector(
                '[data-test="close-modal"], .modal_closeIcon',
                timeout=3000,
            )
            if close_btn:
                await close_btn.click()
        except Exception:
            pass

        for page_num in range(max_pages):
            review_elements = await self.page.query_selector_all(
                '[data-test="employerReview"]'
            )
            for el in review_elements:
                rating_el = await el.query_selector('[class*="ratingNumber"]')
                title_el = await el.query_selector('[data-test="review-details-title"]')
                pros_el = await el.query_selector('[data-test="review-text-pros"]')
                cons_el = await el.query_selector('[data-test="review-text-cons"]')
                date_el = await el.query_selector('[data-test="review-details-date"]')
                reviews.append({
                    "rating": await rating_el.inner_text() if rating_el else None,
                    "title": await title_el.inner_text() if title_el else None,
                    "pros": await pros_el.inner_text() if pros_el else None,
                    "cons": await cons_el.inner_text() if cons_el else None,
                    "date": await date_el.inner_text() if date_el else None,
                })

            # Paginate until the next button disappears or is disabled
            next_btn = await self.page.query_selector(
                'button[data-test="pagination-next"]'
            )
            if not next_btn or not await next_btn.is_enabled():
                break
            await next_btn.click()
            await self.page.wait_for_load_state("networkidle")
            await self.random_delay(3.0, 6.0)

        return reviews
```
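The `rating` field comes back as display text (e.g. `"4.0"`), so normalize it before analysis. A hedged sketch that tolerates missing or unparseable values:

```python
def summarize_ratings(reviews):
    """Convert rating strings to floats and compute an average.

    Skips records whose rating is missing or not a number, so partially
    scraped pages don't poison the aggregate.
    """
    values = []
    for r in reviews:
        try:
            values.append(float(str(r.get("rating", "")).strip()))
        except ValueError:
            continue
    return {
        "count": len(values),
        "average": round(sum(values) / len(values), 2) if values else None,
    }

sample = [{"rating": "4.0"}, {"rating": "3.5"}, {"rating": None}]
print(summarize_ratings(sample))  # {'count': 2, 'average': 3.75}
```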
## Collecting Salary Data
Salary data is the most commercially valuable dataset on Glassdoor. The data is partially rendered server-side, which can simplify extraction:
```python
    # Method of the GlassdoorScraper class above
    async def scrape_salaries(self, employer_id, max_pages=5):
        salaries = []
        # Generic slug; Glassdoor redirects to the employer's canonical URL
        url = f"{self.base_url}/Salary/Company-Salaries-{employer_id}.htm"
        await self.page.goto(url, wait_until="networkidle")
        await self.random_delay()

        for page_num in range(max_pages):
            salary_rows = await self.page.query_selector_all(
                '[data-test="salaries-list-item"]'
            )
            for row in salary_rows:
                title_el = await row.query_selector(
                    '[data-test="salaries-list-item-job-title"]'
                )
                pay_el = await row.query_selector(
                    '[data-test="salaries-list-item-salary-info"]'
                )
                salaries.append({
                    "job_title": await title_el.inner_text() if title_el else None,
                    "pay_range": await pay_el.inner_text() if pay_el else None,
                    "employer_id": employer_id,
                })

            next_btn = await self.page.query_selector(
                'button[data-test="pagination-next"]'
            )
            if not next_btn or not await next_btn.is_enabled():
                break
            await next_btn.click()
            await self.page.wait_for_load_state("networkidle")
            await self.random_delay(2.0, 5.0)

        return salaries
```
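The `pay_range` field is display text like `"$120K - $150K"`. Glassdoor's formats vary (hourly pay, single figures, other currencies), so any parser should fail soft. A sketch covering only the common `$<n>K` pattern:

```python
import re

def parse_pay_range(text):
    """Parse strings like '$120K - $150K' into integer dollar bounds.

    Returns (low, high); (None, None) for formats this sketch
    doesn't recognize, e.g. hourly rates or non-USD currencies.
    """
    if not text:
        return (None, None)
    matches = re.findall(r"\$(\d+(?:\.\d+)?)\s*K", text, flags=re.IGNORECASE)
    if not matches:
        return (None, None)
    values = [int(float(m) * 1000) for m in matches]
    if len(values) == 1:
        return (values[0], values[0])
    return (min(values), max(values))

print(parse_pay_range("$120K - $150K"))  # (120000, 150000)
```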
## Handling Anti-Bot Protection
Glassdoor uses several layers of protection. Here's what you'll encounter and how to handle each:
### 1. Rate Limiting
Glassdoor will throttle or block IPs that make too many requests. Space your requests and rotate proxies:
```python
import itertools

class ProxyRotator:
    """Cycle through a fixed proxy pool, one proxy per call."""

    def __init__(self, proxies):
        self.cycle = itertools.cycle(proxies)

    def next(self):
        return next(self.cycle)

# Usage with Playwright: give each new context a different proxy
proxy_rotator = ProxyRotator([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
])

context = await browser.new_context(
    proxy={"server": proxy_rotator.next()}
)
```
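Proxy rotation handles IP diversity; pacing handles volume. A minimal sliding-window limiter you could call before each request — the thresholds and jitter range are illustrative, not tuned to Glassdoor's actual limits:

```python
import asyncio
import random
import time

class RateLimiter:
    """Allow at most max_calls requests per period seconds, with jitter."""

    def __init__(self, max_calls=10, period=60.0):
        self.max_calls = max_calls
        self.period = period
        self.timestamps = []

    async def wait(self):
        now = time.monotonic()
        # Drop timestamps that have left the sliding window
        self.timestamps = [t for t in self.timestamps if now - t < self.period]
        if len(self.timestamps) >= self.max_calls:
            # Sleep until the oldest call ages out, plus random jitter
            sleep_for = self.period - (now - self.timestamps[0])
            await asyncio.sleep(sleep_for + random.uniform(0.1, 0.5))
        self.timestamps.append(time.monotonic())

limiter = RateLimiter(max_calls=5, period=10.0)

async def fetch_all(urls):
    for url in urls:
        await limiter.wait()  # throttle before each request
        # ... issue the request for `url` here ...
```

Combining the two — a fresh proxy per context plus a shared limiter — keeps per-IP request rates well under whatever threshold triggers blocking.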
### 2. Login Walls
Glassdoor prompts for login after viewing a few pages. You can dismiss the modal or, for larger scrapes, authenticate with a session:
```python
    # Method of the GlassdoorScraper class above
    async def dismiss_login_modal(self):
        """Dismiss Glassdoor login prompts, trying known close-button selectors."""
        selectors = [
            '[data-test="close-modal"]',
            ".modal_closeIcon",
            "button[aria-label='Close']",
        ]
        for selector in selectors:
            try:
                btn = await self.page.wait_for_selector(selector, timeout=2000)
                if btn:
                    await btn.click()
                    return True
            except Exception:
                continue
        return False
```
### 3. Fingerprinting
Glassdoor checks browser fingerprints. The `--disable-blink-features=AutomationControlled` flag and the `navigator.webdriver` override in our base class handle the basics. For production workloads, consider a stealth plugin or an undetected-chromedriver equivalent.
### 4. CAPTCHA Challenges
For high-volume scraping, you'll eventually hit CAPTCHAs. At that point, it's worth considering a managed solution rather than building CAPTCHA-solving infrastructure yourself.
## Production-Ready Alternative
Building and maintaining a Glassdoor scraper is significant ongoing work — selectors change, anti-bot measures evolve, and edge cases multiply. If you need reliable, production-grade data extraction, consider using a managed scraping platform.
Our Glassdoor Scraper on Apify handles all the complexity — proxy rotation, anti-bot evasion, automatic retries, and structured JSON output. It's ready to integrate into your pipeline with a simple API call:
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("cryptosignals/glassdoor-scraper").call(
    run_input={
        "searchQuery": "python developer",
        "location": "San Francisco",
        "maxResults": 100,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```
This is especially valuable when you need to focus on building your application logic rather than maintaining scraping infrastructure.
## Practical Use Cases
- **Salary Benchmarking Tools:** Build internal compensation-analysis tools that compare your company's pay ranges against market data. HR teams use this to stay competitive in hiring without overpaying.
- **Job Market Analysis:** Track hiring trends across industries, locations, and seniority levels. Identify which roles are growing, which are contracting, and where talent shortages exist.
- **Recruiting Intelligence:** Build tools that surface companies with low employee satisfaction scores — these companies likely have higher turnover and more receptive candidates.
- **Competitive Analysis Dashboards:** Monitor competitor reviews over time to identify cultural shifts, management changes, or emerging problems that might affect their talent pipeline.
- **Academic Research:** Labor economists and organizational behavior researchers use Glassdoor data to study wage transparency, review sentiment, and labor market dynamics at scale.
## Conclusion
Glassdoor scraping in 2026 requires browser automation, proxy rotation, and careful rate limiting. The data is valuable enough to justify the engineering investment — salary data alone powers an entire category of HR tech products.
Start with the code examples above for prototyping, and scale to a managed solution when you need reliability. Whatever you build, respect rate limits, cache aggressively, and focus on the data that actually drives your use case.
The employment data market is growing fast. The teams that can reliably access, structure, and analyze this data have a real competitive advantage in HR tech, recruiting, and workforce analytics.