Why Scrape Glassdoor?
Glassdoor holds a treasure trove of job market intelligence — salaries, reviews, interview questions, and company ratings. Whether you're building a job board aggregator, conducting labor market research, or tracking employer brand sentiment, Glassdoor data is incredibly valuable.
In this guide, I'll walk you through scraping Glassdoor using Python with Playwright and proxy rotation — the approach that actually works in 2026.
The Challenge
Glassdoor is one of the more difficult sites to scrape:
- Aggressive anti-bot detection — Cloudflare protection, fingerprinting, and behavioral analysis
- Login walls — Many pages require authentication to view full content
- Dynamic rendering — Heavy JavaScript that simple HTTP requests can't handle
- Rate limiting — Quick IP bans for suspicious patterns
Traditional requests + BeautifulSoup won't cut it here. You need browser automation with smart proxy rotation.
Setting Up Your Environment
pip install playwright
playwright install chromium
Basic Glassdoor Scraper with Playwright
```python
import asyncio
import json
import random
from urllib.parse import quote_plus

from playwright.async_api import async_playwright

async def scrape_glassdoor_jobs(search_term, location, max_pages=3):
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=['--disable-blink-features=AutomationControlled']
        )
        context = await browser.new_context(
            viewport={'width': 1920, 'height': 1080},
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        )
        page = await context.new_page()

        # Navigate to the job search (query values must be URL-encoded)
        url = (
            f'https://www.glassdoor.com/Job/jobs.htm'
            f'?sc.keyword={quote_plus(search_term)}&locT=C&locKeyword={quote_plus(location)}'
        )
        await page.goto(url, wait_until='networkidle')

        for page_num in range(max_pages):
            # Wait for job listings to load
            await page.wait_for_selector('[data-test="jobListing"]', timeout=15000)

            # Extract job cards
            jobs = await page.query_selector_all('[data-test="jobListing"]')
            for job in jobs:
                title = await job.query_selector('[data-test="job-title"]')
                company = await job.query_selector('[data-test="emp-name"]')
                location_el = await job.query_selector('[data-test="emp-location"]')
                salary = await job.query_selector('[data-test="detailSalary"]')
                results.append({
                    'title': await title.inner_text() if title else None,
                    'company': await company.inner_text() if company else None,
                    'location': await location_el.inner_text() if location_el else None,
                    'salary': await salary.inner_text() if salary else None,
                })

            # Random delay between pages
            await asyncio.sleep(random.uniform(2, 5))

            # Click through to the next page, or stop if there isn't one
            next_btn = await page.query_selector('[data-test="pagination-next"]')
            if not next_btn:
                break
            await next_btn.click()
            await page.wait_for_load_state('networkidle')

        await browser.close()
    return results

# Run the scraper
jobs = asyncio.run(scrape_glassdoor_jobs('python developer', 'San Francisco'))
print(f'Found {len(jobs)} jobs')
for job in jobs[:5]:
    print(json.dumps(job, indent=2))
```
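The scraper above returns salary as raw display text. If you want to sort or filter on it, you'll need to normalize it into numbers. Here's a minimal sketch — the string format (`'$120K - $150K (Employer est.)'`) is an assumption about how Glassdoor renders salaries, which varies, so anything unparseable falls back to `(None, None)`:

```python
import re

def parse_salary_range(salary_text):
    """Parse a display string like '$120K - $150K (Employer est.)'
    into (min, max) integers. The format is an assumption -- Glassdoor
    varies it, so unparseable input returns (None, None)."""
    if not salary_text:
        return (None, None)
    # Match dollar amounts, optionally suffixed with K for thousands
    matches = re.findall(r'\$(\d+(?:,\d{3})*)(K?)', salary_text)
    values = []
    for amount, k in matches:
        value = int(amount.replace(',', ''))
        if k:
            value *= 1000
        values.append(value)
    if not values:
        return (None, None)
    return (min(values), max(values))
```

A single-value string like `'$95,000'` yields the same number for both ends of the range, which keeps downstream filtering simple.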
Adding Proxy Rotation
Without proxies, you'll get blocked fast. Here's how to integrate rotating proxies:
```python
async def create_proxy_browser(playwright, proxy_url):
    """Launch a browser routed through the given proxy."""
    browser = await playwright.chromium.launch(
        headless=True,
        proxy={
            'server': proxy_url,
            'username': 'your_username',
            'password': 'your_password'
        }
    )
    return browser

# Rotate through a proxy pool
PROXY_POOL = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

def get_random_proxy():
    return random.choice(PROXY_POOL)
```
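Random choice works, but it can hand you the same (possibly banned) proxy twice in a row. A round-robin rotator that drops repeatedly failing proxies spreads load more evenly. This is a sketch of the idea in plain Python — the `ProxyRotator` class and its failure threshold are my own construction, not a Playwright feature:

```python
from itertools import cycle

class ProxyRotator:
    """Round-robin proxy rotation with simple failure tracking.
    Proxies that fail max_failures times are skipped thereafter."""

    def __init__(self, proxies, max_failures=3):
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures
        self._cycle = cycle(proxies)

    def next(self):
        # Advance the cycle, skipping proxies past the failure threshold
        for _ in range(len(self.failures)):
            proxy = next(self._cycle)
            if self.failures[proxy] < self.max_failures:
                return proxy
        raise RuntimeError('All proxies exhausted')

    def mark_failed(self, proxy):
        self.failures[proxy] += 1
```

Call `rotator.next()` before each browser launch and `rotator.mark_failed(proxy)` whenever a navigation gets blocked or times out.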
For reliable residential proxies, I recommend ThorData — they offer rotating residential IPs that work well with Glassdoor's anti-bot measures and have competitive pricing for scraping workloads.
Scraping Company Reviews
```python
async def scrape_reviews(company_url, max_reviews=50):
    reviews = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(company_url)

        while len(reviews) < max_reviews:
            review_cards = await page.query_selector_all('.review-details')
            for card in review_cards:
                rating = await card.query_selector('.starRating')
                title_el = await card.query_selector('.reviewLink')
                pros = await card.query_selector('[data-test="pros"]')
                cons = await card.query_selector('[data-test="cons"]')
                date_el = await card.query_selector('.subtle')
                reviews.append({
                    'rating': await rating.get_attribute('aria-label') if rating else None,
                    'title': await title_el.inner_text() if title_el else None,
                    'pros': await pros.inner_text() if pros else None,
                    'cons': await cons.inner_text() if cons else None,
                    'date': await date_el.inner_text() if date_el else None,
                })

            # Paginate until the next button disappears
            next_btn = await page.query_selector('[data-test="pagination-next"]')
            if not next_btn:
                break
            await next_btn.click()
            await asyncio.sleep(random.uniform(3, 6))

        await browser.close()
    return reviews[:max_reviews]
```
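The rating comes back as an `aria-label` string rather than a number. Converting it to a float makes averaging and filtering trivial. The label wording (`'4.0 out of 5 stars'`) is an assumption about Glassdoor's markup, so this helper returns `None` for anything it can't parse:

```python
import re

def parse_rating(aria_label):
    """Extract the numeric rating from an aria-label such as
    '4.0 out of 5 stars'. The label text is an assumption about
    Glassdoor's markup; unparseable input returns None."""
    if not aria_label:
        return None
    match = re.search(r'(\d+(?:\.\d+)?)', aria_label)
    return float(match.group(1)) if match else None
```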
The Easier Way: Pre-Built Scrapers
Building and maintaining a Glassdoor scraper is time-consuming — selectors change, anti-bot measures evolve, and login walls shift. If you need production-ready data collection, check out the Glassdoor Scraper on Apify. It handles all the complexity — proxy rotation, CAPTCHA solving, and data extraction — so you can focus on what to do with the data.
Best Practices
- Respect rate limits — Add random delays between requests (2-8 seconds)
- Rotate user agents — Don't use the same UA for every request
- Use residential proxies — Datacenter IPs get blocked quickly
- Handle failures gracefully — Implement retry logic with exponential backoff
- Cache responses — Don't re-scrape pages you've already processed
- Check robots.txt — Be aware of the site's scraping policies
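The retry-with-backoff practice is worth sketching concretely. This wrapper takes any async callable (a page fetch, for instance), retries it up to four times, and waits roughly 2s, 4s, then 8s between attempts, plus random jitter so retries from concurrent tasks don't synchronize. The function name and defaults are my own — adapt them to your scraper:

```python
import asyncio
import random

async def fetch_with_retry(fetch, max_attempts=4, base_delay=2.0):
    """Retry an async callable with exponential backoff and jitter.
    Delays grow as base_delay * 2**attempt, plus up to 1s of jitter."""
    for attempt in range(max_attempts):
        try:
            return await fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts -- surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
```

Usage: `await fetch_with_retry(lambda: page.goto(url, wait_until='networkidle'))`.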
Data Storage
```python
import json

import pandas as pd

def save_results(jobs, filename='glassdoor_jobs.csv'):
    df = pd.DataFrame(jobs)
    df.to_csv(filename, index=False)
    print(f'Saved {len(df)} records to {filename}')

def save_to_json(jobs, filename='glassdoor_jobs.json'):
    with open(filename, 'w') as f:
        json.dump(jobs, f, indent=2)
```
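To honor the "cache responses" practice, persist a record of what you've already scraped so repeat runs skip known listings. A minimal sketch, assuming title + company is a good-enough dedup key (the scraped data has no stable Glassdoor job ID); the filenames and helper names are my own:

```python
import json
import os

def load_seen(path='seen_jobs.json'):
    """Load the set of job keys scraped in previous runs."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def filter_new_jobs(jobs, seen, path='seen_jobs.json'):
    """Drop jobs already in `seen`, keyed on title + company (an
    assumed dedup key), then persist the updated set to disk."""
    new_jobs = []
    for job in jobs:
        key = f"{job.get('title')}|{job.get('company')}"
        if key not in seen:
            seen.add(key)
            new_jobs.append(job)
    with open(path, 'w') as f:
        json.dump(sorted(seen), f)
    return new_jobs
```

On each run: `new = filter_new_jobs(jobs, load_seen())` — only `new` needs further processing or storage.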
Conclusion
Scraping Glassdoor in 2026 requires browser automation (Playwright), rotating proxies (ThorData works great for this), and patience with anti-bot measures. For production workloads, a managed solution like the Glassdoor Scraper on Apify saves significant development and maintenance time.
Happy scraping!