Why Scrape Glassdoor?
Glassdoor holds a treasure trove of job market intelligence — salaries, reviews, interview questions, and company ratings. Whether you're building a job board aggregator, conducting labor market research, or tracking employer brand sentiment, Glassdoor data is incredibly valuable.
In this guide, I'll walk you through scraping Glassdoor using Python with Playwright and proxy rotation — the approach that actually works in 2026.
The Challenge
Glassdoor is one of the more difficult sites to scrape:
- Aggressive anti-bot detection — Cloudflare protection, fingerprinting, and behavioral analysis
- Login walls — Many pages require authentication to view full content
- Dynamic rendering — Heavy JavaScript that simple HTTP requests can't handle
- Rate limiting — Quick IP bans for suspicious patterns
Traditional requests + BeautifulSoup won't cut it here. You need browser automation with smart proxy rotation.
Setting Up Your Environment
pip install playwright
playwright install chromium
Basic Glassdoor Scraper with Playwright
```python
import asyncio
import json
import random
from urllib.parse import quote_plus

from playwright.async_api import async_playwright

async def scrape_glassdoor_jobs(search_term, location, max_pages=3):
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=['--disable-blink-features=AutomationControlled']
        )
        context = await browser.new_context(
            viewport={'width': 1920, 'height': 1080},
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        )
        page = await context.new_page()

        # Navigate to the job search (query values must be URL-encoded)
        url = (
            f'https://www.glassdoor.com/Job/jobs.htm'
            f'?sc.keyword={quote_plus(search_term)}&locT=C&locKeyword={quote_plus(location)}'
        )
        await page.goto(url, wait_until='networkidle')

        for page_num in range(max_pages):
            # Wait for job listings to load
            await page.wait_for_selector('[data-test="jobListing"]', timeout=15000)

            # Extract job cards
            jobs = await page.query_selector_all('[data-test="jobListing"]')
            for job in jobs:
                title = await job.query_selector('[data-test="job-title"]')
                company = await job.query_selector('[data-test="emp-name"]')
                location_el = await job.query_selector('[data-test="emp-location"]')
                salary = await job.query_selector('[data-test="detailSalary"]')
                results.append({
                    'title': await title.inner_text() if title else None,
                    'company': await company.inner_text() if company else None,
                    'location': await location_el.inner_text() if location_el else None,
                    'salary': await salary.inner_text() if salary else None,
                })

            # Random delay between pages
            await asyncio.sleep(random.uniform(2, 5))

            # Click through to the next page, or stop if there isn't one
            next_btn = await page.query_selector('[data-test="pagination-next"]')
            if not next_btn:
                break
            await next_btn.click()
            await page.wait_for_load_state('networkidle')

        await browser.close()
    return results

# Run the scraper
jobs = asyncio.run(scrape_glassdoor_jobs('python developer', 'San Francisco'))
print(f'Found {len(jobs)} jobs')
for job in jobs[:5]:
    print(json.dumps(job, indent=2))
```
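The scraper above returns salary as raw display text. If you want to sort or filter on it, you'll need to normalize it into numbers. Here's a minimal sketch — the string format (`'$120K - $150K (Employer est.)'`) is an assumption about how Glassdoor renders salaries, which varies, so anything unparseable falls back to `(None, None)`:

```python
import re

def parse_salary_range(salary_text):
    """Parse a display string like '$120K - $150K (Employer est.)'
    into (min, max) integers. The format is an assumption -- Glassdoor
    varies it, so unparseable input returns (None, None)."""
    if not salary_text:
        return (None, None)
    # Match dollar amounts, optionally suffixed with K for thousands
    matches = re.findall(r'\$(\d+(?:,\d{3})*)(K?)', salary_text)
    values = []
    for amount, k in matches:
        value = int(amount.replace(',', ''))
        if k:
            value *= 1000
        values.append(value)
    if not values:
        return (None, None)
    return (min(values), max(values))
```

A single-value string like `'$95,000'` yields the same number for both ends of the range, which keeps downstream filtering simple.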
Adding Proxy Rotation
Without proxies, you'll get blocked fast. Here's how to integrate rotating proxies:
```python
async def create_proxy_browser(playwright, proxy_url):
    """Launch a browser routed through the given proxy."""
    browser = await playwright.chromium.launch(
        headless=True,
        proxy={
            'server': proxy_url,
            'username': 'your_username',
            'password': 'your_password'
        }
    )
    return browser

# Rotate through a proxy pool
PROXY_POOL = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

def get_random_proxy():
    return random.choice(PROXY_POOL)
```
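Random choice works, but it can hand you the same (possibly banned) proxy twice in a row. A round-robin rotator that drops repeatedly failing proxies spreads load more evenly. This is a sketch of the idea in plain Python — the `ProxyRotator` class and its failure threshold are my own construction, not a Playwright feature:

```python
from itertools import cycle

class ProxyRotator:
    """Round-robin proxy rotation with simple failure tracking.
    Proxies that fail max_failures times are skipped thereafter."""

    def __init__(self, proxies, max_failures=3):
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures
        self._cycle = cycle(proxies)

    def next(self):
        # Advance the cycle, skipping proxies past the failure threshold
        for _ in range(len(self.failures)):
            proxy = next(self._cycle)
            if self.failures[proxy] < self.max_failures:
                return proxy
        raise RuntimeError('All proxies exhausted')

    def mark_failed(self, proxy):
        self.failures[proxy] += 1
```

Call `rotator.next()` before each browser launch and `rotator.mark_failed(proxy)` whenever a navigation gets blocked or times out.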
For reliable residential proxies, I recommend ThorData — they offer rotating residential IPs that work well with Glassdoor's anti-bot measures and have competitive pricing for scraping workloads.
Scraping Company Reviews
```python
async def scrape_reviews(company_url, max_reviews=50):
    reviews = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(company_url)

        while len(reviews) < max_reviews:
            review_cards = await page.query_selector_all('.review-details')
            for card in review_cards:
                rating = await card.query_selector('.starRating')
                title_el = await card.query_selector('.reviewLink')
                pros = await card.query_selector('[data-test="pros"]')
                cons = await card.query_selector('[data-test="cons"]')
                date_el = await card.query_selector('.subtle')
                reviews.append({
                    'rating': await rating.get_attribute('aria-label') if rating else None,
                    'title': await title_el.inner_text() if title_el else None,
                    'pros': await pros.inner_text() if pros else None,
                    'cons': await cons.inner_text() if cons else None,
                    'date': await date_el.inner_text() if date_el else None,
                })

            # Paginate until the next button disappears
            next_btn = await page.query_selector('[data-test="pagination-next"]')
            if not next_btn:
                break
            await next_btn.click()
            await asyncio.sleep(random.uniform(3, 6))

        await browser.close()
    return reviews[:max_reviews]
```
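The rating comes back as an `aria-label` string rather than a number. Converting it to a float makes averaging and filtering trivial. The label wording (`'4.0 out of 5 stars'`) is an assumption about Glassdoor's markup, so this helper returns `None` for anything it can't parse:

```python
import re

def parse_rating(aria_label):
    """Extract the numeric rating from an aria-label such as
    '4.0 out of 5 stars'. The label text is an assumption about
    Glassdoor's markup; unparseable input returns None."""
    if not aria_label:
        return None
    match = re.search(r'(\d+(?:\.\d+)?)', aria_label)
    return float(match.group(1)) if match else None
```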
The Easier Way: Pre-Built Scrapers
Building and maintaining a Glassdoor scraper is time-consuming — selectors change, anti-bot measures evolve, and login walls shift. If you need production-ready data collection, check out the Glassdoor Scraper on Apify. It handles all the complexity — proxy rotation, CAPTCHA solving, and data extraction — so you can focus on what to do with the data.
Best Practices
- Respect rate limits — Add random delays between requests (2-8 seconds)
- Rotate user agents — Don't use the same UA for every request
- Use residential proxies — Datacenter IPs get blocked quickly
- Handle failures gracefully — Implement retry logic with exponential backoff
- Cache responses — Don't re-scrape pages you've already processed
- Check robots.txt — Be aware of the site's scraping policies
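The retry-with-backoff practice is worth sketching concretely. This wrapper takes any async callable (a page fetch, for instance), retries it up to four times, and waits roughly 2s, 4s, then 8s between attempts, plus random jitter so retries from concurrent tasks don't synchronize. The function name and defaults are my own — adapt them to your scraper:

```python
import asyncio
import random

async def fetch_with_retry(fetch, max_attempts=4, base_delay=2.0):
    """Retry an async callable with exponential backoff and jitter.
    Delays grow as base_delay * 2**attempt, plus up to 1s of jitter."""
    for attempt in range(max_attempts):
        try:
            return await fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts -- surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
```

Usage: `await fetch_with_retry(lambda: page.goto(url, wait_until='networkidle'))`.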
Data Storage
```python
import json

import pandas as pd

def save_results(jobs, filename='glassdoor_jobs.csv'):
    df = pd.DataFrame(jobs)
    df.to_csv(filename, index=False)
    print(f'Saved {len(df)} records to {filename}')

def save_to_json(jobs, filename='glassdoor_jobs.json'):
    with open(filename, 'w') as f:
        json.dump(jobs, f, indent=2)
```
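To honor the "cache responses" practice, persist a record of what you've already scraped so repeat runs skip known listings. A minimal sketch, assuming title + company is a good-enough dedup key (the scraped data has no stable Glassdoor job ID); the filenames and helper names are my own:

```python
import json
import os

def load_seen(path='seen_jobs.json'):
    """Load the set of job keys scraped in previous runs."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def filter_new_jobs(jobs, seen, path='seen_jobs.json'):
    """Drop jobs already in `seen`, keyed on title + company (an
    assumed dedup key), then persist the updated set to disk."""
    new_jobs = []
    for job in jobs:
        key = f"{job.get('title')}|{job.get('company')}"
        if key not in seen:
            seen.add(key)
            new_jobs.append(job)
    with open(path, 'w') as f:
        json.dump(sorted(seen), f)
    return new_jobs
```

On each run: `new = filter_new_jobs(jobs, load_seen())` — only `new` needs further processing or storage.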
Conclusion
Scraping Glassdoor in 2026 requires browser automation (Playwright), rotating proxies (ThorData works great for this), and patience with anti-bot measures. For production workloads, a managed solution like the Glassdoor Scraper on Apify saves significant development and maintenance time.
Happy scraping!