DEV Community

agenthustler

How to Scrape Glassdoor Without Getting Blocked

Glassdoor is one of the most valuable sources for job market data, company reviews, and salary information. However, it's also one of the most challenging sites to scrape. Here's how to do it reliably.

Why Glassdoor is Hard to Scrape

Glassdoor uses several anti-bot measures:

  • Login walls for most content
  • Cloudflare protection
  • Dynamic JavaScript rendering
  • Aggressive rate limiting
  • CAPTCHA challenges

The Right Approach: Playwright + Stealth

Plain HTTP clients get blocked almost immediately; a real browser with its automation fingerprints masked is far more reliable. Install Playwright and its Chromium build:

pip install playwright
playwright install chromium

Setting Up a Stealth Browser

from playwright.sync_api import sync_playwright
import random, time

def create_stealth_browser():
    pw = sync_playwright().start()
    browser = pw.chromium.launch(
        headless=True,
        args=["--disable-blink-features=AutomationControlled", "--no-sandbox"]
    )
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36",
        locale="en-US"
    )
    return pw, browser, context

def human_delay():
    time.sleep(random.uniform(2, 5))

Scraping Job Listings

from urllib.parse import quote_plus

def scrape_glassdoor_jobs(query, max_pages=3):
    pw, browser, context = create_stealth_browser()
    page = context.new_page()
    jobs = []

    try:
        # locT=N&locId=1 scopes the search to the United States
        search_url = f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}&locT=N&locId=1"
        page.goto(search_url, wait_until="networkidle")
        human_delay()

        for page_num in range(max_pages):
            page.wait_for_selector('[data-test="jobListing"]', timeout=10000)
            listings = page.query_selector_all('[data-test="jobListing"]')

            for listing in listings:
                title_el = listing.query_selector('[data-test="job-title"]')
                company_el = listing.query_selector('[data-test="emp-name"]')
                location_el = listing.query_selector('[data-test="emp-location"]')

                jobs.append({
                    "title": title_el.inner_text() if title_el else "",
                    "company": company_el.inner_text() if company_el else "",
                    "location": location_el.inner_text() if location_el else ""
                })

            # The next button may be present but disabled on the last page
            next_btn = page.query_selector('[data-test="pagination-next"]')
            if next_btn and next_btn.is_enabled():
                next_btn.click()
                human_delay()
            else:
                break
    finally:
        browser.close()
        pw.stop()
    return jobs

Extracting Salary Data

def scrape_salaries(company_slug):
    pw, browser, context = create_stealth_browser()
    page = context.new_page()
    salaries = []

    try:
        url = f"https://www.glassdoor.com/Salary/{company_slug}-Salaries.htm"
        page.goto(url, wait_until="networkidle")
        human_delay()

        rows = page.query_selector_all('[data-test="salaries-list-item"]')
        for row in rows:
            title = row.query_selector('[data-test="salary-title"]')
            pay = row.query_selector('[data-test="salary-amount"]')
            if title and pay:
                salaries.append({
                    "job_title": title.inner_text(),
                    "salary": pay.inner_text()
                })
    finally:
        browser.close()
        pw.stop()
    return salaries

Using Proxy Rotation

Glassdoor is aggressive about blocking IPs. Using a proxy service is essential. ScraperAPI handles IP rotation and JavaScript rendering:

import requests

def scrape_via_proxy(url):
    params = {
        "api_key": "YOUR_KEY",
        "url": url,
        "render": "true",
        "country_code": "us"
    }
    response = requests.get("http://api.scraperapi.com", params=params, timeout=60)
    response.raise_for_status()
    return response.text

For residential proxy rotation, ThorData provides IPs that look like real users, which is critical for sites with strong anti-bot measures.

Best Practices

  1. Rate limit aggressively — 1 request every 3-5 seconds minimum
  2. Rotate user agents — maintain a pool of 20+ realistic user agent strings
  3. Use sessions wisely — don't create a new session for every request
  4. Handle CAPTCHAs gracefully — back off when you encounter them
  5. Cache results — don't re-scrape data you already have
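Points 1, 2, and 4 can be sketched in a few lines. The user-agent pool below is illustrative (a real pool should hold 20+ current browser strings), and the backoff numbers are reasonable defaults, not Glassdoor-specific thresholds:

```python
import random

# Illustrative pool -- maintain 20+ real, current UA strings in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

def pick_user_agent():
    """Rotate user agents by picking one at random per session."""
    return random.choice(USER_AGENTS)

def backoff_delay(attempt, base=3.0, cap=120.0):
    """Exponential backoff with jitter for CAPTCHA/429 responses:
    attempt 0 -> ~3s, doubling each retry, capped at ~2 minutes."""
    delay = min(cap, base * (2 ** attempt))
    return delay * random.uniform(0.5, 1.5)
```

Call `backoff_delay(attempt)` inside your retry loop whenever a CAPTCHA page or 429 status comes back, and reset `attempt` to zero after a successful request.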

Monitoring Your Scrapers

Track your scraper's performance with ScrapeOps. Monitor success rates, response times, and detect when Glassdoor changes its anti-bot measures.
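If you want a dependency-free starting point before wiring up a hosted dashboard, a rolling success-rate tracker takes only a few lines. This is a generic sketch, not the ScrapeOps API:

```python
from collections import deque

class ScrapeMonitor:
    """Track success rate and response times over a sliding window of
    recent requests; a sustained drop in success rate usually means the
    site's anti-bot measures changed."""

    def __init__(self, window=100):
        self.results = deque(maxlen=window)  # (ok, elapsed_seconds) pairs

    def record(self, ok, elapsed):
        self.results.append((ok, elapsed))

    def success_rate(self):
        if not self.results:
            return 1.0
        return sum(1 for ok, _ in self.results if ok) / len(self.results)

    def avg_response_time(self):
        if not self.results:
            return 0.0
        return sum(t for _, t in self.results) / len(self.results)
```

Call `record()` after every request and alert (or back off) when `success_rate()` dips below a threshold you choose.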

Legal Considerations

Always check Glassdoor's Terms of Service before scraping. Use the data for personal research and analysis. Don't republish scraped content or use it for competitive intelligence without proper legal review.

Conclusion

Scraping Glassdoor requires patience and the right tools. Combine browser automation with proxy rotation, add human-like delays, and always respect the site's resources. The salary and review data is incredibly valuable for job market research when collected responsibly.
