AlterLab

Posted on • Originally published at alterlab.io

# How to Scrape Glassdoor: Complete Guide for 2026

Glassdoor exposes salary data, company reviews, and job listings that are genuinely useful for compensation benchmarking, recruiting analysis, and labor market research. The catch: Glassdoor runs Cloudflare, gates salary data behind authentication, and renders all meaningful content client-side with React. This guide cuts through those obstacles with working Python code.

## Why Scrape Glassdoor?

Three use cases that justify the engineering effort:

Compensation benchmarking — HR teams and SaaS products aggregate salary ranges by role, level, location, and company size. Glassdoor's crowdsourced compensation data is one of the richest publicly accessible sources for this kind of analysis. Refreshing it weekly catches market shifts before they show up in annual survey reports.

Competitive talent intelligence — Track hiring velocity at competitors. Which roles are they posting? How quickly are positions closing? Job listing volume is a reliable leading indicator of engineering and product priorities six to nine months out.

Employer brand monitoring — Tracking review sentiment over time — overall ratings, CEO approval, interview difficulty scores — gives recruiting teams early warning of culture problems before they surface as churn events. Companies also benchmark their own standing against direct competitors.

## Anti-Bot Challenges on glassdoor.com

Glassdoor deploys several overlapping protections that make DIY scraping expensive to maintain:

Cloudflare WAF and Bot Management — Glassdoor sits behind Cloudflare's bot management layer. A standard Python requests call receives a JS challenge page requiring a valid cf_clearance cookie before any real HTML is served. This blocks virtually every naive scraper immediately.
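A cheap sanity check before any parsing step is to detect whether a response is a Cloudflare interstitial rather than real content. A minimal sketch — the marker strings are common in Cloudflare challenge pages but are heuristics, so treat a match as "probably blocked" rather than proof:

```python title="detect_challenge.py"
def looks_like_cloudflare_challenge(html: str) -> bool:
    """Heuristic: does this HTML look like a Cloudflare JS challenge
    page instead of real Glassdoor content?"""
    markers = (
        "just a moment...",          # typical challenge page <title>
        "cf-browser-verification",   # legacy challenge container id
        "challenge-platform",        # challenge script path fragment
    )
    lowered = html.lower()
    return any(marker in lowered for marker in markers)
```

Run it on every raw fetch; a positive result means you need the bypass layer (or a fresh `cf_clearance` cookie), not a selector fix.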

Login wall for salary data — Salary ranges and detailed compensation breakdowns are gated behind authentication. Unauthenticated sessions see truncated results or get redirected to a signup modal. Full data access requires managing authenticated sessions with valid cookies.

Client-side rendering — Job listings, reviews, and salary cards are all React components. The initial HTML response from Glassdoor's server is a near-empty shell. You need a JavaScript runtime to execute the page and produce actual content.

Browser fingerprinting and behavioral detection — Glassdoor combines static browser fingerprinting with behavioral signals (scroll cadence, mouse movement, click timing) to identify headless browsers. Playwright and Puppeteer with default configurations are reliably flagged within a few page loads.

Maintaining your own bypass stack — refreshing cf_clearance cookies, managing residential proxy pools, spoofing browser fingerprints — is a real ongoing engineering commitment. AlterLab's Anti-bot bypass API handles all of this at the infrastructure level, so your scraping code stays focused on data extraction.

## Quick Start with AlterLab API

Install the SDK and you can make your first Glassdoor request in under a minute. See the getting started guide for full environment setup, including API key management and optional async configuration.

```bash title="Terminal"
pip install alterlab beautifulsoup4 lxml
```

```python title="scrape_glassdoor.py" {4-11}
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

# Scrape a Glassdoor job search results page
response = client.scrape(
    "https://www.glassdoor.com/Job/python-developer-jobs-SRCH_KO0,16.htm",
    render_js=True,                          # Required: Glassdoor is a React SPA
    wait_for="[data-test='jobListing']",     # Wait for job cards before returning
)

soup = BeautifulSoup(response.html, "html.parser")
job_cards = soup.select("[data-test='jobListing']")
print(f"Found {len(job_cards)} job listings")
```

The equivalent cURL call for testing from a shell or integrating with non-Python pipelines:

```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/scrape \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://www.glassdoor.com/Job/python-developer-jobs-SRCH_KO0,16.htm",
"render_js": true,
"wait_for": "[data-test=\"jobListing\"]"
}'
```

## Extracting Structured Data

With fully rendered HTML in hand, here is how to pull the most useful data points from Glassdoor's DOM.

### Job Listings

Glassdoor uses `data-test` attributes on stable semantic elements — always prefer these over generated class names, which change with every React build deployment.



```python title="parse_jobs.py" {9-22}
from bs4 import BeautifulSoup

def parse_job_listings(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    jobs = []

    for card in soup.select("[data-test='jobListing']"):
        def text(selector):
            el = card.select_one(selector)
            return el.get_text(strip=True) if el else None

        jobs.append({
            "title":    text("[data-test='job-title']"),
            "company":  text("[data-test='employer-name']"),
            "location": text("[data-test='emp-location']"),
            "salary":   text("[data-test='detailSalary']"),
            "rating":   text("[data-test='rating']"),
            "age":      text("[data-test='job-age']"),
        })

    return jobs
```

### Company Reviews

Review pages are paginated at 10 entries per page. The _IP{n} path segment in the URL controls the page number.

```python title="parse_reviews.py" {6-22}
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

def scrape_company_reviews(company_slug: str, pages: int = 5) -> list[dict]:
    """
    company_slug: e.g. 'Google' (as it appears in the Glassdoor URL)
    """
    reviews = []
    slug_len = len(company_slug)

    for page in range(1, pages + 1):
        url = (
            f"https://www.glassdoor.com/Reviews/{company_slug}-reviews"
            f"-SRCH_KE0,{slug_len}_IP{page}.htm"
        )
        response = client.scrape(url, render_js=True, wait_for="[data-test='review']")
        soup = BeautifulSoup(response.html, "html.parser")

        for review in soup.select("[data-test='review']"):
            def text(selector):
                el = review.select_one(selector)
                return el.get_text(strip=True) if el else None

            reviews.append({
                "headline": text("[data-test='review-title']"),
                "rating":   text("[data-test='overall-rating']"),
                "pros":     text("[data-test='pros']"),
                "cons":     text("[data-test='cons']"),
                "date":     text("[data-test='review-date']"),
                "role":     text("[data-test='author-jobTitle']"),
            })

    return reviews
```



### Salary Data

Salary pages require an authenticated session. Pass the `JSESSIONID` and `tguid` cookies obtained from a logged-in browser profile. The API accepts a `headers` dict for this purpose:



```python title="parse_salaries.py" {5-12}
import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://www.glassdoor.com/Salaries/software-engineer-salary-SRCH_KO0,17.htm",
    render_js=True,
    headers={
        "Cookie": "JSESSIONID=YOUR_SESSION_ID; tguid=YOUR_TGUID"
    },
    wait_for="[data-test='salaryRow']",
)
```

Key selectors once authenticated: `[data-test='salaryRow']` for each salary entry, `[data-test='salary-estimate']` for the reported range, and `[data-test='total-compensation']` for the total comp breakdown.
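Putting those selectors together, a parse function in the same shape as the job-listing parser above — a sketch, assuming the rendered page exposes these `data-test` attributes:

```python title="parse_salary_rows.py"
from bs4 import BeautifulSoup

def parse_salary_rows(html: str) -> list[dict]:
    """Extract salary entries from a rendered, authenticated salary page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []

    for row in soup.select("[data-test='salaryRow']"):
        def text(selector):
            el = row.select_one(selector)
            return el.get_text(strip=True) if el else None

        rows.append({
            "estimate":   text("[data-test='salary-estimate']"),
            "total_comp": text("[data-test='total-compensation']"),
        })

    return rows
```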

## Common Pitfalls

Per-IP rate limiting — Glassdoor throttles at the IP level, not just by User-Agent. Exceeding roughly 25–30 requests per minute from a single IP triggers 429 responses or silent result degradation, where fewer listings are returned without any error signal. Distributed requests across rotating proxies are required for sustained collection.
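One way to enforce that budget client-side is a sliding-window check per IP. A minimal sketch, using the rough 25-requests-per-minute figure above as the default:

```python title="throttle.py"
def required_delay(request_times: list[float], now: float,
                   limit: int = 25, window: float = 60.0) -> float:
    """Seconds to wait before the next request so that no more than
    `limit` requests land within any `window`-second span.

    request_times: monotonic timestamps of prior requests from this IP.
    """
    recent = [t for t in request_times if now - t < window]
    if len(recent) < limit:
        return 0.0
    # At the limit: wait until the oldest in-window request ages out.
    return window - (now - min(recent))
```

Call it with `time.monotonic()` before each request and `time.sleep()` for the returned amount; per proxy IP, keep a separate timestamp list.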

Session expiry on gated content — Glassdoor sessions expire within a few hours. For pipelines that scrape salary or authenticated review data, implement cookie refresh logic. Detect redirects to /profile/login as the signal that your session has expired and re-authenticate before continuing.
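The detection itself is a one-liner once you have the post-redirect URL. A sketch — it assumes your HTTP client or scraping response exposes the final URL after redirects (the attribute name varies by client):

```python title="session_check.py"
from urllib.parse import urlparse

def session_expired(final_url: str) -> bool:
    """True if the request was redirected to Glassdoor's login page,
    the signal that the session cookies need refreshing."""
    return urlparse(final_url).path.startswith("/profile/login")
```

Check this after every authenticated fetch and re-authenticate before retrying, rather than parsing the login page as if it were data.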

Hard pagination cap — Glassdoor limits job search results to 30 pages (300 results) per query regardless of how many matching listings exist. Paginating past page 30 returns the first page again. The correct approach is to narrow queries by location, fromAge (days posted), or jobType parameter rather than paginating deeper on a broad query.
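In code, that means fanning a broad query out into many narrow ones instead of paging deeper. A sketch using the `fromAge` and `jobType` parameters mentioned above — the specific parameter values here are illustrative, not an exhaustive list of what Glassdoor accepts:

```python title="narrow_queries.py"
from urllib.parse import urlencode

def narrowed_queries(base_url: str,
                     from_ages=(1, 3, 7, 14),
                     job_types=("fulltime", "contract")) -> list[str]:
    """Expand one broad search URL into narrower variants, each of
    which is more likely to stay under the 30-page cap."""
    return [
        f"{base_url}?{urlencode({'fromAge': age, 'jobType': jt})}"
        for age in from_ages
        for jt in job_types
    ]
```

Deduplicate the combined results on a stable key (e.g. the job listing URL), since narrowed queries can overlap.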

Selector drift — Glassdoor ships frontend updates frequently. Class names change with every React build. The data-test attributes documented above are more stable, but they can also shift. Build result-count validation into your pipeline: if a parse returns zero records, treat that as a selector failure, not an empty result set, and alert.
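The validation step is simple to wire in. A sketch of the zero-records guard — the exception name is ours, and the alerting hook is whatever your pipeline already uses:

```python title="validate_parse.py"
class SelectorDriftError(Exception):
    """Raised when a parse that should have matched records returns none."""

def validate_parse(records: list) -> list:
    """Treat an empty parse as a selector failure, not an empty result
    set, so drifted selectors page someone instead of silently
    writing zero rows to the warehouse."""
    if not records:
        raise SelectorDriftError("0 records parsed - check data-test selectors")
    return records
```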

Hydration timing — Even with render_js=True, returning content before React has finished hydrating gives you an empty shell. Always set wait_for to a CSS selector matching a target element, not just a fixed timeout. The element-based wait adapts to variable page load times automatically.

## Scaling Up

### Batch Requests

For bulk collection across many search permutations — dozens of cities, multiple job titles, rolling date windows — the AlterLab batch endpoint processes URLs in parallel and is significantly more efficient than sequential requests:

```python title="batch_glassdoor.py" {5-20}
import json

import alterlab

client = alterlab.Client("YOUR_API_KEY")

cities = [
    ("new-york-city", "IC1132348"),
    ("san-francisco", "IC1147401"),
    ("austin", "IC1139761"),
    ("seattle", "IC1150505"),
    ("chicago", "IC1128808"),
]

urls = [
    f"https://www.glassdoor.com/Job/python-engineer-{slug}-jobs-SRCH_IL.0,{len(slug)}_IC{code}_IP{page}.htm"
    for slug, code in cities
    for page in range(1, 11)  # 10 pages × 5 cities = 50 requests
]

results = client.batch_scrape(
    urls=urls,
    render_js=True,
    wait_for="[data-test='jobListing']",
    concurrency=10,
)

with open("glassdoor_jobs.jsonl", "w") as f:
    for r in results:
        if r.success:
            f.write(json.dumps({"url": r.url, "html": r.html}) + "\n")
```
### Scheduling Recurring Pipelines

For daily job market snapshots or weekly salary index updates, wire the scraper to a scheduler. APScheduler is lightweight and runs in-process without a separate queue service:



```python title="scheduler.py" {8-16}
import alterlab
from apscheduler.schedulers.blocking import BlockingScheduler

from parse_jobs import parse_job_listings

client = alterlab.Client("YOUR_API_KEY")
scheduler = BlockingScheduler()

@scheduler.scheduled_job("cron", hour=2, minute=0)  # 02:00 daily
def daily_glassdoor_pull():
    roles = ["software-engineer", "data-engineer", "product-manager", "ml-engineer"]
    for role in roles:
        url = f"https://www.glassdoor.com/Job/{role}-jobs-SRCH_KO0,{len(role)}.htm"
        response = client.scrape(url, render_js=True, wait_for="[data-test='jobListing']")
        jobs = parse_job_listings(response.html)
        store_to_warehouse(jobs)   # your storage layer here

scheduler.start()
```

### Cost Management at Scale

Not every Glassdoor page requires full JavaScript execution. Company overview pages and some listing shells partially pre-render server-side. Profile your target URLs: attempt a plain HTML fetch first and check whether your target selectors are present. Use render_js=False wherever possible — it is faster and consumes fewer credits. Reserve JS rendering for pages that require it.
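The probe can be as simple as checking whether your target selector's marker appears in the plain HTML. A sketch — the substring check is a rough stand-in for a real selector match, and it allows for either quoting style in the served markup:

```python title="render_probe.py"
def needs_js_render(plain_html: str,
                    marker: str = 'data-test="jobListing"') -> bool:
    """Decide whether a URL needs render_js=True: if the target
    marker is already in the server-rendered HTML, a cheap plain
    fetch is enough."""
    return (marker not in plain_html
            and marker.replace('"', "'") not in plain_html)
```

Run the probe once per URL pattern during development, then hard-code the `render_js` choice per pattern in production rather than paying for the probe on every request.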

Review AlterLab pricing for credit consumption rates broken down by request type before sizing your pipeline budget.

## Key Takeaways

  • JavaScript rendering is not optional — Glassdoor's content is React-rendered. A plain HTTP fetch returns a shell. Always set render_js=True and use wait_for with a target element selector.
  • Cloudflare is the primary blocker — do not spend engineering cycles maintaining your own bypass. It is a dependency, not a competitive advantage.
  • Prefer data-test attributes over class names — class names change with every build. data-test attributes are intentionally stable for testing and are your most reliable selection strategy.
  • Salary data requires authentication — pass valid session cookies and implement refresh logic for any pipeline running longer than a few hours.
  • Respect the 30-page cap — use query narrowing (location, date posted, job type) rather than deep pagination to collect comprehensive datasets.
  • Batch and schedule deliberately — sequential requests are fine for development; batch endpoints with concurrency control are essential for production pipelines.

## Related Guides

  • How to Scrape LinkedIn — professional profiles, company pages, and job postings behind one of the web's most aggressive anti-bot stacks
  • How to Scrape Indeed — job listings and employer reviews with simpler authentication requirements than Glassdoor
  • How to Scrape Amazon — product pricing, reviews, and inventory data at scale with dynamic rendering handled
