Salary transparency is reshaping hiring. Companies benchmark compensation, job seekers negotiate better, and recruiters calibrate offers — all using salary data. Glassdoor holds one of the richest salary datasets on the web. Here's how to extract it programmatically.
Why Glassdoor Salary Data?
Glassdoor has self-reported salary data across thousands of companies, roles, and locations. Unlike BLS data (lagging 1-2 years), Glassdoor reflects current market conditions. For building compensation tools, it's invaluable.
Technical Challenges
Glassdoor uses aggressive anti-bot measures:
- Required login for salary views
- Cloudflare protection
- Dynamic JavaScript rendering
- Session-based content gating
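A plain `requests.get` against these defenses usually returns a challenge page rather than salary data, so it pays to detect blocks early instead of parsing garbage. A minimal heuristic sketch (the marker strings are common Cloudflare/login-wall phrases, not an exhaustive or official list):

```python
# Common phrases seen on challenge and login-wall pages (heuristic, not exhaustive)
BLOCK_MARKERS = ("Just a moment...", "cf-challenge", "Sign In to Continue")

def looks_blocked(status_code, html):
    """Heuristic: did this response hit a bot challenge or login wall?"""
    if status_code in (403, 429, 503):  # typical anti-bot status codes
        return True
    return any(marker in html for marker in BLOCK_MARKERS)
```

Run this check on every response before parsing; a sudden spike in blocked responses is the signal to slow down or rotate infrastructure.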
Setting Up the Scraper
```python
import requests
from bs4 import BeautifulSoup
import json
import time
import csv

API_KEY = "YOUR_SCRAPERAPI_KEY"

def scrape_glassdoor_salaries(company_slug, num_pages=5):
    salaries = []
    for page in range(1, num_pages + 1):
        # NOTE: "E1234" is a placeholder employer ID -- each company has its
        # own ID in the real Glassdoor URL, which you must look up first.
        url = f"https://www.glassdoor.com/Salary/{company_slug}-Salaries-E1234_P{page}.htm"
        params = {
            "api_key": API_KEY,
            "url": url,
            "render": "true",
            "country_code": "us",
        }
        response = requests.get(
            "https://api.scraperapi.com",
            params=params,
            timeout=60,
        )
        if response.status_code == 200:
            salaries.extend(parse_salary_page(response.text))
        time.sleep(2)  # Be respectful
    return salaries
```
ScraperAPI handles the Cloudflare bypass and JavaScript rendering, both of which are essential for Glassdoor's dynamically rendered salary cards.
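Even through a proxy API, transient failures (timeouts, upstream 500s) are routine, so a small retry helper with exponential backoff keeps long runs resilient. A sketch, with the fetch function passed in so it can be exercised without network access (`fetch_with_retries` is a hypothetical helper, not a ScraperAPI feature):

```python
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0):
    """Call fetch(url) -> (status, body) until status is 200,
    backing off exponentially between attempts."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")
```

Wrap the `requests.get` call from the scraper in a small lambda and pass it here; the backoff also doubles as extra politeness toward the target site.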
Parsing Salary Cards
```python
def parse_salary_page(html):
    soup = BeautifulSoup(html, "html.parser")
    results = []
    salary_cards = soup.find_all("div", {"data-test": "salaries-list-item"})
    for card in salary_cards:
        title_el = card.find("a", {"data-test": "job-title"})
        pay_el = card.find("div", {"data-test": "salary-amount"})
        range_el = card.find("span", {"data-test": "salary-range"})
        count_el = card.find("div", {"data-test": "salary-count"})
        if title_el and pay_el:
            results.append({
                "job_title": title_el.get_text(strip=True),
                "median_pay": pay_el.get_text(strip=True),
                "pay_range": range_el.get_text(strip=True) if range_el else None,
                "sample_count": count_el.get_text(strip=True) if count_el else None,
            })
    return results
```
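Before burning API credits, it's worth sanity-checking the selectors against a hand-written fixture. The `data-test` attributes below mirror what the parser expects; Glassdoor's real markup may differ and does change over time, so treat this as a test harness for your own selectors, not ground truth:

```python
from bs4 import BeautifulSoup

# Hand-written fixture mimicking one salary card
SAMPLE = """
<div data-test="salaries-list-item">
  <a data-test="job-title">Software Engineer</a>
  <div data-test="salary-amount">$120K/yr</div>
  <span data-test="salary-range">$95K - $160K</span>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
card = soup.find("div", {"data-test": "salaries-list-item"})
title = card.find("a", {"data-test": "job-title"}).get_text(strip=True)
pay = card.find("div", {"data-test": "salary-amount"}).get_text(strip=True)
print(title, pay)  # Software Engineer $120K/yr
```

When a production run suddenly returns zero cards, re-running this fixture tells you whether the parser broke or the markup did.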
Building a Compensation Benchmark
```python
import pandas as pd
import re

def normalize_salary(salary_str):
    """Convert '$120K/yr' to 120000"""
    cleaned = re.sub(r"[^\d.]", "", salary_str.replace("K", "000"))
    return float(cleaned) if cleaned else None

def build_benchmark(companies, role):
    all_data = []
    for company in companies:
        # NOTE: this assumes the display name works as a URL slug; in
        # practice you need each company's Glassdoor slug and employer ID.
        salaries = scrape_glassdoor_salaries(company)
        for s in salaries:
            if role.lower() in s["job_title"].lower():
                s["company"] = company
                s["normalized_pay"] = normalize_salary(s["median_pay"])
                all_data.append(s)
    df = pd.DataFrame(all_data)
    benchmark = df.groupby("company")["normalized_pay"].agg(
        ["median", "min", "max", "count"]
    ).round(0)
    return benchmark

# Example: benchmark Software Engineer salaries
companies = ["Google", "Meta", "Amazon", "Microsoft", "Apple"]
benchmark = build_benchmark(companies, "Software Engineer")
print(benchmark.to_string())
```
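Pay strings come in more shapes than `$120K/yr`: the simple `replace("K", "000")` trick mangles decimal values like `$7.5K`. A hypothetical variant that also handles decimals and thousands separators, shown self-contained so it can be verified in isolation:

```python
import re

def normalize_salary(salary_str):
    """Convert '$120K/yr', '$85,500', or '$7.5K/mo' to a float."""
    match = re.search(r"(\d[\d,]*\.?\d*)\s*([Kk])?", salary_str)
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    return value * 1000 if match.group(2) else value

print(normalize_salary("$120K/yr"))  # 120000.0
print(normalize_salary("$85,500"))   # 85500.0
print(normalize_salary("$7.5K/mo"))  # 7500.0
```

Whichever version you use, spot-check it against the actual strings in your scraped output before trusting the benchmark.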
Adding Location Normalization
Salaries vary dramatically by location. Add cost-of-living adjustments:
```python
# Illustrative cost-of-living multipliers (SF = 1.0); replace with real COL data
COL_INDEX = {
    "San Francisco": 1.0,
    "New York": 0.95,
    "Seattle": 0.88,
    "Austin": 0.72,
    "Remote": 0.80,
}

def adjust_for_location(salary, location):
    factor = COL_INDEX.get(location, 0.80)  # default for unlisted cities
    return salary / factor  # Normalize to SF equivalent
```
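For example, under the illustrative index above, a $180K Austin salary works out to roughly the same purchasing power as $250K in San Francisco:

```python
# Illustrative multipliers, matching the sketch above
COL_INDEX = {"San Francisco": 1.0, "Austin": 0.72}

def adjust_for_location(salary, location):
    return salary / COL_INDEX.get(location, 0.80)

print(adjust_for_location(180_000, "Austin"))  # ~250000 (SF-equivalent)
```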
Storing and Tracking Over Time
```python
import sqlite3
from datetime import datetime, timezone

def store_salaries(salaries, db_path="salaries.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS salary_data (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            company TEXT, job_title TEXT,
            median_pay REAL, scraped_at TEXT
        )
    """)
    for s in salaries:
        conn.execute(
            "INSERT INTO salary_data (company, job_title, median_pay, scraped_at) VALUES (?, ?, ?, ?)",
            (s["company"], s["job_title"], s.get("normalized_pay"),
             datetime.now(timezone.utc).isoformat()),  # utcnow() is deprecated
        )
    conn.commit()
    conn.close()
```
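Once snapshots accumulate, per-company trends fall out of a single SQL query. A sketch against the schema above, using an in-memory database and made-up sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE salary_data (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        company TEXT, job_title TEXT,
        median_pay REAL, scraped_at TEXT
    )
""")
# Illustrative sample rows -- two monthly snapshots
rows = [
    ("Google", "Software Engineer", 180000, "2024-01-15T00:00:00"),
    ("Google", "Software Engineer", 186000, "2024-02-15T00:00:00"),
]
conn.executemany(
    "INSERT INTO salary_data (company, job_title, median_pay, scraped_at) VALUES (?, ?, ?, ?)",
    rows,
)
# Average pay per company per month (substr pulls 'YYYY-MM' from the timestamp)
trend = list(conn.execute("""
    SELECT company, substr(scraped_at, 1, 7) AS month, AVG(median_pay)
    FROM salary_data GROUP BY company, month ORDER BY month
"""))
print(trend)  # [('Google', '2024-01', 180000.0), ('Google', '2024-02', 186000.0)]
```

Month-over-month deltas from this query are exactly the "real-time" signal that annual compensation surveys can't provide.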
Scaling Across Companies
When benchmarking dozens of companies, proxy rotation prevents blocks. ThorData residential proxies mimic real user traffic patterns, and for aggregating across multiple scraping providers, ScrapeOps monitors success rates and auto-switches between them.
Use Cases
- HR teams: Calibrate offer letters against market rates
- Job seekers: Identify if an offer is below market before negotiating
- Recruiters: Set competitive salary bands by role and location
- Startups: Benchmark against FAANG without expensive compensation surveys
Ethical Considerations
Glassdoor's data is user-contributed and semi-public. Use scraping for research and benchmarking — not to deanonymize contributors. Always respect rate limits and robots.txt directives.
Compensation data is power. With programmatic access, you can build real-time benchmarking tools that outperform expensive surveys. The key is reliable extraction and consistent normalization.
Happy scraping!