Salary transparency is reshaping hiring. Companies benchmark compensation, job seekers negotiate better, and recruiters calibrate offers — all using salary data. Glassdoor holds one of the richest salary datasets on the web. Here's how to extract it programmatically.
Why Glassdoor Salary Data?
Glassdoor has self-reported salary data across thousands of companies, roles, and locations. Unlike BLS data (lagging 1-2 years), Glassdoor reflects current market conditions. For building compensation tools, it's invaluable.
Technical Challenges
Glassdoor uses aggressive anti-bot measures:
- Required login for salary views
- Cloudflare protection
- Dynamic JavaScript rendering
- Session-based content gating
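A plain `requests.get` against these defenses usually returns a challenge page rather than salary data, so it pays to detect blocks early instead of parsing garbage. A minimal heuristic sketch (the marker strings are common Cloudflare/login-wall phrases, not an exhaustive or official list):

```python
# Common phrases seen on challenge and login-wall pages (heuristic, not exhaustive)
BLOCK_MARKERS = ("Just a moment...", "cf-challenge", "Sign In to Continue")

def looks_blocked(status_code, html):
    """Heuristic: did this response hit a bot challenge or login wall?"""
    if status_code in (403, 429, 503):  # typical anti-bot status codes
        return True
    return any(marker in html for marker in BLOCK_MARKERS)
```

Run this check on every response before parsing; a sudden spike in blocked responses is the signal to slow down or rotate infrastructure.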
Setting Up the Scraper
```python
import requests
from bs4 import BeautifulSoup
import json
import time
import csv

API_KEY = "YOUR_SCRAPERAPI_KEY"

def scrape_glassdoor_salaries(company_slug, num_pages=5):
    salaries = []
    for page in range(1, num_pages + 1):
        # NOTE: "E1234" is a placeholder employer ID -- each company has its
        # own ID in the real Glassdoor URL, which you must look up first.
        url = f"https://www.glassdoor.com/Salary/{company_slug}-Salaries-E1234_P{page}.htm"
        params = {
            "api_key": API_KEY,
            "url": url,
            "render": "true",
            "country_code": "us",
        }
        response = requests.get(
            "https://api.scraperapi.com",
            params=params,
            timeout=60,
        )
        if response.status_code == 200:
            salaries.extend(parse_salary_page(response.text))
        time.sleep(2)  # Be respectful
    return salaries
```
ScraperAPI handles the Cloudflare bypass and JavaScript rendering, both of which are essential for Glassdoor's dynamically rendered salary cards.
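Even through a proxy API, transient failures (timeouts, upstream 500s) are routine, so a small retry helper with exponential backoff keeps long runs resilient. A sketch, with the fetch function passed in so it can be exercised without network access (`fetch_with_retries` is a hypothetical helper, not a ScraperAPI feature):

```python
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0):
    """Call fetch(url) -> (status, body) until status is 200,
    backing off exponentially between attempts."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")
```

Wrap the `requests.get` call from the scraper in a small lambda and pass it here; the backoff also doubles as extra politeness toward the target site.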
Parsing Salary Cards
```python
def parse_salary_page(html):
    soup = BeautifulSoup(html, "html.parser")
    results = []
    salary_cards = soup.find_all("div", {"data-test": "salaries-list-item"})
    for card in salary_cards:
        title_el = card.find("a", {"data-test": "job-title"})
        pay_el = card.find("div", {"data-test": "salary-amount"})
        range_el = card.find("span", {"data-test": "salary-range"})
        count_el = card.find("div", {"data-test": "salary-count"})
        if title_el and pay_el:
            results.append({
                "job_title": title_el.get_text(strip=True),
                "median_pay": pay_el.get_text(strip=True),
                "pay_range": range_el.get_text(strip=True) if range_el else None,
                "sample_count": count_el.get_text(strip=True) if count_el else None,
            })
    return results
```
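Before burning API credits, it's worth sanity-checking the selectors against a hand-written fixture. The `data-test` attributes below mirror what the parser expects; Glassdoor's real markup may differ and does change over time, so treat this as a test harness for your own selectors, not ground truth:

```python
from bs4 import BeautifulSoup

# Hand-written fixture mimicking one salary card
SAMPLE = """
<div data-test="salaries-list-item">
  <a data-test="job-title">Software Engineer</a>
  <div data-test="salary-amount">$120K/yr</div>
  <span data-test="salary-range">$95K - $160K</span>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
card = soup.find("div", {"data-test": "salaries-list-item"})
title = card.find("a", {"data-test": "job-title"}).get_text(strip=True)
pay = card.find("div", {"data-test": "salary-amount"}).get_text(strip=True)
print(title, pay)  # Software Engineer $120K/yr
```

When a production run suddenly returns zero cards, re-running this fixture tells you whether the parser broke or the markup did.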
Building a Compensation Benchmark
```python
import pandas as pd
import re

def normalize_salary(salary_str):
    """Convert '$120K/yr' to 120000"""
    cleaned = re.sub(r"[^\d.]", "", salary_str.replace("K", "000"))
    return float(cleaned) if cleaned else None

def build_benchmark(companies, role):
    all_data = []
    for company in companies:
        # NOTE: this assumes the display name works as a URL slug; in
        # practice you need each company's Glassdoor slug and employer ID.
        salaries = scrape_glassdoor_salaries(company)
        for s in salaries:
            if role.lower() in s["job_title"].lower():
                s["company"] = company
                s["normalized_pay"] = normalize_salary(s["median_pay"])
                all_data.append(s)
    df = pd.DataFrame(all_data)
    benchmark = df.groupby("company")["normalized_pay"].agg(
        ["median", "min", "max", "count"]
    ).round(0)
    return benchmark

# Example: benchmark Software Engineer salaries
companies = ["Google", "Meta", "Amazon", "Microsoft", "Apple"]
benchmark = build_benchmark(companies, "Software Engineer")
print(benchmark.to_string())
```
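Pay strings come in more shapes than `$120K/yr`: the simple `replace("K", "000")` trick mangles decimal values like `$7.5K`. A hypothetical variant that also handles decimals and thousands separators, shown self-contained so it can be verified in isolation:

```python
import re

def normalize_salary(salary_str):
    """Convert '$120K/yr', '$85,500', or '$7.5K/mo' to a float."""
    match = re.search(r"(\d[\d,]*\.?\d*)\s*([Kk])?", salary_str)
    if not match:
        return None
    value = float(match.group(1).replace(",", ""))
    return value * 1000 if match.group(2) else value

print(normalize_salary("$120K/yr"))  # 120000.0
print(normalize_salary("$85,500"))   # 85500.0
print(normalize_salary("$7.5K/mo"))  # 7500.0
```

Whichever version you use, spot-check it against the actual strings in your scraped output before trusting the benchmark.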
Adding Location Normalization
Salaries vary dramatically by location. Add cost-of-living adjustments:
```python
# Illustrative cost-of-living multipliers (SF = 1.0); replace with real COL data
COL_INDEX = {
    "San Francisco": 1.0,
    "New York": 0.95,
    "Seattle": 0.88,
    "Austin": 0.72,
    "Remote": 0.80,
}

def adjust_for_location(salary, location):
    factor = COL_INDEX.get(location, 0.80)  # default for unlisted cities
    return salary / factor  # Normalize to SF equivalent
```
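For example, under the illustrative index above, a $180K Austin salary works out to roughly the same purchasing power as $250K in San Francisco:

```python
# Illustrative multipliers, matching the sketch above
COL_INDEX = {"San Francisco": 1.0, "Austin": 0.72}

def adjust_for_location(salary, location):
    return salary / COL_INDEX.get(location, 0.80)

print(adjust_for_location(180_000, "Austin"))  # ~250000 (SF-equivalent)
```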
Storing and Tracking Over Time
```python
import sqlite3
from datetime import datetime, timezone

def store_salaries(salaries, db_path="salaries.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS salary_data (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            company TEXT, job_title TEXT,
            median_pay REAL, scraped_at TEXT
        )
    """)
    for s in salaries:
        conn.execute(
            "INSERT INTO salary_data (company, job_title, median_pay, scraped_at) VALUES (?, ?, ?, ?)",
            (s["company"], s["job_title"], s.get("normalized_pay"),
             datetime.now(timezone.utc).isoformat()),  # utcnow() is deprecated
        )
    conn.commit()
    conn.close()
```
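Once snapshots accumulate, per-company trends fall out of a single SQL query. A sketch against the schema above, using an in-memory database and made-up sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE salary_data (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        company TEXT, job_title TEXT,
        median_pay REAL, scraped_at TEXT
    )
""")
# Illustrative sample rows -- two monthly snapshots
rows = [
    ("Google", "Software Engineer", 180000, "2024-01-15T00:00:00"),
    ("Google", "Software Engineer", 186000, "2024-02-15T00:00:00"),
]
conn.executemany(
    "INSERT INTO salary_data (company, job_title, median_pay, scraped_at) VALUES (?, ?, ?, ?)",
    rows,
)
# Average pay per company per month (substr pulls 'YYYY-MM' from the timestamp)
trend = list(conn.execute("""
    SELECT company, substr(scraped_at, 1, 7) AS month, AVG(median_pay)
    FROM salary_data GROUP BY company, month ORDER BY month
"""))
print(trend)  # [('Google', '2024-01', 180000.0), ('Google', '2024-02', 186000.0)]
```

Month-over-month deltas from this query are exactly the "real-time" signal that annual compensation surveys can't provide.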
Scaling Across Companies
When benchmarking dozens of companies, proxy rotation prevents blocks. ThorData residential proxies mimic real user traffic patterns, and for aggregating across multiple scraping providers, ScrapeOps monitors success rates and auto-switches between them.
Use Cases
- HR teams: Calibrate offer letters against market rates
- Job seekers: Identify if an offer is below market before negotiating
- Recruiters: Set competitive salary bands by role and location
- Startups: Benchmark against FAANG without expensive compensation surveys
Ethical Considerations
Glassdoor's data is user-contributed and semi-public. Use scraping for research and benchmarking — not to deanonymize contributors. Always respect rate limits and robots.txt directives.
Compensation data is power. With programmatic access, you can build real-time benchmarking tools that outperform expensive surveys. The key is reliable extraction and consistent normalization.
Happy scraping!