
agenthustler

Glassdoor Salary Scraping: Extract Compensation Data and Company Reviews

Salary data and company reviews are among the most sought-after datasets in the job market. Glassdoor has built an empire on crowdsourced compensation data, company reviews, interview experiences, and CEO approval ratings. For HR professionals, recruiters, compensation analysts, and job seekers, having programmatic access to this data is transformative.

In this guide, we'll walk through how to extract salary reports, company reviews, interview data, and CEO ratings from Glassdoor using web scraping techniques and the Apify cloud platform.

Why Glassdoor Data Matters

Glassdoor hosts one of the largest collections of employee-contributed workplace data:

  • Salary Reports: Over 100 million salary reports across every industry and role
  • Company Reviews: Detailed employee reviews with pros, cons, and ratings across multiple dimensions
  • Interview Experiences: Step-by-step interview process descriptions with difficulty ratings
  • CEO Approval Ratings: Leadership ratings that correlate with company performance
  • Benefits Reviews: Detailed breakdowns of company benefits packages

This data powers critical business decisions:

  • HR/Compensation teams benchmark salaries against market rates
  • Recruiters understand candidate expectations before making offers
  • Job seekers negotiate from a position of knowledge
  • Investors use employee sentiment as a leading indicator
  • Researchers study labor market dynamics and workplace trends

Understanding Glassdoor's Data Structure

Glassdoor organizes data around companies, with each company having multiple data sections.

Company Overview

Each company profile at glassdoor.com/Overview/company-overview-{id}.htm contains:

Company Name
Overall Rating (1-5 stars)
Number of Reviews
CEO Name and Approval Rating
Recommend to Friend %
Industry
Company Size (employees)
Revenue Range
Headquarters Location
Founded Year
Company Type (Public, Private, etc.)
Website
Competitors

Salary Data

Salary reports at glassdoor.com/Salary/company-salaries-{id}.htm include:

Job Title
Base Pay (range: low, average, high)
Additional Pay (bonuses, stock, tips)
Total Pay Range
Number of Salaries Reported
Pay by Experience Level
Pay by Location
Pay Trend (year over year)

Company Reviews

Reviews at glassdoor.com/Reviews/company-reviews-{id}.htm contain:

Overall Rating (1-5)
Title/Summary
Pros (text)
Cons (text)
Advice to Management (text)
Rating Breakdown:
  - Work/Life Balance
  - Culture & Values
  - Diversity & Inclusion
  - Career Opportunities
  - Compensation & Benefits
  - Senior Management
Employment Status (Current/Former)
Job Title
Location
Date Posted
Helpful Count

Interview Data

Interview reviews at glassdoor.com/Interview/company-interview-{id}.htm:

Job Title Applied For
Application Method
Interview Experience (Positive/Neutral/Negative)
Interview Difficulty (1-5)
Offer Status (Accepted/Declined/No Offer)
Interview Questions
Interview Process Description
Date of Interview
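The section URLs above follow a predictable pattern, so a small helper can derive them all from a company slug and numeric employer ID. This is a sketch based on the patterns shown in this section; the exact URL shapes (especially the overview path) are assumptions, so verify against live pages before relying on them.

```javascript
// Derive Glassdoor section URLs from a company slug and numeric employer ID.
// The URL shapes here are assumptions based on the patterns above.
function buildGlassdoorUrls(slug, employerId) {
    const base = 'https://www.glassdoor.com';
    return {
        overview: `${base}/Overview/Working-at-${slug}-EI_IE${employerId}.htm`,
        salaries: `${base}/Salary/${slug}-Salaries-E${employerId}.htm`,
        reviews: `${base}/Reviews/${slug}-Reviews-E${employerId}.htm`,
        interviews: `${base}/Interview/${slug}-Interview-Questions-E${employerId}.htm`,
    };
}

console.log(buildGlassdoorUrls('Google', 9079).salaries);
// -> https://www.glassdoor.com/Salary/Google-Salaries-E9079.htm
```

Centralizing URL construction in one place makes it easy to adjust if Glassdoor changes its routing.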

Technical Approach

Glassdoor is a JavaScript-heavy application with sophisticated anti-bot measures. A browser-based approach with proper session management is essential.

Project Setup

mkdir glassdoor-scraper
cd glassdoor-scraper
npm init -y
npm install crawlee puppeteer apify

Salary Data Scraper

Here's a comprehensive scraper for extracting salary information:

import { PuppeteerCrawler, Dataset } from 'crawlee';

const crawler = new PuppeteerCrawler({
    maxConcurrency: 1,
    maxRequestsPerMinute: 10,
    navigationTimeoutSecs: 60,

    async requestHandler({ page, request, log }) {
        const { label } = request.userData;

        if (label === 'SALARY_LIST') {
            log.info('Scraping salary page: ' + request.url);

            await page.waitForSelector('[data-test="salaries-list"]', { timeout: 20000 });

            const salaries = await page.evaluate(() => {
                const rows = document.querySelectorAll('[data-test="salary-row"]');
                return Array.from(rows).map(row => {
                    const jobTitle = row.querySelector('[data-test="salary-job-title"]')?.textContent?.trim();
                    const basePay = row.querySelector('[data-test="base-pay-amount"]')?.textContent?.trim();
                    const additionalPay = row.querySelector('[data-test="additional-pay"]')?.textContent?.trim();
                    const totalPay = row.querySelector('[data-test="total-pay-amount"]')?.textContent?.trim();
                    const numSalaries = row.querySelector('[data-test="num-salaries"]')?.textContent?.trim();
                    const payRange = row.querySelector('[data-test="pay-range"]')?.textContent?.trim();
                    const detailUrl = row.querySelector('a[data-test="salary-detail-link"]')?.href;

                    return { jobTitle, basePay, additionalPay, totalPay, numSalaries, payRange, detailUrl };
                });
            });

            const companyName = await page.$eval(
                '[data-test="employer-name"]',
                el => el.textContent?.trim()
            ).catch(() => 'Unknown');

            for (const salary of salaries) {
                await Dataset.pushData({
                    ...salary,
                    companyName,
                    sourceUrl: request.url,
                    scrapedAt: new Date().toISOString()
                });
            }

            const nextPage = await page.$('[data-test="pagination-next"]:not([disabled])');
            if (nextPage) {
                const nextUrl = await nextPage.evaluate(el => el.href);
                await crawler.addRequests([{
                    url: nextUrl,
                    userData: { label: 'SALARY_LIST' }
                }]);
            }

            log.info('Extracted ' + salaries.length + ' salary entries for ' + companyName);
        }
    }
});

await crawler.run([{
    url: 'https://www.glassdoor.com/Salary/Google-Salaries-E9079.htm',
    userData: { label: 'SALARY_LIST' }
}]);
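The scraper above follows pagination until no next button is found, which on large companies can mean hundreds of pages. A small guard, tracking a page counter in `userData` and comparing it against a cap, keeps runs bounded (the helper below is a hypothetical addition, mirroring the `maxPages` input the Actor version of this scraper accepts):

```javascript
// Decide whether to enqueue the next results page, capping total depth.
// currentPage is 1-based; maxPages bounds how many pages we visit in total.
function shouldEnqueueNextPage(nextUrl, currentPage, maxPages = 10) {
    return Boolean(nextUrl) && currentPage < maxPages;
}

console.log(shouldEnqueueNextPage('https://example.com/page2', 1));  // -> true
console.log(shouldEnqueueNextPage(undefined, 1));                    // -> false
console.log(shouldEnqueueNextPage('https://example.com/page11', 10)); // -> false
```

Inside the request handler you would pass `request.userData.page` as `currentPage` and increment it on each enqueued request.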

Company Reviews Scraper

Extracting detailed company reviews with rating breakdowns:

import { PuppeteerCrawler, Dataset } from 'crawlee';

const reviewsCrawler = new PuppeteerCrawler({
    maxConcurrency: 1,
    maxRequestsPerMinute: 8,

    async requestHandler({ page, request, log }) {
        log.info('Scraping reviews: ' + request.url);

        await page.waitForSelector('[data-test="review-list"]', { timeout: 20000 });

        const companyRatings = await page.evaluate(() => {
            const ratingSection = document.querySelector('[data-test="rating-info"]');
            if (!ratingSection) return {};

            return {
                overallRating: ratingSection.querySelector('[data-test="rating-headline"]')?.textContent?.trim(),
                recommendPercent: ratingSection.querySelector('[data-test="recommend-pct"]')?.textContent?.trim(),
                ceoApproval: ratingSection.querySelector('[data-test="ceo-approval"]')?.textContent?.trim(),
                ceoName: ratingSection.querySelector('[data-test="ceo-name"]')?.textContent?.trim(),
            };
        });

        const reviews = await page.evaluate(() => {
            const reviewElements = document.querySelectorAll('[data-test="review-list-item"]');

            return Array.from(reviewElements).map(review => {
                const subRatings = {};
                const ratingBars = review.querySelectorAll('[class*="subRating"]');
                ratingBars.forEach(bar => {
                    const label = bar.querySelector('[class*="ratingLabel"]')?.textContent?.trim();
                    const value = bar.querySelector('[class*="ratingValue"]')?.textContent?.trim();
                    if (label && value) subRatings[label] = parseFloat(value);
                });

                return {
                    rating: review.querySelector('[class*="ratingNumber"]')?.textContent?.trim(),
                    title: review.querySelector('[data-test="review-title"]')?.textContent?.trim(),
                    pros: review.querySelector('[data-test="review-pros"]')?.textContent?.trim(),
                    cons: review.querySelector('[data-test="review-cons"]')?.textContent?.trim(),
                    advice: review.querySelector('[data-test="review-advice"]')?.textContent?.trim(),
                    employeeStatus: review.querySelector('[data-test="employee-status"]')?.textContent?.trim(),
                    jobTitle: review.querySelector('[data-test="review-job-title"]')?.textContent?.trim(),
                    location: review.querySelector('[data-test="review-location"]')?.textContent?.trim(),
                    date: review.querySelector('[data-test="review-date"]')?.textContent?.trim(),
                    helpfulCount: review.querySelector('[data-test="helpful-count"]')?.textContent?.trim(),
                    subRatings
                };
            });
        });

        for (const review of reviews) {
            await Dataset.pushData({
                ...review,
                companyRatings,
                sourceUrl: request.url,
                scrapedAt: new Date().toISOString()
            });
        }

        log.info('Extracted ' + reviews.length + ' reviews');

        const nextBtn = await page.$('[data-test="pagination-next"]:not([disabled])');
        if (nextBtn) {
            const nextUrl = await nextBtn.evaluate(el => el.href);
            await reviewsCrawler.addRequests([{
                url: nextUrl,
                userData: { label: 'REVIEWS' }
            }]);
        }
    }
});

await reviewsCrawler.run([{
    url: 'https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm',
    userData: { label: 'REVIEWS' }
}]);
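Once reviews are collected, the per-review `subRatings` objects can be rolled up into a company-level breakdown. A minimal aggregation sketch (the helper is hypothetical, but the field shape matches what the scraper above emits):

```javascript
// Average each sub-rating dimension across a set of scraped reviews.
function averageSubRatings(reviews) {
    const totals = {};
    for (const review of reviews) {
        for (const [label, value] of Object.entries(review.subRatings ?? {})) {
            totals[label] ??= { sum: 0, count: 0 };
            totals[label].sum += value;
            totals[label].count++;
        }
    }
    // Round to two decimals; dimensions missing from a review are simply skipped.
    return Object.fromEntries(
        Object.entries(totals).map(([label, t]) => [label, +(t.sum / t.count).toFixed(2)])
    );
}

const sampleReviews = [
    { subRatings: { 'Work/Life Balance': 4, 'Culture & Values': 3 } },
    { subRatings: { 'Work/Life Balance': 2 } },
];
console.log(averageSubRatings(sampleReviews));
// -> { 'Work/Life Balance': 3, 'Culture & Values': 3 }
```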

Interview Experience Scraper

import { PuppeteerCrawler, Dataset } from 'crawlee';

const interviewCrawler = new PuppeteerCrawler({
    maxConcurrency: 1,
    maxRequestsPerMinute: 8,

    async requestHandler({ page, request, log }) {
        log.info('Scraping interviews: ' + request.url);

        await page.waitForSelector('[data-test="interview-list"]', { timeout: 20000 });

        const interviewStats = await page.evaluate(() => {
            return {
                experienceBreakdown: {
                    positive: document.querySelector('[data-test="positive-pct"]')?.textContent?.trim(),
                    neutral: document.querySelector('[data-test="neutral-pct"]')?.textContent?.trim(),
                    negative: document.querySelector('[data-test="negative-pct"]')?.textContent?.trim(),
                },
                averageDifficulty: document.querySelector('[data-test="avg-difficulty"]')?.textContent?.trim(),
                applicationSources: Array.from(
                    document.querySelectorAll('[data-test="app-source"]')
                ).map(el => ({
                    source: el.querySelector('.source-name')?.textContent?.trim(),
                    percentage: el.querySelector('.source-pct')?.textContent?.trim()
                }))
            };
        });

        const interviews = await page.evaluate(() => {
            return Array.from(
                document.querySelectorAll('[data-test="interview-item"]')
            ).map(item => ({
                jobTitle: item.querySelector('[data-test="interview-job-title"]')?.textContent?.trim(),
                date: item.querySelector('[data-test="interview-date"]')?.textContent?.trim(),
                experience: item.querySelector('[data-test="interview-experience"]')?.textContent?.trim(),
                difficulty: item.querySelector('[data-test="difficulty-rating"]')?.textContent?.trim(),
                offer: item.querySelector('[data-test="offer-status"]')?.textContent?.trim(),
                applicationMethod: item.querySelector('[data-test="app-method"]')?.textContent?.trim(),
                process: item.querySelector('[data-test="interview-process"]')?.textContent?.trim(),
                questions: Array.from(
                    item.querySelectorAll('[data-test="interview-question"]')
                ).map(q => q.textContent?.trim()),
                helpfulCount: item.querySelector('[data-test="helpful-count"]')?.textContent?.trim()
            }));
        });

        for (const interview of interviews) {
            await Dataset.pushData({
                ...interview,
                interviewStats,
                sourceUrl: request.url,
                scrapedAt: new Date().toISOString()
            });
        }

        log.info('Extracted ' + interviews.length + ' interview reviews');
    }
});

await interviewCrawler.run([
    'https://www.glassdoor.com/Interview/Google-Interview-Questions-E9079.htm'
]);
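With interview records collected, offer rates and average difficulty fall out of a simple reduction. A sketch assuming `offer` holds one of the statuses listed earlier and `difficulty` is a numeric string (both hypothetical helpers, not part of any library):

```javascript
// Summarize scraped interview records: offer rate and mean difficulty.
function summarizeInterviews(interviews) {
    // "Accepted" and "Declined" both mean an offer was extended.
    const withOffer = interviews.filter(
        i => i.offer === 'Accepted' || i.offer === 'Declined'
    ).length;
    const difficulties = interviews
        .map(i => parseFloat(i.difficulty))
        .filter(d => !Number.isNaN(d));
    return {
        total: interviews.length,
        offerRate: interviews.length ? withOffer / interviews.length : 0,
        avgDifficulty: difficulties.length
            ? difficulties.reduce((a, b) => a + b, 0) / difficulties.length
            : null,
    };
}

const sampleInterviews = [
    { offer: 'Accepted', difficulty: '3' },
    { offer: 'No Offer', difficulty: '4' },
    { offer: 'Declined', difficulty: '2' },
];
console.log(summarizeInterviews(sampleInterviews));
// offerRate ≈ 0.67 and avgDifficulty 3 across these three records
```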

Deploying as an Apify Actor

Here's a complete Apify Actor that combines all scrapers with configurable input:

import { Actor } from 'apify';
import { PuppeteerCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput() ?? {};
const {
    companyUrl = '',
    dataTypes = ['salaries', 'reviews', 'interviews'],
    maxPages = 10,
} = input;

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});

const startUrls = [];

if (dataTypes.includes('salaries') && companyUrl) {
    const salaryUrl = companyUrl.replace('/Overview/', '/Salary/').replace('-overview-', '-salaries-');
    startUrls.push({ url: salaryUrl, userData: { label: 'SALARIES', page: 1 } });
}

if (dataTypes.includes('reviews') && companyUrl) {
    const reviewUrl = companyUrl.replace('/Overview/', '/Reviews/').replace('-overview-', '-reviews-');
    startUrls.push({ url: reviewUrl, userData: { label: 'REVIEWS', page: 1 } });
}

if (dataTypes.includes('interviews') && companyUrl) {
    const interviewUrl = companyUrl.replace('/Overview/', '/Interview/').replace('-overview-', '-interview-');
    startUrls.push({ url: interviewUrl, userData: { label: 'INTERVIEWS', page: 1 } });
}

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    maxConcurrency: 1,
    maxRequestsPerMinute: 8,
    navigationTimeoutSecs: 60,
    launchContext: {
        launchOptions: {
            headless: true,
            args: ['--no-sandbox', '--disable-setuid-sandbox']
        }
    },

    async requestHandler({ page, request, log }) {
        const { label, page: pageNum } = request.userData;
        log.info('Processing ' + label + ' page ' + pageNum + ': ' + request.url);

        switch (label) {
            case 'SALARIES':
                await scrapeSalaries(page, request, log, maxPages);
                break;
            case 'REVIEWS':
                await scrapeReviews(page, request, log, maxPages);
                break;
            case 'INTERVIEWS':
                await scrapeInterviews(page, request, log, maxPages);
                break;
        }
    },

    failedRequestHandler({ request, log }) {
        log.error('Failed: ' + request.url + ' - ' + request.errorMessages);
    }
});

async function scrapeSalaries(page, request, log, maxPages) {
    await page.waitForSelector('[data-test="salaries-list"]', { timeout: 20000 });

    const salaries = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('[data-test="salary-row"]')).map(row => ({
            type: 'salary',
            jobTitle: row.querySelector('[data-test="salary-job-title"]')?.textContent?.trim(),
            basePay: row.querySelector('[data-test="base-pay-amount"]')?.textContent?.trim(),
            totalPay: row.querySelector('[data-test="total-pay-amount"]')?.textContent?.trim(),
            numReports: row.querySelector('[data-test="num-salaries"]')?.textContent?.trim(),
        }));
    });

    await Dataset.pushData(salaries.map(s => ({
        ...s, sourceUrl: request.url, scrapedAt: new Date().toISOString()
    })));

    log.info('Got ' + salaries.length + ' salary entries');
}

async function scrapeReviews(page, request, log, maxPages) {
    await page.waitForSelector('[data-test="review-list"]', { timeout: 20000 });

    const reviews = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('[data-test="review-list-item"]')).map(r => ({
            type: 'review',
            rating: r.querySelector('[class*="ratingNumber"]')?.textContent?.trim(),
            title: r.querySelector('[data-test="review-title"]')?.textContent?.trim(),
            pros: r.querySelector('[data-test="review-pros"]')?.textContent?.trim(),
            cons: r.querySelector('[data-test="review-cons"]')?.textContent?.trim(),
            jobTitle: r.querySelector('[data-test="review-job-title"]')?.textContent?.trim(),
            date: r.querySelector('[data-test="review-date"]')?.textContent?.trim(),
        }));
    });

    await Dataset.pushData(reviews.map(r => ({
        ...r, sourceUrl: request.url, scrapedAt: new Date().toISOString()
    })));

    log.info('Got ' + reviews.length + ' reviews');
}

async function scrapeInterviews(page, request, log, maxPages) {
    await page.waitForSelector('[data-test="interview-list"]', { timeout: 20000 });

    const interviews = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('[data-test="interview-item"]')).map(item => ({
            type: 'interview',
            jobTitle: item.querySelector('[data-test="interview-job-title"]')?.textContent?.trim(),
            experience: item.querySelector('[data-test="interview-experience"]')?.textContent?.trim(),
            difficulty: item.querySelector('[data-test="difficulty-rating"]')?.textContent?.trim(),
            offer: item.querySelector('[data-test="offer-status"]')?.textContent?.trim(),
            process: item.querySelector('[data-test="interview-process"]')?.textContent?.trim(),
            questions: Array.from(
                item.querySelectorAll('[data-test="interview-question"]')
            ).map(q => q.textContent?.trim()),
            date: item.querySelector('[data-test="interview-date"]')?.textContent?.trim(),
        }));
    });

    await Dataset.pushData(interviews.map(i => ({
        ...i, sourceUrl: request.url, scrapedAt: new Date().toISOString()
    })));

    log.info('Got ' + interviews.length + ' interviews');
}

await crawler.run(startUrls);
await Actor.exit();
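A run of this Actor is driven by an input JSON whose fields match the destructuring at the top of the script. The example URL is illustrative:

```json
{
    "companyUrl": "https://www.glassdoor.com/Overview/Working-at-Google-EI_IE9079.htm",
    "dataTypes": ["salaries", "reviews"],
    "maxPages": 5
}
```

Omitted fields fall back to the defaults in the script: all three data types and ten pages per section.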

Data Analysis Examples

Once you've collected the data, here are practical analyses you can perform:

Salary Benchmarking

function benchmarkSalaries(salaryData, targetRole) {
    const roleData = salaryData.filter(s =>
        s.jobTitle.toLowerCase().includes(targetRole.toLowerCase())
    );

    if (roleData.length === 0) return null;

    const basePays = roleData
        .map(s => parseCurrency(s.basePay))
        .filter(v => v > 0)
        .sort((a, b) => a - b);

    return {
        role: targetRole,
        sampleSize: basePays.length,
        percentile25: basePays[Math.floor(basePays.length * 0.25)],
        median: basePays[Math.floor(basePays.length * 0.5)],
        percentile75: basePays[Math.floor(basePays.length * 0.75)],
        average: basePays.reduce((a, b) => a + b, 0) / basePays.length,
    };
}

function parseCurrency(str) {
    if (!str) return 0;
    return parseInt(str.replace(/[$,K]/gi, '')) * (str.toLowerCase().includes('k') ? 1000 : 1);
}
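A quick sanity check of the currency parsing, with `parseCurrency` repeated so the snippet runs standalone. Note it handles single values like "$120K"; range strings such as "$100K - $150K" would need to be split first:

```javascript
// Repeated from the benchmarking snippet above, for a self-contained check.
function parseCurrency(str) {
    if (!str) return 0;
    return parseInt(str.replace(/[$,K]/gi, '')) * (str.toLowerCase().includes('k') ? 1000 : 1);
}

console.log(parseCurrency('$120K'));   // -> 120000
console.log(parseCurrency('$95,500')); // -> 95500
console.log(parseCurrency(''));        // -> 0
```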

Sentiment Analysis on Reviews

function analyzeReviewSentiment(reviews) {
    const ratings = reviews.map(r => parseFloat(r.rating));
    const avgRating = ratings.reduce((a, b) => a + b, 0) / ratings.length;

    const trends = {};
    reviews.forEach(review => {
        const month = review.date?.substring(0, 7);
        if (month) {
            if (!trends[month]) trends[month] = { sum: 0, count: 0 };
            trends[month].sum += parseFloat(review.rating);
            trends[month].count++;
        }
    });

    const monthlyAvg = Object.entries(trends)
        .sort(([a], [b]) => a.localeCompare(b))
        .map(([month, data]) => ({
            month,
            averageRating: (data.sum / data.count).toFixed(2),
            reviewCount: data.count
        }));

    const prosWords = extractKeyPhrases(reviews.map(r => r.pros).filter(Boolean));
    const consWords = extractKeyPhrases(reviews.map(r => r.cons).filter(Boolean));

    return {
        averageRating: avgRating.toFixed(2),
        totalReviews: reviews.length,
        monthlyTrend: monthlyAvg,
        topProsThemes: prosWords.slice(0, 10),
        topConsThemes: consWords.slice(0, 10)
    };
}

function extractKeyPhrases(texts) {
    const wordCount = {};
    const stopWords = new Set([
        'the', 'a', 'an', 'is', 'are', 'was', 'and', 'or', 'but',
        'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'very',
        'that', 'this', 'it', 'not', 'no', 'can', 'you', 'your',
        'they', 'we', 'i'
    ]);

    texts.forEach(text => {
        const words = text.toLowerCase()
            .replace(/[^a-z\s]/g, '')
            .split(/\s+/)
            .filter(w => w.length > 3 && !stopWords.has(w));
        words.forEach(w => { wordCount[w] = (wordCount[w] || 0) + 1; });
    });

    return Object.entries(wordCount)
        .sort(([, a], [, b]) => b - a)
        .map(([word, count]) => ({ word, count }));
}
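The monthly-trend grouping above assumes review dates can be sliced to a "YYYY-MM" key, which only works for ISO-style "YYYY-MM-DD" strings; Glassdoor renders dates in other formats, so normalize them first. A compact self-contained check of the grouping logic:

```javascript
// Group ratings by month, assuming ISO "YYYY-MM-DD" dates (an assumption --
// normalize scraped date strings to this format before grouping).
const recentReviews = [
    { rating: '4.0', date: '2024-01-15' },
    { rating: '2.0', date: '2024-01-20' },
    { rating: '5.0', date: '2024-02-03' },
];

const trends = {};
for (const r of recentReviews) {
    const month = r.date.substring(0, 7);
    trends[month] ??= { sum: 0, count: 0 };
    trends[month].sum += parseFloat(r.rating);
    trends[month].count++;
}

console.log(trends['2024-01']); // -> { sum: 6, count: 2 }, i.e. a 3.0 average
```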

Best Practices and Ethical Guidelines

Rate Limiting is Critical

Glassdoor actively monitors for automated access. Always implement conservative rate limits:

const crawler = new PuppeteerCrawler({
    maxConcurrency: 1,
    maxRequestsPerMinute: 8,
    requestHandlerTimeoutSecs: 120,
    navigationTimeoutSecs: 60,
});

Session Management

Glassdoor requires login for some data. Use persistent sessions:

async function setupSession(page) {
    // Use a realistic viewport and user agent so the session looks like a normal browser.
    await page.setViewport({ width: 1920, height: 1080 });

    await page.setUserAgent(
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    );

    // Random warm-up delay. page.waitForTimeout was removed in newer Puppeteer
    // versions, so use a plain Promise-based sleep instead.
    await new Promise(resolve => setTimeout(resolve, 2000 + Math.random() * 3000));
}

Proxy Rotation

Using residential proxies is highly recommended for Glassdoor:

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});

Data Privacy Compliance

  • Never scrape individual user profiles or personal information
  • Aggregate salary data rather than storing individual reports
  • Comply with GDPR if processing data from EU users
  • Use the data for legitimate research and benchmarking purposes
  • Do not republish raw review text without proper attribution

Handling CAPTCHAs

Glassdoor may present CAPTCHAs during heavy scraping. Strategies to minimize this:

  1. Low request rates: Stay under 10 requests per minute
  2. Residential proxies: Appear as regular users
  3. Session persistence: Reuse browser sessions
  4. Human-like behavior: Add random delays and mouse movements
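Points 1 and 4 combine naturally: a jittered delay between navigations keeps the request rate both low and irregular. A small illustrative helper (the 6-12 second window is an example, not a Glassdoor-documented threshold):

```javascript
// Random delay between requests: 6-12 s keeps a session well under
// 10 requests per minute while avoiding a machine-regular cadence.
function jitteredDelayMs(minMs = 6000, maxMs = 12000) {
    return minMs + Math.floor(Math.random() * (maxMs - minMs));
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Usage between page visits inside a request handler:
// await sleep(jitteredDelayMs());
```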

Real-World Applications

For Compensation Teams

Build a dashboard that tracks salary trends for your key roles across competitors. Update weekly to stay ahead of market movements. This helps ensure your offers are competitive without overpaying.

For Recruiters

Aggregate review sentiment data to pitch candidates on culture. When a candidate mentions they care about work-life balance, having data showing your client company rates 4.2/5.0 on that dimension is powerful.

For Due Diligence

Investors increasingly use employee sentiment as a signal. A company with declining review scores and increasing cons around leadership may be heading for trouble, even if revenue looks healthy.

For Job Seekers

Build a personalized dashboard that tracks companies you're interested in. Monitor for new salary reports in your role, read recent interview experiences, and track CEO approval trends over time.

Conclusion

Glassdoor contains some of the most valuable workplace data on the internet. By using Puppeteer-based scraping with Crawlee and deploying on Apify's infrastructure, you can build reliable data pipelines that power salary benchmarking, sentiment analysis, and competitive intelligence.

Remember to always scrape responsibly: respect rate limits, use the data ethically, and comply with privacy regulations. The goal is to build sustainable data collection workflows that provide long-term value, not to overwhelm servers with aggressive scraping.

Start with a single company and data type, validate your selectors, and scale gradually. The code examples in this guide give you a solid foundation to build upon for your specific use case. Whether you're benchmarking compensation packages, monitoring company sentiment, or preparing for your next interview, programmatic access to Glassdoor data gives you an information advantage that manual browsing simply cannot match.
