How to Scrape Glassdoor Data and Extract Job Insights

Ever wonder how companies track job market trends, or how data scientists pull the most relevant job postings from sites like Glassdoor? Here's your chance to learn how to scrape Glassdoor data using Python and Playwright while working around anti-bot measures. This guide walks you through the process, with actionable steps for getting real-time job listings and crucial employment data straight from Glassdoor.

Why Playwright is Your Best Friend

Glassdoor’s anti-scraping measures are no joke. Plain HTTP requests from libraries like requests (even when paired with BeautifulSoup for parsing) often run into CAPTCHA challenges or outright IP bans. That's where Playwright comes in. It’s like giving your Python script a human disguise: Playwright drives a real browser, and with a little help from proxies, we can fly under the radar while collecting the data we need.
Let’s break this down step by step so you can start scraping Glassdoor like a pro.

Step 1: Set Up Playwright & Install Dependencies

Before diving into scraping, you’ll need to install two key libraries: Playwright (to handle browser automation) and lxml (for parsing HTML).
Here’s how you get set up:

pip install playwright lxml
playwright install

With this, you'll be ready to launch your first scraping session.

Step 2: Setting Up Your Browser with Proxies

We’re not just scraping for fun. We need to mimic a real user to avoid detection. Playwright allows us to open a real browser and connect through a proxy. This is essential when accessing Glassdoor, as it will help avoid getting flagged.
Here's the basic setup:

import asyncio

from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    async with async_playwright() as p:
        # Launch a visible (non-headless) browser and route traffic through a proxy.
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": "proxy_address", "username": "your_username", "password": "your_password"}
        )
        page = await browser.new_page()
        await page.goto('https://www.glassdoor.com/jobs/job-listing', timeout=60000)
        content = await page.content()  # fully rendered HTML
        await browser.close()
        return content

# 'await' only works inside async code, so drive the coroutine with asyncio.run().
html_content = asyncio.run(scrape_job_listings())

Here, we launch Chromium in non-headless mode, so a visible browser window opens; that’s a helpful trick for mimicking real user behavior. Make sure to replace the proxy details with your own.
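One optional hardening step: Glassdoor renders its listings with JavaScript, so page.content() can fire before the job cards exist. A minimal sketch, reusing the li[data-test="jobListing"] selector we parse in Step 3, waits for them first:

# Inside scrape_job_listings(), before reading page.content():
# wait until at least one job card has rendered (up to 30 seconds).
await page.wait_for_selector('li[data-test="jobListing"]', timeout=30000)
content = await page.content()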

Step 3: Extracting Data from Glassdoor Listings

Once the page loads, it’s time to pull the job data. With lxml, you can parse the HTML and pull out specific details like job titles, salaries, locations, and more.
Here’s how you do it:

parser = fromstring(html_content)
job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')

jobs_data = []
for element in job_posting_elements:
    # Relative XPaths scoped to each job card; [0] takes the first match.
    job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
    job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
    salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
    job_link = element.xpath('.//a[@data-test="job-title"]/@href')[0]
    easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
    # Hashed class names like this one change whenever Glassdoor redeploys;
    # prefer data-test attributes where they exist.
    company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]

    job_data = {
        'company': company,
        'job_title': job_title,
        'job_location': job_location,
        'job_link': job_link,
        'salary': salary,
        'easy_apply': easy_apply
    }
    jobs_data.append(job_data)

What’s happening here? We loop through each job posting, extract key details like the job title, salary, company name, and location, and store them in a dictionary. One caveat: the [0] indexing raises an IndexError whenever a listing is missing a field, and not every posting shows a salary or location.
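A small guard makes the extraction tolerant of missing nodes. This is just a sketch; first_or_default is a hypothetical helper, not part of lxml:

def first_or_default(element, xpath_expr, default=''):
    """Return the first XPath match (stripped), or a default when nothing matches."""
    matches = element.xpath(xpath_expr)
    return matches[0].strip() if matches else default

# Usage inside the loop, e.g.:
# job_title = first_or_default(element, './/a[@data-test="job-title"]/text()')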

Step 4: Saving Job Data as a CSV File

You’ve got the data. Now it’s time to store it. We’ll save the job listings into a CSV for easy analysis and processing. Here’s the code:

import csv

with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
    writer.writeheader()
    writer.writerows(jobs_data)

This creates a CSV file with a column for each job attribute, which you can open in Excel or load into a database.
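If you’d rather inspect the results in Python, a quick sanity check with pandas (an extra dependency, not installed in Step 1) might look like this:

import pandas as pd  # pip install pandas

df = pd.read_csv('glassdoor_job_listings.csv')
print(df.head())                                 # spot-check the first few rows
print(df['job_location'].value_counts().head())  # e.g., most common locations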

Complete Script

Here's the complete script that integrates all steps:

import asyncio
import csv

from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    async with async_playwright() as p:
        # Launch a visible browser behind a proxy to reduce the chance of blocks.
        browser = await p.chromium.launch(
            headless=False,
            proxy={"server": "proxy_address", "username": "your_username", "password": "your_password"}
        )
        page = await browser.new_page()
        await page.goto('https://www.glassdoor.com/Job/united-states-software-engineer-jobs', timeout=60000)

        content = await page.content()
        await browser.close()

        # Parse the rendered HTML and extract one record per job card.
        parser = fromstring(content)
        job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')

        jobs_data = []
        for element in job_posting_elements:
            job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
            job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
            salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
            job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
            easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
            company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]

            job_data = {
                'company': company,
                'job_title': job_title,
                'job_location': job_location,
                'job_link': job_link,
                'salary': salary,
                'easy_apply': easy_apply
            }
            jobs_data.append(job_data)

        # Write the collected records to CSV.
        with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
            writer.writeheader()
            writer.writerows(jobs_data)

asyncio.run(scrape_job_listings())

Keep It Ethical

When scraping Glassdoor, or any website, it’s crucial to follow ethical guidelines:

  1. Comply with rate limits: Don’t bombard the server with requests; add time delays between them (see the sketch after this list).
  2. Use rotating proxies: To avoid IP bans, rotate your IPs regularly.
  3. Abide by the terms of service: Scraping isn’t a free pass to violate site rules. Always read and respect them.
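A minimal sketch of the first two points, assuming a pool of proxy endpoints and a hypothetical ?p= pagination parameter (check the site's actual URL scheme before relying on it):

import asyncio
import random

from playwright.async_api import async_playwright

# Placeholder proxy pool; substitute your own endpoints and credentials.
PROXIES = [
    {"server": "proxy_address_1", "username": "your_username", "password": "your_password"},
    {"server": "proxy_address_2", "username": "your_username", "password": "your_password"},
]

async def polite_scrape(base_url, num_pages=3):
    contents = []
    async with async_playwright() as p:
        for page_num in range(1, num_pages + 1):
            # Rotate proxies by launching each page's browser with a different one.
            browser = await p.chromium.launch(headless=False, proxy=random.choice(PROXIES))
            page = await browser.new_page()
            # '?p=' is a placeholder pagination parameter, not a documented Glassdoor scheme.
            await page.goto(f"{base_url}?p={page_num}", timeout=60000)
            contents.append(await page.content())
            await browser.close()
            # Rate limit: pause a randomized 3-8 seconds between requests.
            await asyncio.sleep(random.uniform(3, 8))
    return contents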

Final Thoughts

With this guide, you now have the power to extract valuable employment data from Glassdoor. Whether you're tracking trends or analyzing job market data, you can pull insights with ease and accuracy. Unlock the data that drives decisions.
