Glassdoor is a goldmine for job seekers and recruiters alike. It's a platform brimming with detailed salary data, employer reviews, and a treasure trove of job listings. But what if you need to pull that data programmatically? You’re in the right place. In this post, we’re diving into how to scrape Glassdoor data using Python and Playwright.
Why Playwright?
Well, Glassdoor's strict anti-scraping measures can shut down traditional HTTP-based scrapers almost immediately. With Playwright, we drive a real browser, add proxies and realistic headers, and simulate human behavior, which makes the traffic much harder for Glassdoor's detection systems to flag. Let’s break it down.
Setting Up Your Tools
To get started, you'll need two key libraries: Playwright and lxml (for parsing the HTML). Let’s get them installed:
pip install playwright lxml
playwright install
Step 1: Set Up Your Browser and Start Scraping
First, we’ll use Playwright to open a browser, complete with a proxy, so Glassdoor thinks you’re a real user. This is essential for bypassing those anti-scraping walls. Here's how you can do it:
import asyncio

from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,  # Make the browser visible
            proxy={'server': 'your_proxy', 'username': 'your_username', 'password': 'your_password'}
        )
        page = await browser.new_page()
        await page.goto(
            'https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm',
            timeout=60000
        )
        content = await page.content()  # Grab the rendered HTML
        await browser.close()
        return content

# Call the function to retrieve the page content.
# A bare await only works inside async code, so we use asyncio.run here.
html_content = asyncio.run(scrape_job_listings())
This code launches the browser, visits the Glassdoor job listings page, and grabs the HTML content. Make sure your proxy is working correctly—this helps you avoid being flagged as a bot.
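We mentioned headers earlier. If you want to go further than the proxy alone, Playwright lets you set a user agent and extra HTTP headers on a browser context. Here's a minimal sketch; the user-agent string and header values are purely illustrative, so substitute ones matching a real browser:

# The user-agent and header values below are example placeholders,
# not required values.
context = await browser.new_context(
    user_agent=('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                'AppleWebKit/537.36 (KHTML, like Gecko) '
                'Chrome/120.0.0.0 Safari/537.36'),
    extra_http_headers={'Accept-Language': 'en-US,en;q=0.9'},
)
page = await context.new_page()  # create pages from the context, not the browser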
Step 2: Parsing HTML to Gather Job Details
Once you’ve got the page content, it's time to dig into the data. We’ll use lxml to parse the HTML and extract key job details like the title, location, salary, and company name. Here’s how you do it:
parser = fromstring(html_content)
job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')

jobs_data = []
for element in job_posting_elements:
    # Each XPath below is relative to the current job card.
    job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
    job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
    salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
    job_link = element.xpath('.//a[@data-test="job-title"]/@href')[0]
    easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
    # This class name is auto-generated and likely to change between site
    # builds; prefer a data-test attribute if one exists.
    company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]

    job_data = {
        'company': company,
        'job_title': job_title,
        'job_location': job_location,
        'job_link': job_link,
        'salary': salary,
        'easy_apply': easy_apply
    }
    jobs_data.append(job_data)
This block extracts the job title, location, salary, company, and whether the position is an “Easy Apply” job. We’re using XPath to pull these elements directly from the HTML.
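One caveat: indexing into an XPath result with [0] raises an IndexError whenever a field is missing, and not every card shows a salary or location. A small hypothetical helper, first_or_none, makes the loop more forgiving:

def first_or_none(element, xpath):
    # Return the first XPath match (stripped), or None if the field is absent.
    matches = element.xpath(xpath)
    return matches[0].strip() if matches else None

# Example: tolerate cards without a visible location.
job_location = first_or_none(element, './/div[@data-test="emp-location"]/text()')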
Step 3: Writing Data to a CSV
Finally, after gathering all the job data, you'll want to store it for later use. The easiest way to do this is by saving it to a CSV file. Here's how you can dump your data into a CSV for analysis:
import csv

with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
    writer.writeheader()
    writer.writerows(jobs_data)
Now you’ve got a neat CSV file with all the job listings.
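If you want a quick sanity check, read the file straight back with the same csv module:

import csv

# Print each company and title from the saved file.
with open('glassdoor_job_listings.csv', newline='', encoding='utf-8') as file:
    for row in csv.DictReader(file):
        print(f"{row['company']} - {row['job_title']}")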
Putting It All Together
Here’s the complete code you can run:
import asyncio
import csv

from playwright.async_api import async_playwright
from lxml.html import fromstring

async def scrape_job_listings():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,
            proxy={'server': 'your_proxy', 'username': 'your_username', 'password': 'your_password'}
        )
        page = await browser.new_page()
        await page.goto(
            'https://www.glassdoor.com/Job/united-states-software-engineer-jobs-SRCH_IL.0,13_IN1_KO14,31.htm',
            timeout=60000
        )
        content = await page.content()
        await browser.close()

    parser = fromstring(content)
    job_posting_elements = parser.xpath('//li[@data-test="jobListing"]')

    jobs_data = []
    for element in job_posting_elements:
        job_title = element.xpath('.//a[@data-test="job-title"]/text()')[0]
        job_location = element.xpath('.//div[@data-test="emp-location"]/text()')[0]
        salary = ' '.join(element.xpath('.//div[@data-test="detailSalary"]/text()')).strip()
        job_link = "https://www.glassdoor.com" + element.xpath('.//a[@data-test="job-title"]/@href')[0]
        easy_apply = bool(element.xpath('.//div[@data-role-variant="featured"]'))
        company = element.xpath('.//span[@class="EmployerProfile_compactEmployerName__LE242"]/text()')[0]

        jobs_data.append({
            'company': company,
            'job_title': job_title,
            'job_location': job_location,
            'job_link': job_link,
            'salary': salary,
            'easy_apply': easy_apply
        })

    with open('glassdoor_job_listings.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['company', 'job_title', 'job_location', 'job_link', 'salary', 'easy_apply'])
        writer.writeheader()
        writer.writerows(jobs_data)

if __name__ == '__main__':
    asyncio.run(scrape_job_listings())
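One refinement worth considering: Glassdoor renders its listings with JavaScript, so page.content() can fire before the job cards exist. Waiting for the first card to appear (using the same selector we parse with) makes the scrape more reliable. This is standard Playwright, added just before the content grab:

# Wait until at least one job card is in the DOM before reading the HTML.
await page.wait_for_selector('li[data-test="jobListing"]', timeout=60000)
content = await page.content()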
Responsible Scraping
When scraping Glassdoor (or any site), don’t forget to respect their terms of service. Here’s how you can avoid causing trouble:
- Respect Rate Limits: Don’t bombard Glassdoor with too many requests at once. Add delays between requests (see the sketch after this list).
- Leverage Rotating Proxies: To avoid getting banned, rotate your proxies regularly.
- Comply with Terms: Always check Glassdoor's terms to ensure your actions are within their guidelines.
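Here’s a minimal sketch of the first two points together, assuming you have your own list of proxy endpoints (the PROXIES entries below are placeholders):

import asyncio
import random

from playwright.async_api import async_playwright

# Placeholder endpoints: substitute your own proxy credentials.
PROXIES = [
    {'server': 'http://proxy1:8080', 'username': 'user', 'password': 'pass'},
    {'server': 'http://proxy2:8080', 'username': 'user', 'password': 'pass'},
]

async def polite_scrape(urls):
    async with async_playwright() as p:
        for url in urls:
            # Rotate proxies: pick a different endpoint for each request.
            browser = await p.chromium.launch(proxy=random.choice(PROXIES))
            page = await browser.new_page()
            await page.goto(url, timeout=60000)
            # ... parse page.content() here ...
            await browser.close()
            # Rate limit: wait a randomized few seconds between requests.
            await asyncio.sleep(random.uniform(3, 8))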
Final Thoughts
Scraping Glassdoor can give you a wealth of data at your fingertips. But it’s important to approach it responsibly. With Playwright, proxies, and a bit of patience, you can access job listings like a pro, all while staying under the radar.