Tutorial: Web Scraping LinkedIn Jobs with Playwright.


In this tutorial, we will walk through a Python script that uses Playwright for browser automation and Scrapy's Selector for HTML parsing to scrape job listings from LinkedIn. The script logs in to LinkedIn, applies various search parameters, scrapes job details, and saves the results to a CSV file.

Overview

The Python script is designed to perform the following tasks:

  1. Log in to LinkedIn using provided credentials.
  2. Load search parameters from a YAML configuration file.
  3. Scrape job listings based on the search parameters.
  4. Save the scraped job data to a CSV file.

Requirements

Before proceeding with the tutorial, make sure you have the following installed:

  • Python 3.x
  • playwright library (the script uses its synchronous API, playwright.sync_api)
  • scrapy library (only its Selector class is used, for HTML parsing)
  • pandas library
  • rich library
  • click library
  • pyyaml library (for the yaml import)
  • YAML file with LinkedIn login credentials and search parameters (config.yaml)
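
All of the third-party packages above can be installed with pip; Playwright also needs to download its browser binaries once:

pip install playwright scrapy pandas rich click pyyaml
playwright install chromium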

How to Use the Script

  1. Prepare the YAML Configuration File:
    • Create a YAML file (e.g., config.yaml) with the following structure:
email: YOUR_LINKEDIN_EMAIL
password: YOUR_LINKEDIN_PASSWORD
params:
  - KEYWORD_SEARCH_1:
      keywords: "KEYWORD_1 KEYWORD_2"
      location: "CITY_NAME"
  - KEYWORD_SEARCH_2:
      keywords: "KEYWORD_3"
      location: "CITY_NAME"

Replace YOUR_LINKEDIN_EMAIL and YOUR_LINKEDIN_PASSWORD with your LinkedIn credentials. Add additional params entries to define different job searches. Each params entry should have a unique key (e.g., KEYWORD_SEARCH_1, KEYWORD_SEARCH_2) and include the keywords and location for the search.
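
The snippet below (a quick sketch, not part of the script itself) shows what yaml.safe_load returns for this structure: each params entry is parsed into a single-key dictionary, so the inner keywords/location mapping has to be unpacked before it can be turned into a search URL.

import yaml

with open("config.yaml") as f:
    data = yaml.safe_load(f)

# data["params"] is a list of single-key mappings, e.g.
# [{"KEYWORD_SEARCH_1": {"keywords": "KEYWORD_1 KEYWORD_2", "location": "CITY_NAME"}}, ...]
for entry in data["params"]:
    for search_name, search_params in entry.items():
        print(search_name, "->", search_params)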

  2. Open a Terminal or Command Prompt:
    • Navigate to the directory where the script and the config.yaml file are located.
  3. Run the Script:
    • To execute the script, use the following command:

python your_script_name.py

    • Replace your_script_name.py with the actual name of the Python script containing the provided code.
  4. Command-Line Options:
    The script supports several command-line options that you can pass when running the script:

    • --config: Specifies the path to the YAML config file. If not provided, it defaults to config.yaml.
    • --headless/--no-headless: Specifies whether to run the browser in headless mode. By default, it runs in headless mode (--headless).
    • --last24h: An optional flag that, if provided, instructs the script to scrape only jobs posted within the last 24 hours.

    Here's an example of how you can use command-line options:

    python your_script_name.py --config path/to/custom_config.yaml --no-headless --last24h
    

    In this example, we specified a custom config file (path/to/custom_config.yaml), set the browser to run in non-headless mode (--no-headless), and enabled the filter for jobs posted within the last 24 hours (--last24h).


Step 1: Import Required Libraries

Let's start by importing the required libraries in the Python script:

import yaml
from urllib.parse import urlencode, urljoin
from playwright.sync_api import sync_playwright
from scrapy import Selector
from dataclasses import dataclass
import pandas as pd
import logging
import re
from rich.logging import RichHandler
import click
import sys

Step 2: Configure Logging

The script uses the standard logging module for its log output, enhanced with the rich library's RichHandler for colors and richer tracebacks. The logging configuration is done as follows:

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
rich_handler = RichHandler(rich_tracebacks=True)
logging.getLogger().handlers = [rich_handler]

Step 3: Define Dataclass for Job Details

The script uses a dataclass named Job to store information about each job listing. The Job dataclass contains the following attributes: url, job_title, job_id, company_name, company_image, and job_location.

@dataclass
class Job:
    url: str
    job_title: str
    job_id: str  # scraped from an HTML attribute, so it arrives as a string
    company_name: str
    company_image: str
    job_location: str
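
Because Job is a plain dataclass, every instance converts cleanly to a dictionary via __dict__ (or dataclasses.asdict), which is what the script relies on later to build a pandas DataFrame. A quick illustration with made-up values:

job = Job(
    url="https://www.linkedin.com/jobs/view/123456",
    job_title="Data Engineer",
    job_id="123456",
    company_name="Acme Corp",
    company_image="https://example.com/logo.png",
    job_location="Madrid, Spain",
)
print(job.__dict__)
# {'url': 'https://www.linkedin.com/jobs/view/123456', 'job_title': 'Data Engineer', ...}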

Step 4: Define Global Variables

The script uses a global variable PAGE_NUMBER to keep track of the current page number while scraping job listings.

PAGE_NUMBER = 1

Step 5: Implement LinkedIn Login

Next, we implement the login_to_linkedin function, which logs in to LinkedIn using provided credentials. If the script runs in headless mode and encounters a captcha challenge, it will abort the process. The function accepts the page object (Playwright page object), email, password, and headless boolean parameter.

def login_to_linkedin(page, email, password, headless):
    # Go to the LinkedIn login page
    page.goto("https://www.linkedin.com/uas/login")
    page.wait_for_load_state('load')

    # Fill in the login credentials and click the login button.
    # The accessible labels are in Spanish because the browser context uses the es-ES locale.
    page.get_by_label("Email o teléfono").click()
    page.get_by_label("Email o teléfono").fill(email)
    page.get_by_label("Contraseña").click()
    page.get_by_label("Contraseña").fill(password)
    page.locator("#organic-div form").get_by_role("button", name="Iniciar sesión").click()
    page.wait_for_load_state('load')

    # Handle a possible captcha challenge after login
    if "checkpoint/challenge" in page.url:
        if headless:
            logger.error("Captcha page! Aborting due to headless mode...")
            sys.exit(1)
        logger.warning("Captcha page! Human intervention is needed!")
        # Polling loop to check if the captcha has been solved
        while "checkpoint/challenge" in page.url:
            page.wait_for_timeout(2000)  # Wait for 2 seconds before polling again
        logger.info("Captcha solved. Continuing with the rest of the process.")
        page.wait_for_timeout(5000)
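
If you want to test the login step in isolation before running the full scraper, a minimal sketch like the following works (the credentials are placeholders, and the browser is launched headed so you can solve a captcha manually):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page(locale="es-ES")
    login_to_linkedin(page, "you@example.com", "your_password", headless=False)
    browser.close()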

Step 6: Scrape Jobs

The scrape_jobs function is responsible for scraping job listings based on provided search parameters. It accepts the page object (Playwright page object), params (search parameters), and last24h (boolean flag to scrape only jobs from the last 24 hours).

def scrape_jobs(page, params, last24h):
    global PAGE_NUMBER
    PAGE_NUMBER = 1  # Reset pagination for each new search
    main_url = "https://www.linkedin.com/jobs/"

    base_url = 'https://www.linkedin.com/jobs/search/'
    url = f'{base_url}?{urlencode(params)}'

    # List to store job data
    job_list = []

    # Go to the search results page
    page.goto(url)
    page.wait_for_load_state('load')

    # Apply the "last 24 hours" filter if required.
    # These accessible names are in Spanish because the context uses the es-ES locale.
    if last24h:
        page.get_by_role("button", name="Filtro «Fecha de publicación». Al hacer clic en este botón, se muestran todas las opciones del filtro «Fecha de publicación».").click()
        page.locator("label").filter(has_text="Últimas 24 horas Filtrar por «Últimas 24 horas»").click()
        pattern = r"Aplicar el filtro actual para mostrar (\d+\+?) resultados"
        page.get_by_role("button", name=re.compile(pattern, re.IGNORECASE)).click()

    # Loop through the job listings on the page and scrape details
    while True:
        page.locator("div.jobs-search-results-list").click()
        for _ in range(15):
            page.mouse.wheel(0, 250)
        page.wait_for_timeout(3000)
        response = Selector(text=page.content())

        jobs = response.css("ul.scaffold-layout__list-container li.ember-view")
        for job in jobs:
            job_info = Job(
                url=urljoin(main_url, job.css("a::attr(href)").get()) if job.css("a::attr(href)").get() else None,
                job_title=job.css("a::attr(aria-label)").get(),
                job_id=job.css("::attr(data-occludable-job-id)").get(),
                company_name=" ".join(job.css("img ::attr(alt)").get().split(" ")[2::]) if job.css("img ::attr(alt)").get() else None,
                company_image=job.css("img ::attr(src)").get(),
                job_location=" ".join(job.css(".job-card-container__metadata-item ::text").getall()) if job.css(
                    ".job-card-container__metadata-item ::text").get() else None
            )
            job_list.append(job_info)
            logger.info(f"Scraped job: {job_info.job_title}")

        # Check if there is a "Next page" button and click it to move on;
        # if the button is missing, we are on the last page
        try:
            PAGE_NUMBER += 1
            page.get_by_role("button", name=f"Página {PAGE_NUMBER}", exact=True).click(timeout=5000)
            page.wait_for_load_state('load')
        except Exception:
            logger.info("No more result pages. Finishing this search.")
            break

    return job_list
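
To see what URL scrape_jobs actually visits, you can run one search's parameters through urlencode yourself:

from urllib.parse import urlencode

params = {"keywords": "python scrapy", "location": "Madrid"}
print(f"https://www.linkedin.com/jobs/search/?{urlencode(params)}")
# https://www.linkedin.com/jobs/search/?keywords=python+scrapy&location=Madrid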

Step 7: Defining the Command-line Interface and Main Function

The script uses click to create a command-line interface for the scraping process. The main function is named main:

@click.command()
@click.option('--config', type=click.Path(exists=True), default='config.yaml', help='Path to the YAML config file')
@click.option('--headless/--no-headless', default=True, help='Run the browser in headless mode or not')
@click.option('--last24h', is_flag=True, default=False, help='Make the browser go for last 24h jobs only')
def main(config, headless, last24h):
    # Load the YAML file with the list of search parameters
    with open(config, 'r') as f:
        data = yaml.safe_load(f)

    email = data.get("email")
    password = data.get("password")
    params_list = data.get("params")

    # Start the browser
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        page = browser.new_page(locale="es-ES")  # Spanish locale, matching the Spanish labels used in the selectors above

        # Login to LinkedIn once
        login_to_linkedin(page, email, password, headless)

        all_jobs = []
        for entry in params_list:
            # Each params entry is a single-key mapping: {SEARCH_NAME: {keywords, location}},
            # so unpack it to get the actual search parameters
            for search_name, search_params in entry.items():
                logger.info(f"Crawl starting... Search: {search_name}, Params: {search_params}")
                jobs = scrape_jobs(page, search_params, last24h)
                all_jobs.extend(jobs)

        # Create a DataFrame from the combined job_list
        df = pd.DataFrame([job.__dict__ for job in all_jobs])

        # Save DataFrame to a CSV file
        csv_file_path = 'jobs_data.csv'
        df.to_csv(csv_file_path, index=False)

        # Log the number of jobs scraped and saved
        logger.info(f"Scraped {len(all_jobs)} jobs and saved to jobs_data.csv")

        browser.close()

if __name__ == '__main__':
    main()
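
One optional post-processing idea (not part of the script above): if your searches overlap, the same posting can be scraped more than once, and it is easy to deduplicate by job_id when loading the CSV back:

import pandas as pd

df = pd.read_csv("jobs_data.csv")
df = df.drop_duplicates(subset="job_id")  # drop postings repeated across searches
print(f"{len(df)} unique jobs")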

If you liked it, follow me for more content! -> Linktree

You can find the code here -> Gist
