Tutorial: Web Scraping LinkedIn Jobs with Playwright
In this tutorial, we will walk through a Python script that uses Playwright for browser automation and Scrapy's Selector for HTML parsing to scrape job listings from LinkedIn. The script logs in to LinkedIn, applies various search parameters, and scrapes job details. We'll then save the scraped data to a CSV file.
Overview
The Python script is designed to perform the following tasks:
- Log in to LinkedIn using provided credentials.
- Load search parameters from a YAML configuration file.
- Scrape job listings based on the search parameters.
- Save the scraped job data to a CSV file.
Requirements
Before proceeding with the tutorial, make sure you have the following installed:
- Python 3.x
- playwright library (the synchronous API used here, playwright.sync_api, ships with the playwright package)
- scrapy library (used only for its Selector class)
- pandas library
- rich library
- click library
- A YAML file with LinkedIn login credentials and search parameters (config.yaml)
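If any of these are missing, they can be installed with pip (the yaml module comes from the pyyaml package) and the Playwright browser binaries fetched with its CLI, since the script launches Chromium:
pip install playwright scrapy pandas rich click pyyaml
playwright install chromium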
How to Use the Script
- Prepare the YAML Configuration File:
Create a YAML file (e.g., config.yaml) with the following structure:
email: YOUR_LINKEDIN_EMAIL
password: YOUR_LINKEDIN_PASSWORD
params:
  - KEYWORD_SEARCH_1:
      keywords: "KEYWORD_1 KEYWORD_2"
      location: "CITY_NAME"
  - KEYWORD_SEARCH_2:
      keywords: "KEYWORD_3"
      location: "CITY_NAME"
Replace YOUR_LINKEDIN_EMAIL and YOUR_LINKEDIN_PASSWORD with your LinkedIn credentials. Add additional params entries to define different job searches. Each params entry should have a unique key (e.g., KEYWORD_SEARCH_1, KEYWORD_SEARCH_2) and include the keywords and location for the search (see the sketch below for how these values end up in a LinkedIn search URL).
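Here is a minimal sketch (not part of the script itself) of how a keywords/location pair becomes a LinkedIn search URL via urlencode, assuming those are the only query parameters used:
from urllib.parse import urlencode

# Illustrative values for one search entry
search_params = {"keywords": "python developer", "location": "Madrid"}

base_url = "https://www.linkedin.com/jobs/search/"
url = f"{base_url}?{urlencode(search_params)}"
print(url)
# https://www.linkedin.com/jobs/search/?keywords=python+developer&location=Madrid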
- Open a Terminal or Command Prompt:
Navigate to the directory where the script and the config.yaml file are located.
- Run the Script:
To execute the script, use the following command:
python your_script_name.py
Replace your_script_name.py with the actual name of the Python script containing the provided code.
- Command-Line Options:
The script supports several command-line options that you can pass when running the script:
- --config: Specifies the path to the YAML config file. If not provided, it defaults to 'config.yaml'.
- --headless/--no-headless: Specifies whether to run the browser in headless mode or not. By default, it runs in headless mode (--headless).
- --last24h: An optional flag that, if provided, instructs the script to filter jobs posted within the last 24 hours only.
Here's an example of how you can use command-line options:
python your_script_name.py --config path/to/custom_config.yaml --no-headless --last24h
In this example, we specified a custom config file (path/to/custom_config.yaml), set the browser to run in non-headless mode (--no-headless), and enabled the filter for jobs posted within the last 24 hours (--last24h).
Step 1: Import Required Libraries
Let's start by importing the required libraries in the Python script:
import yaml
from urllib.parse import urlencode, urljoin
from playwright.sync_api import sync_playwright
from scrapy import Selector
from dataclasses import dataclass
import pandas as pd
import logging
import re
from rich.logging import RichHandler
import click
import sys
Step 2: Configure Logging
The script uses the logging module for better log output. It also uses the rich library to enhance the logging with colors and additional formatting. The logging configuration is done as follows:
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
rich_handler = RichHandler(rich_tracebacks=True)
logging.getLogger().handlers = [rich_handler]
Step 3: Define Dataclass for Job Details
The script uses a dataclass named Job to store information about each job listing. The Job dataclass contains the following attributes: url, job_title, job_id, company_name, company_image, and job_location.
@dataclass
class Job:
    url: str
    job_title: str
    job_id: int
    company_name: str
    company_image: str
    job_location: str
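Just to illustrate with made-up values, each Job instance is a simple data container, and its __dict__ view is what the script later feeds into a pandas DataFrame:
job = Job(
    url="https://www.linkedin.com/jobs/view/123456/",  # hypothetical example values
    job_title="Data Engineer",
    job_id=123456,
    company_name="Example Corp",
    company_image="https://example.com/logo.png",
    job_location="Madrid, Spain (Hybrid)",
)
print(job.__dict__)
# {'url': 'https://www.linkedin.com/jobs/view/123456/', 'job_title': 'Data Engineer', 'job_id': 123456, ...}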
Step 4: Define Global Variables
The script uses a global variable PAGE_NUMBER
to keep track of the current page number while scraping job listings.
PAGE_NUMBER = 1
Step 5: Implement LinkedIn Login
Next, we implement the login_to_linkedin function, which logs in to LinkedIn using the provided credentials. If the script encounters a captcha challenge while running in headless mode, it aborts; in non-headless mode it waits for you to solve the captcha manually before continuing. The function accepts the page object (a Playwright page object), email, password, and a headless boolean parameter.
def login_to_linkedin(page, email, password, headless):
    # Go to the LinkedIn login page
    page.goto("https://www.linkedin.com/uas/login")
    page.wait_for_load_state('load')
    # Fill in the login credentials and click the login button
    # (the labels are in Spanish because the browser context uses the es-ES locale)
    page.get_by_label("Email o teléfono").click()
    page.get_by_label("Email o teléfono").fill(email)
    page.get_by_label("Contraseña").click()
    page.get_by_label("Contraseña").fill(password)
    page.locator("#organic-div form").get_by_role("button", name="Iniciar sesión").click()
    page.wait_for_load_state('load')
    # Handle a possible captcha challenge
    if "checkpoint/challenge" in page.url:
        if headless:
            logger.error("Captcha page! Aborting due to headless mode...")
            sys.exit(1)
        logger.warning("Captcha page! Human intervention is needed!")
        # Polling loop to check if the captcha has been solved manually
        while True:
            if "checkpoint/challenge" not in page.url:
                logger.info("Captcha solved. Continuing with the rest of the process.")
                break
            page.wait_for_timeout(2000)  # Wait for 2 seconds before polling again
        page.wait_for_timeout(5000)
Step 6: Scrape Jobs
The scrape_jobs function is responsible for scraping job listings based on the provided search parameters. It accepts the page object (a Playwright page object), params (search parameters), and last24h (a boolean flag to scrape only jobs from the last 24 hours).
def scrape_jobs(page, params, last24h):
    global PAGE_NUMBER
    main_url = "https://www.linkedin.com/jobs/"
    base_url = 'https://www.linkedin.com/jobs/search/'
    url = f'{base_url}?{urlencode(params)}'
    # List to store job data
    job_list = []
    # Go to the search results page
    page.goto(url)
    page.wait_for_load_state('load')
    # Apply the "last 24 hours" filter if required
    if last24h:
        page.get_by_role("button", name="Filtro «Fecha de publicación». Al hacer clic en este botón, se muestran todas las opciones del filtro «Fecha de publicación».").click()
        page.locator("label").filter(has_text="Últimas 24 horas Filtrar por «Últimas 24 horas»").click()
        pattern = r"Aplicar el filtro actual para mostrar (\d+\+?) resultados"
        page.get_by_role("button", name=re.compile(pattern, re.IGNORECASE)).click()
    # Loop through the job listings on the page and scrape details
    while True:
        page.locator("div.jobs-search-results-list").click()
        # Scroll down the results list so lazy-loaded job cards get rendered
        for _ in range(15):
            page.mouse.wheel(0, 250)
        page.wait_for_timeout(3000)
        response = Selector(text=page.content())
        jobs = response.css("ul.scaffold-layout__list-container li.ember-view")
        for job in jobs:
            job_info = Job(
                url=urljoin(main_url, job.css("a::attr(href)").get()) if job.css("a::attr(href)").get() else None,
                job_title=job.css("a::attr(aria-label)").get(),
                job_id=job.css("::attr(data-occludable-job-id)").get(),
                company_name=" ".join(job.css("img ::attr(alt)").get().split(" ")[2::]) if job.css("img ::attr(alt)").get() else None,
                company_image=job.css("img ::attr(src)").get(),
                job_location=" ".join(job.css(".job-card-container__metadata-item ::text").getall()) if job.css(".job-card-container__metadata-item ::text").get() else None
            )
            job_list.append(job_info)
            logger.info(f"Scraped job: {job_info.job_title}")
        # Check if there is a "Next" button and click it to move to the next page
        try:
            PAGE_NUMBER += 1
            page.get_by_role("button", name=f"Página {PAGE_NUMBER}", exact=True).click(timeout=5000)
            page.wait_for_load_state('load')
        except Exception:
            # No button for the next page, so we have reached the last page of results
            logger.info("No more result pages found.")
            break
    # Reset the page counter so the next search starts from page 1 again
    PAGE_NUMBER = 1
    return job_list
Step 7: Defining the Command-line Interface and Main Function
The script uses click to create a command-line interface for the scraping process. The main function is named main:
@click.command()
@click.option('--config', type=click.Path(exists=True), default='config.yaml', help='Path to the YAML config file')
@click.option('--headless/--no-headless', default=True, help='Run the browser in headless mode or not')
@click.option('--last24h', is_flag=True, default=False, help='Make the browser go for last 24h jobs only')
def main(config, headless, last24h):
    # Load the YAML file with the list of search parameters
    with open(config, 'r') as f:
        data = yaml.safe_load(f)
    email = data.get("email")
    password = data.get("password")
    params_list = data.get("params")
    # Start the browser
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        page = browser.new_page(locale="es-ES")  # Spanish locale, so the UI labels match the selectors above
        # Login to LinkedIn once
        login_to_linkedin(page, email, password, headless)
        all_jobs = []
        for params in params_list:
            logger.info(f"Crawl starting... Params: {params}")
            jobs = scrape_jobs(page, params, last24h)
            all_jobs.extend(jobs)
        # Create a DataFrame from the combined job list
        df = pd.DataFrame([job.__dict__ for job in all_jobs])
        # Save DataFrame to a CSV file
        csv_file_path = 'jobs_data.csv'
        df.to_csv(csv_file_path, index=False)
        # Log the number of jobs scraped and saved
        logger.info(f"Scraped {len(all_jobs)} jobs and saved to jobs_data.csv")
        browser.close()

if __name__ == '__main__':
    main()
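Once a run finishes, you can inspect the results with pandas (assuming the default jobs_data.csv output path):
import pandas as pd

df = pd.read_csv("jobs_data.csv")
print(len(df), "jobs scraped")
print(df.head())  # columns: url, job_title, job_id, company_name, company_image, job_location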
If you liked it, follow me for more content! -> Linktree
You can find the code here -> Gist