DEV Community: Vic

Tutorial: Web Scraping LinkedIn Jobs with Playwright.

Vic — Sun, 30 Jul 2023 15:48:50 +0000


In this tutorial, we will explore a Python script that uses Playwright and Scrapy to scrape job listings from LinkedIn. The script will log in to LinkedIn, apply various search parameters, and scrape job details. We'll then save the scraped data to a CSV file.

Overview

The Python script is designed to perform the following tasks:

Log in to LinkedIn using provided credentials.
Load search parameters from a YAML configuration file.
Scrape job listings based on the search parameters.
Save the scraped job data to a CSV file.

Requirements

Before proceeding with the tutorial, make sure you have the following installed:

Python 3.x
playwright library (the synchronous API used here ships with it; run "playwright install" once to download the browsers)
pyyaml library (provides the yaml module)
scrapy library
pandas library
rich library
click library
YAML file with LinkedIn login credentials and search parameters (config.yaml)

How to Use the Script

Prepare the YAML Configuration File:
- Create a YAML file (e.g., config.yaml) with the following structure:

email: YOUR_LINKEDIN_EMAIL
password: YOUR_LINKEDIN_PASSWORD
params:
  - KEYWORD_SEARCH_1:
      keywords: "KEYWORD_1 KEYWORD_2"
      location: "CITY_NAME"
  - KEYWORD_SEARCH_2:
      keywords: "KEYWORD_3"
      location: "CITY_NAME"

Replace YOUR_LINKEDIN_EMAIL and YOUR_LINKEDIN_PASSWORD with your LinkedIn credentials. Add additional params entries to define different job searches. Each params entry should have a unique key (e.g., KEYWORD_SEARCH_1, KEYWORD_SEARCH_2) and include the keywords and location for the search.
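
With this structure, each params entry is parsed as a single-key mapping whose value holds keywords and location. Since urlencode expects a flat mapping, the inner dictionary is what has to be encoded. The following is a minimal sketch of that step, assuming the structure above; build_search_urls is a hypothetical helper that is not part of the original script, and the values are the placeholders from the config.

```python
from urllib.parse import urlencode

# Parsed form of the `params` list from config.yaml (placeholder values).
params_list = [
    {"KEYWORD_SEARCH_1": {"keywords": "KEYWORD_1 KEYWORD_2", "location": "CITY_NAME"}},
    {"KEYWORD_SEARCH_2": {"keywords": "KEYWORD_3", "location": "CITY_NAME"}},
]

BASE_URL = "https://www.linkedin.com/jobs/search/"


def build_search_urls(entries):
    """Return {search_name: url}, one URL per named search (hypothetical helper)."""
    urls = {}
    for entry in entries:
        for name, query in entry.items():
            # urlencode needs the flat keywords/location mapping,
            # not the outer single-key wrapper.
            urls[name] = f"{BASE_URL}?{urlencode(query)}"
    return urls


search_urls = build_search_urls(params_list)
for name, url in search_urls.items():
    print(name, url)
```

Keeping the search name as the dictionary key makes it easy to log which search a URL belongs to.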

Open a Terminal or Command Prompt:
- Navigate to the directory where the script and the config.yaml file are located.
Run the Script:
- To execute the script, use the following command:

python your_script_name.py

Replace your_script_name.py with the actual name of the Python script containing the provided code.
Command-Line Options:
The script supports several command-line options that you can pass when running the script:
- --config: Specifies the path to the YAML config file. If not provided, it defaults to 'config.yaml'.
- --headless/--no-headless: Controls whether the browser runs in headless mode. It runs headless by default (--headless).
- --last24h: An optional flag that restricts the results to jobs posted within the last 24 hours.
Here's an example of how you can use command-line options:
```
python your_script_name.py --config path/to/custom_config.yaml --no-headless --last24h
```
In this example, we specified a custom config file (path/to/custom_config.yaml), set the browser to run in non-headless mode (--no-headless), and enabled the filter for jobs posted within the last 24 hours (--last24h).

Step 1: Import Required Libraries

Let's start by importing the required libraries in the Python script:

import yaml
from urllib.parse import urlencode, urljoin
from playwright.sync_api import sync_playwright
from scrapy import Selector
from dataclasses import dataclass
import pandas as pd
import logging
import re
from rich.logging import RichHandler
import click
import sys

Step 2: Configure Logging

The script uses the logging module for better log outputs. It also uses the rich library to enhance the logging with colors and additional formatting. The logging configuration is done as follows:

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)
rich_handler = RichHandler(rich_tracebacks=True)
logging.getLogger().handlers = [rich_handler]

Step 3: Define Dataclass for Job Details

The script uses a dataclass named Job to store information about each job listing. The Job dataclass contains the following attributes: url, job_title, job_id, company_name, company_image, and job_location.

@dataclass
class Job:
    url: str
    job_title: str
    job_id: int
    company_name: str
    company_image: str
    job_location: str
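
To illustrate how a dataclass instance maps onto a CSV row, here is a small sketch that uses only the standard library. It swaps in the csv module in place of pandas purely to stay dependency-free, and all field values are invented sample data.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Job:
    url: str
    job_title: str
    job_id: int
    company_name: str
    company_image: str
    job_location: str


# Invented sample values, only for illustration.
sample = Job(
    url="https://www.linkedin.com/jobs/view/1234567890/",
    job_title="Data Engineer",
    job_id=1234567890,
    company_name="Example Corp",
    company_image="https://example.com/logo.png",
    job_location="Madrid",
)

# The dataclass field names become the CSV header, in declaration order.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Job)])
writer.writeheader()
writer.writerow(asdict(sample))
csv_text = buffer.getvalue()
print(csv_text)
```

asdict returns a plain dictionary per job, which is also a shape that pandas.DataFrame accepts as a list of records.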

Step 4: Define Global Variables

The script uses a global variable PAGE_NUMBER to keep track of the current page number while scraping job listings.

PAGE_NUMBER = 1

Step 5: Implement LinkedIn Login

Next, we implement the login_to_linkedin function, which logs in to LinkedIn using provided credentials. If the script runs in headless mode and encounters a captcha challenge, it will abort the process. The function accepts the page object (Playwright page object), email, password, and headless boolean parameter.

def login_to_linkedin(page, email, password, headless):
    # Go to the LinkedIn login page
    page.goto("https://www.linkedin.com/uas/login")
    page.wait_for_load_state('load')

    # Fill in the login credentials and click the login button
    page.get_by_label("Email o teléfono").click()
    page.get_by_label("Email o teléfono").fill(email)
    page.get_by_label("Contraseña").click()
    page.get_by_label("Contraseña").fill(password)
    page.locator("#organic-div form").get_by_role("button", name="Iniciar sesión").click()
    page.wait_for_load_state('load')

    # Only react when LinkedIn actually redirects to a captcha challenge
    if "checkpoint/challenge" in page.url:
        if headless:
            logger.error("Captcha page! Aborting due to headless mode...")
            sys.exit(1)
        logger.warning("Captcha page! Human intervention is needed!")
        # Poll until the captcha is solved and the URL leaves the challenge page
        while "checkpoint/challenge" in page.url:
            page.wait_for_timeout(2000)  # Wait for 2 seconds before polling again
        logger.info("Captcha solved. Continuing with the rest of the process.")
        page.wait_for_timeout(5000)
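
The captcha handling described for this step comes down to a three-way decision: no challenge, a challenge a human can solve, or a challenge in headless mode. A minimal sketch of that decision as a pure function follows; captcha_action is a hypothetical name, and the URLs are made-up examples.

```python
def captcha_action(current_url: str, headless: bool) -> str:
    """Return 'continue', 'wait_for_human', or 'abort' for the post-login URL."""
    if "checkpoint/challenge" not in current_url:
        return "continue"  # login went through, nothing to do
    # A challenge can only be solved when a visible browser window exists.
    return "abort" if headless else "wait_for_human"


print(captcha_action("https://www.linkedin.com/feed/", headless=True))
print(captcha_action("https://www.linkedin.com/checkpoint/challenge/abc", headless=False))
print(captcha_action("https://www.linkedin.com/checkpoint/challenge/abc", headless=True))
```

Separating the decision from the browser calls makes this branch easy to check without opening LinkedIn at all.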

Step 6: Scrape Jobs

The scrape_jobs function is responsible for scraping job listings based on provided search parameters. It accepts the page object (Playwright page object), params (search parameters), and last24h (boolean flag to scrape only jobs from the last 24 hours).

def scrape_jobs(page, params, last24h):
    global PAGE_NUMBER
    main_url = "https://www.linkedin.com/jobs/"

    base_url = 'https://www.linkedin.com/jobs/search/'
    url = f'{base_url}?{urlencode(params)}'

    # List to store job data
    job_list = []

    # Go to the search results page
    page.goto(url)
    page.wait_for_load_state('load')

    # Apply the "last 24 hours" filter if required
    if last24h:
        page.get_by_role("button", name="Filtro «Fecha de publicación». Al hacer clic en este botón, se muestran todas las opciones del filtro «Fecha de publicación».").click()
        page.locator("label").filter(has_text="Últimas 24 horas Filtrar por «Últimas 24 horas»").click()
        pattern = r"Aplicar el filtro actual para mostrar (\d+\+?) resultados"
        page.get_by_role("button", name=re.compile(pattern, re.IGNORECASE)).click()

    # Loop through the job listings on the page and scrape details
    while True:
        page.locator("div.jobs-search-results-list").click()
        for _ in range(15):
            page.mouse.wheel(0, 250)
        page.wait_for_timeout(3000)
        response = Selector(text=page.content())

        jobs = response.css("ul.scaffold-layout__list-container li.ember-view")
        for job in jobs:
            job_info = Job(
                url=urljoin(main_url, job.css("a::attr(href)").get()) if job.css("a::attr(href)").get() else None,
                job_title=job.css("a::attr(aria-label)").get(),
                job_id=job.css("::attr(data-occludable-job-id)").get(),
                company_name=" ".join(job.css("img ::attr(alt)").get().split(" ")[2::]) if job.css("img ::attr(alt)").get() else None,
                company_image=job.css("img ::attr(src)").get(),
                job_location=" ".join(job.css(".job-card-container__metadata-item ::text").getall()) if job.css(
                    ".job-card-container__metadata-item ::text").get() else None
            )
            job_list.append(job_info)
            logger.info(f"Scraped job: {job_info.job_title}")

        # Check if there is a "Next" button and click it to move to the next page
        try:
            PAGE_NUMBER += 1
            page.get_by_role("button", name=f"Página {PAGE_NUMBER}",
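
The listing above breaks off in the middle of the pagination step. As a rough sketch of how such a loop could end, assuming the crawl stops as soon as the button for the next page cannot be clicked: iterate_pages and go_to_page are hypothetical names, and the lambda stands in for the browser so the example runs on its own.

```python
from typing import Callable, Iterator


def iterate_pages(go_to_page: Callable[[int], bool], max_pages: int = 40) -> Iterator[int]:
    """Yield page numbers from 1 until go_to_page reports that no further page exists.

    go_to_page is a hypothetical callback: it should try to open page N
    (for example by clicking the "Página N" button) and return False on failure.
    """
    page_number = 1  # local counter, so every search starts again at page 1
    yield page_number
    while page_number < max_pages:
        page_number += 1
        if not go_to_page(page_number):
            break
        yield page_number


# Stand-in for the browser: pretend the search has exactly three result pages.
visited = list(iterate_pages(lambda n: n <= 3))
print(visited)
```

With Playwright, the callback could wrap the button click in a try/except and return False when the click times out. Using a local counter in place of the global PAGE_NUMBER also avoids carrying the page number over from one search to the next.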

Step 7: Defining the Command-line Interface and Main Function

The script uses click to create a command-line interface for the scraping process. The main function is named main:

@click.command()
@click.option('--config', type=click.Path(exists=True), default='config.yaml', help='Path to the YAML config file')
@click.option('--headless/--no-headless', default=True, help='Run the browser in headless mode or not')
@click.option('--last24h', is_flag=True, default=False, help='Make the browser go for last 24h jobs only')
def main(config, headless, last24h):
    # Load the YAML file with the list of search parameters
    with open(config, 'r') as f:  # use the path passed via --config
        data = yaml.safe_load(f)

    email = data.get("email")
    password = data.get("password")
    params_list = data.get("params")

    # Start the browser
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        page = browser.new_page(locale="es-ES")  # Spanish locale, matching the Spanish labels used in the locators

        # Login to LinkedIn once
        login_to_linkedin(page, email, password, headless)

        all_jobs = []
        for params in params_list:
            logger.info(f"Crawl starting... Params: {params}")
            jobs = scrape_jobs(page, params, last24h)
            all_jobs.extend(jobs)

        # Create a DataFrame from the combined job_list
        df = pd.DataFrame([job.__dict__ for job in all_jobs])

        # Save DataFrame to a CSV file
        csv_file_path = 'jobs_data.csv'
        df.to_csv(csv_file_path, index=False)

        # Log the number of jobs scraped and saved
        logger.info(f"Scraped {len(all_jobs)} jobs and saved to jobs_data.csv")

        browser.close()

if __name__ == '__main__':
    main()

If you liked it follow me for more content! -> Linktree

You can find the code here -> Gist

Scraping a Real Estate Website

Vic — Sun, 23 Jul 2023 11:51:55 +0000

linktree

This Python script uses the Scrapy, requests, and price_parser libraries to scrape a website that lists properties for sale. It extracts details about each property such as price, title, address, number of baths and rooms, area, owner info, owner url, and coordinates (latitude, longitude).

Libraries

Scrapy: An open-source web-crawling framework for Python.
requests: A library to send all kinds of HTTP requests.
price_parser: A library to extract price and currency from raw text strings.

Let's dissect this script step-by-step:

Import Libraries

from scrapy import Selector
import requests
from urllib.parse import urljoin
from price_parser import Price

The above lines import the necessary Python libraries for the script.

Setting the Initial Variables

response = requests.get("https://www.pisos.com/venta/pisos-cedeira/")
sel = Selector(response)

home_url = "https://www.pisos.com"

The script sends a GET request to the URL of the website and uses the Selector class from Scrapy to create an object that can be used for parsing the HTML.

Number Filtering Function

def number_filtering(number):
    if type(number) == int:
        return number
    if type(number) == float:
        return(round(number))
    if type(number) == str:
        number = Price.fromstring(number)
        number = number.amount
        if number is None:
            return None
        try:
            return int(number)
        except Exception:
            return float(number)

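This function normalizes numeric values. Integers are returned unchanged and floats are rounded to the nearest integer. Strings are parsed with price_parser, and the extracted amount is converted to an integer (falling back to a float if that conversion fails). If no amount can be extracted, the function returns None.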

Get Text Between Substrings Function

def get_text_between(full_string, start_substring, end_substring):
    # Locate the start marker first; find() returns -1 when it is missing
    start = full_string.find(start_substring)
    if start == -1:
        return ""
    start += len(start_substring)
    end = full_string.find(end_substring, start)
    return "" if end == -1 else full_string[start:end]

This function takes three arguments: the full string and two substrings. It finds the text located between the two substrings.

Get Latitude and Longitude Function

def get_lat_lon(response):
    selector = Selector(response)
    lat = get_text_between(selector.css("script[type='text/javascript'] ::text").get(), "_Lat = ", ";")
    lon = get_text_between(selector.css("script[type='text/javascript'] ::text").get(), "_Long = ", ";")
    return lat, lon

This function extracts the latitude and longitude values from the JavaScript included in the page's HTML.
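
As an illustration of the substring approach, the sketch below runs it on a made-up script body. The sample text and coordinates are assumptions for the example, not content taken from the real page, and the helper is repeated here only so the snippet runs on its own.

```python
def get_text_between(full_string, start_substring, end_substring):
    """Return the text between two markers, or "" if either marker is missing."""
    start = full_string.find(start_substring)
    if start == -1:
        return ""
    start += len(start_substring)
    end = full_string.find(end_substring, start)
    return "" if end == -1 else full_string[start:end]


# Invented script content in the shape the markers "_Lat = " and "_Long = " suggest.
script_text = "var _Lat = 43.6601; var _Long = -8.0534;"

lat = get_text_between(script_text, "_Lat = ", ";")
lon = get_text_between(script_text, "_Long = ", ";")
missing = get_text_between(script_text, "_Zoom = ", ";")

print(lat, lon, repr(missing))
```

The extracted values are still strings; converting them with float() is a reasonable next step once the markers are known to be present.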

Parse Ad Function

def parse_ad(ad_response):
    ...
    print(f"Price: {price}")
    print(f"Title: {title}")
    print(f"Address: {address}")
    print(f"N_baths: {n_baths}")
    print(f"N_rooms: {n_rooms}")
    print(f"Area: {area}")
    print(f"Owner info: {owner_info}")
    print(f"Owner url: {owner_url}")
    print(f"Description: {description}")
    print(f"Source id: {source_id}")
    print(f"Latitude: {lat}")
    print(f"Longitude: {lon}")
    print("=============================================================")

This function parses the HTML of an ad and prints out the data about the property. It extracts the price, title, address, number of baths and rooms, area, owner info, owner url, description, source id, and coordinates (latitude, longitude) from the ad's HTML.

Parse All Ads

all_ads = sel.css("div.ad-preview")
for ad in all_ads:
    url = ad.css("a::attr(href)").get()
    ad_response = requests.get(urljoin(home_url, url))
    parse_ad(ad_response)

Finally, the script iterates over all ad preview divs, sends a request to each ad's URL, and then parses the response with the parse_ad() function.

Full code -> https://gist.github.com/VictorLG98/994874841e52213cf20e7c2a91ee781a

Video on my Youtube -> linktree

Web Scraping CLI tool for scanning websites

Vic — Fri, 21 Jul 2023 10:09:08 +0000


This Python script is a web scraper built using several Python libraries, including click, requests, beautifulsoup4, chompjs, and rich.

Import Libraries: The script starts by importing the necessary libraries.

Define Main Function (scrape): The scrape function is the main function of the script. It is decorated with click decorators so that it accepts command-line arguments and options: a URL to scrape, a flag to save the result, and a filename for saving the result.

Instantiate Console: Inside the scrape function, a rich Console object is created, which provides colored, styled console output.

Time the Request: The script measures how long the request to the provided URL takes, which is useful when comparing how quickly different sites respond.

Parse HTML: The response from the request is parsed using the BeautifulSoup class from the beautifulsoup4 library. This enables easy access to and manipulation of the webpage's HTML.

Print Request Information: Information about the request is printed to the console, including the original URL requested, the domain, the final URL after any redirects, the response time, the response size, the status code, the response headers, the request headers, and any cookies set by the server.

Print HTML Code: The parsed HTML is then pretty-printed to the console using the rich library's Syntax class, unless the --save flag was used when calling the script.

Find JavaScript Objects: The script then attempts to parse any JavaScript objects found in script tags on the page. This is done using the get_js_objects function, which is defined later.

Save HTML: If the --save flag was used when calling the script, the HTML of the webpage is written to a file with the provided filename.

Define get_js_objects Function: The get_js_objects function finds and parses JavaScript objects in the script tags of the webpage. It returns two lists: the objects that were parsed successfully and the script contents that could not be parsed.

Run the Script: Finally, if the script is run as the main file, the scrape function is called with the command-line arguments and options.

Here's the code in full:

import click
import requests
from bs4 import BeautifulSoup
from chompjs import parse_js_object
from rich.console import Console
from rich.syntax import Syntax
import time
from urllib.parse import urlparse

@click.command()
@click.argument('url')
@click.option('--save', is_flag=True, help='Save results to a file')
@click.option('--filename', default='results.html', help='Specify the filename')
def scrape(url, save, filename):
    console = Console()

    start_time = time.time()
    response = requests.get(url, allow_redirects=True)
    end_time = time.time()

    response_time = end_time - start_time
    response_size = len(response.content)

    soup = BeautifulSoup(response.content, 'html.parser')

    console.print(f"URL Requested: {url}", style="bold green")
    console.print(f"Domain: {urlparse(url).netloc}", style="bold green")
    console.print(f"Final URL: {response.url}", style="bold green")
    console.print(f"Response time: {response_time} seconds", style="bold green")
    console.print(f"Response size: {response_size} bytes", style="bold green")
    console.print(f"Status code: {response.status_code}", style="bold green")
    console.print(f"Response headers: {response.headers}", style="bold green")
    console.print(f"Request headers: {response.request.headers}", style="bold green")
    console.print(f"Cookies: {response.cookies}", style="bold green")

    syntax = Syntax(soup.prettify(), "html", theme="monokai", line_numbers=True)
    if not save:
        console.print("HTML code: ", style="bold red")   
        console.print(syntax)

    get_content_sources, failed = get_js_objects(response)
    console.print("JavaScript data sources found: ", style="bold red")
    console.print(get_content_sources)

    console.print("JavaScript data sources failed: ", style="bold red")
    console.print(failed)

    if save:
        with open(filename, 'w') as f:
            f.write(soup.prettify())

def get_js_objects(response: requests.models.Response) -> tuple[list, list]:
    script_tags = BeautifulSoup(response.content, 'html.parser').find_all('script')
    all_data_sources = []
    failed = []
    for script in script_tags:
        if script.string:
            try:
                all_data_sources.append(parse_js_object(script.string))
            except Exception:
                failed.append(script.string)

    return all_data_sources, failed

if __name__ == '__main__':
    scrape()
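
The timing and domain steps can also be tried in isolation. The sketch below is one possible variant and not the script's own code: it uses time.perf_counter, a monotonic clock suited to measuring intervals, where the script uses time.time. The timed helper and fake_fetch are hypothetical names, with fake_fetch standing in for requests.get so that no network access is needed.

```python
import time
from urllib.parse import urlparse


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in for requests.get so the example runs without network access.
def fake_fetch(url):
    time.sleep(0.01)
    return f"<html>{url}</html>"


target = "https://example.com/page?x=1"
body, elapsed = timed(fake_fetch, target)
domain = urlparse(target).netloc

print(f"Domain: {domain}")
print(f"Response time: {elapsed:.3f} seconds")
```

Formatting the elapsed time with a fixed number of decimals keeps the console output easier to read than the raw float.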

If you found this useful, don't forget to follow me on my social networks! Linktree