<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vic</title>
    <description>The latest articles on DEV Community by Vic (@victorlg98).</description>
    <link>https://dev.to/victorlg98</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1123930%2Fb43eaccc-fa7e-4f7e-8a20-364142b423b3.jpeg</url>
      <title>DEV Community: Vic</title>
      <link>https://dev.to/victorlg98</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/victorlg98"/>
    <language>en</language>
    <item>
      <title>Tutorial: Web Scraping LinkedIn Jobs with Playwright.</title>
      <dc:creator>Vic</dc:creator>
      <pubDate>Sun, 30 Jul 2023 15:48:50 +0000</pubDate>
      <link>https://dev.to/victorlg98/tutorial-web-scraping-linkedin-jobs-with-playwright-2h7l</link>
      <guid>https://dev.to/victorlg98/tutorial-web-scraping-linkedin-jobs-with-playwright-2h7l</guid>
      <description>&lt;h1&gt;
  
  
  Tutorial: Web Scraping LinkedIn Jobs with Playwright.
&lt;/h1&gt;

&lt;p&gt;In this tutorial, we will explore a Python script that uses Playwright and Scrapy to scrape job listings from LinkedIn. The script will log in to LinkedIn, apply various search parameters, and scrape job details. We'll then save the scraped data to a CSV file.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Overview&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Python script is designed to perform the following tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log in to LinkedIn using provided credentials.&lt;/li&gt;
&lt;li&gt;Load search parameters from a YAML configuration file.&lt;/li&gt;
&lt;li&gt;Scrape job listings based on the search parameters.&lt;/li&gt;
&lt;li&gt;Save the scraped job data to a CSV file.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Requirements&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before proceeding with the tutorial, make sure you have the following installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;playwright-sync&lt;/code&gt;&lt;/strong&gt; library (&lt;strong&gt;&lt;code&gt;playwright&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;playwright-sync&lt;/code&gt;&lt;/strong&gt; should be installed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;scrapy&lt;/code&gt;&lt;/strong&gt; library&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pandas&lt;/code&gt;&lt;/strong&gt; library&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;rich&lt;/code&gt;&lt;/strong&gt; library&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;click&lt;/code&gt;&lt;/strong&gt; library&lt;/li&gt;
&lt;li&gt;YAML file with LinkedIn login credentials and search parameters (config.yaml)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;em&gt;How to Use the Script&lt;/em&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Prepare the YAML Configuration File:

&lt;ul&gt;
&lt;li&gt;Create a YAML file (e.g., &lt;strong&gt;&lt;code&gt;config.yaml&lt;/code&gt;&lt;/strong&gt;) with the following structure:
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;YOUR_LINKEDIN_EMAIL&lt;/span&gt;
&lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;YOUR_LINKEDIN_PASSWORD&lt;/span&gt;
&lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;KEYWORD_SEARCH_1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;keywords&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KEYWORD_1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;KEYWORD_2"&lt;/span&gt;
      &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CITY_NAME"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;KEYWORD_SEARCH_2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;keywords&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KEYWORD_3"&lt;/span&gt;
      &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CITY_NAME"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;strong&gt;&lt;code&gt;YOUR_LINKEDIN_EMAIL&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;YOUR_LINKEDIN_PASSWORD&lt;/code&gt;&lt;/strong&gt; with your LinkedIn credentials. Add additional &lt;strong&gt;&lt;code&gt;params&lt;/code&gt;&lt;/strong&gt; entries to define different job searches. Each &lt;strong&gt;&lt;code&gt;params&lt;/code&gt;&lt;/strong&gt; entry should have a unique key (e.g., &lt;strong&gt;&lt;code&gt;KEYWORD_SEARCH_1&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;KEYWORD_SEARCH_2&lt;/code&gt;&lt;/strong&gt;) and include the &lt;strong&gt;&lt;code&gt;keywords&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;location&lt;/code&gt;&lt;/strong&gt; for the search.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a Terminal or Command Prompt:

&lt;ul&gt;
&lt;li&gt;Navigate to the directory where the script and the &lt;strong&gt;&lt;code&gt;config.yaml&lt;/code&gt;&lt;/strong&gt; file are located.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Run the Script:

&lt;ul&gt;
&lt;li&gt;To execute the script, use the following command:
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python your_script_name.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Replace &lt;strong&gt;&lt;code&gt;your_script_name.py&lt;/code&gt;&lt;/strong&gt; with the actual name of the Python script containing the provided code.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Command-Line Options:&lt;br&gt;
The script supports several command-line options that you can pass when running the script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-config&lt;/code&gt;&lt;/strong&gt;: Specifies the path to the YAML config file. If not provided, it defaults to &lt;strong&gt;&lt;code&gt;'config.yaml'&lt;/code&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-headless/--no-headless&lt;/code&gt;&lt;/strong&gt;: Specifies whether to run the browser in headless mode or not. By default, it runs in headless mode (&lt;strong&gt;&lt;code&gt;-headless&lt;/code&gt;&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-last24h&lt;/code&gt;&lt;/strong&gt;: An optional flag that, if provided, instructs the script to filter jobs posted within the last 24 hours only.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's an example of how you can use command-line options:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python your_script_name.py &lt;span class="nt"&gt;--config&lt;/span&gt; path/to/custom_config.yaml &lt;span class="nt"&gt;--no-headless&lt;/span&gt; &lt;span class="nt"&gt;--last24h&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;In this example, we specified a custom config file (&lt;strong&gt;&lt;code&gt;path/to/custom_config.yaml&lt;/code&gt;&lt;/strong&gt;), set the browser to run in non-headless mode (&lt;strong&gt;&lt;code&gt;--no-headless&lt;/code&gt;&lt;/strong&gt;), and enabled the filter for jobs posted within the last 24 hours (&lt;strong&gt;&lt;code&gt;--last24h&lt;/code&gt;&lt;/strong&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Import Required Libraries&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let's start by importing the required libraries in the Python script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;yaml&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlencode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urljoin&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;playwright.sync_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sync_playwright&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;scrapy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Selector&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;rich.logging&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RichHandler&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;click&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Configure Logging&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The script uses the &lt;strong&gt;&lt;code&gt;logging&lt;/code&gt;&lt;/strong&gt; module for better log outputs. It also uses the &lt;strong&gt;&lt;code&gt;rich&lt;/code&gt;&lt;/strong&gt; library to enhance the logging with colors and additional formatting. The logging configuration is done as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'%(asctime)s - %(levelname)s - %(message)s'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rich_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RichHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rich_tracebacks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;handlers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rich_handler&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Define Dataclass for Job Details&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The script uses a &lt;strong&gt;&lt;code&gt;dataclass&lt;/code&gt;&lt;/strong&gt; named &lt;strong&gt;&lt;code&gt;Job&lt;/code&gt;&lt;/strong&gt; to store information about each job listing. The &lt;strong&gt;&lt;code&gt;Job&lt;/code&gt;&lt;/strong&gt; dataclass contains the following attributes: &lt;strong&gt;&lt;code&gt;url&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;job_title&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;job_id&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;company_name&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;company_image&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;job_location&lt;/code&gt;&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Job&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;job_title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;company_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;company_image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;job_location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Define Global Variables&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The script uses a global variable &lt;strong&gt;&lt;code&gt;PAGE_NUMBER&lt;/code&gt;&lt;/strong&gt; to keep track of the current page number while scraping job listings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;PAGE_NUMBER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Implement LinkedIn Login&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Next, we implement the &lt;strong&gt;&lt;code&gt;login_to_linkedin&lt;/code&gt;&lt;/strong&gt; function, which logs in to LinkedIn using provided credentials. If the script runs in headless mode and encounters a captcha challenge, it will abort the process. The function accepts the &lt;strong&gt;&lt;code&gt;page&lt;/code&gt;&lt;/strong&gt; object (Playwright page object), &lt;strong&gt;&lt;code&gt;email&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;password&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;headless&lt;/code&gt;&lt;/strong&gt; boolean parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;login_to_linkedin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Go to the LinkedIn login page
&lt;/span&gt;    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://www.linkedin.com/uas/login"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait_for_load_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'load'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fill in the login credentials and click the login button
&lt;/span&gt;    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_by_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Email o teléfono"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_by_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Email o teléfono"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_by_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Contraseña"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_by_label&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Contraseña"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"#organic-div form"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get_by_role&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"button"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Iniciar sesión"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait_for_load_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'load'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;"checkpoint/challenge"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Captcha page! Human intervention is needed!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Polling loop to check if captcha is solved
&lt;/span&gt;        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s"&gt;"checkpoint/challenge"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Captcha solved. Continuing with the rest of the process."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait_for_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Wait for 2 seconds before polling again
&lt;/span&gt;        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait_for_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Captcha page! Aborting due to headless mode..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 6: Scrape Jobs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;scrape_jobs&lt;/code&gt;&lt;/strong&gt; function is responsible for scraping job listings based on provided search parameters. It accepts the &lt;strong&gt;&lt;code&gt;page&lt;/code&gt;&lt;/strong&gt; object (Playwright page object), &lt;strong&gt;&lt;code&gt;params&lt;/code&gt;&lt;/strong&gt; (search parameters), and &lt;strong&gt;&lt;code&gt;last24h&lt;/code&gt;&lt;/strong&gt; (boolean flag to scrape only jobs from the last 24 hours).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_jobs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last24h&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;PAGE_NUMBER&lt;/span&gt;
    &lt;span class="n"&gt;main_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://www.linkedin.com/jobs/"&lt;/span&gt;

    &lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'https://www.linkedin.com/jobs/search/'&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;urlencode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# List to store job data
&lt;/span&gt;    &lt;span class="n"&gt;job_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Go to the search results page
&lt;/span&gt;    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait_for_load_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'load'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply the "last 24 hours" filter if required
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;last24h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_by_role&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"button"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Filtro «Fecha de publicación». Al hacer clic en este botón, se muestran todas las opciones del filtro «Fecha de publicación»."&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;has_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"Últimas 24 horas Filtrar por «Últimas 24 horas»"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s"&gt;"Aplicar el filtro actual para mostrar (\d+\+?) resultados"&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_by_role&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"button"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Loop through the job listings on the page and scrape details
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"div.jobs-search-results-list"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mouse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wheel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wait_for_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ul.scaffold-layout__list-container li.ember-view"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;job_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;urljoin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"a::attr(href)"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"a::attr(href)"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;job_title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"a::attr(aria-label)"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="n"&gt;job_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"::attr(data-occludable-job-id)"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="n"&gt;company_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"img ::attr(alt)"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;::])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"img ::attr(alt)"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;company_image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"img ::attr(src)"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="n"&gt;job_location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".job-card-container__metadata-item ::text"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="s"&gt;".job-card-container__metadata-item ::text"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;job_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;job_info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Scraped job: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Check if there is a "Next" button and click it to move to the next page
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;PAGE_NUMBER&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_by_role&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"button"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Página &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PAGE_NUMBER&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 7: Defining the Command-line Interface and Main Function&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The script uses &lt;strong&gt;&lt;code&gt;click&lt;/code&gt;&lt;/strong&gt; to create a command-line interface for the scraping process. The main function is named &lt;strong&gt;&lt;code&gt;main&lt;/code&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--config'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'config.yaml'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Path to the YAML config file'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--headless/--no-headless'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Run the browser in headless mode or not'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;click&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'--last24h'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_flag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Make the browser go for last 24h jobs only'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last24h&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Load the YAML file with the list of search parameters
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'config.yaml'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;safe_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"email"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"password"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;params_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Start the browser
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;sync_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;locale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"es-ES"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Changed the locale to English
&lt;/span&gt;
        &lt;span class="c1"&gt;# Login to LinkedIn once
&lt;/span&gt;        &lt;span class="n"&gt;login_to_linkedin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;all_jobs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;params_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Crawl starting... Params: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;jobs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scrape_jobs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last24h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;all_jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Create a DataFrame from the combined job_list
&lt;/span&gt;        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__dict__&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_jobs&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="c1"&gt;# Save DataFrame to a CSV file
&lt;/span&gt;        &lt;span class="n"&gt;csv_file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'jobs_data.csv'&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv_file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Log the number of jobs scraped and saved
&lt;/span&gt;        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;"Scraped &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_jobs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; jobs and saved to jobs_data.csv"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you liked it follow me for more content! -&amp;gt; &lt;a href="https://linktr.ee/victorlg"&gt;Linktree&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can find the code here -&amp;gt; &lt;a href="https://gist.github.com/VictorLG98/627c263aef47994b628476db017d5c66"&gt;Gist&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scraping Real State Website</title>
      <dc:creator>Vic</dc:creator>
      <pubDate>Sun, 23 Jul 2023 11:51:55 +0000</pubDate>
      <link>https://dev.to/victorlg98/scraping-real-state-website-4lc3</link>
      <guid>https://dev.to/victorlg98/scraping-real-state-website-4lc3</guid>
      <description>&lt;p&gt;&lt;a href="https://linktr.ee/victorlg"&gt;linktree&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This Python script uses the Scrapy, requests, and price_parser libraries to scrape a website that lists properties for sale. It extracts details about each property such as price, title, address, number of baths and rooms, area, owner info, owner url, and coordinates (latitude, longitude).&lt;/p&gt;

&lt;h2&gt;
  
  
  Libraries
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scrapy&lt;/strong&gt;: An open-source web-crawling framework for Python.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;requests&lt;/strong&gt;: A library to send all kinds of HTTP requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;price_parser&lt;/strong&gt;: A library to extract price and currency from raw text strings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's dissect this script step-by-step:&lt;/p&gt;

&lt;h2&gt;
  
  
  Import Libraries
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from scrapy import Selector
import requests
from urllib.parse import urljoin
from price_parser import Price
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The above lines import the necessary Python libraries for the script.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting the Initial Variables
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = requests.get("https://www.pisos.com/venta/pisos-cedeira/")
sel = Selector(response)

home_url = "https://www.pisos.com"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The script sends a GET request to the URL of the website and uses the Selectorclass from Scrapy to create an object that can be used for parsing the HTML.&lt;/p&gt;

&lt;h2&gt;
  
  
  Number Filtering Function
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def number_filtering(number):
    if type(number) == int:
        return number
    if type(number) == float:
        return(round(number))
    if type(number) == str:
        number = Price.fromstring(number)
        number = number.amount
        if number is None:
            return None
        try:
            return int(number)
        except Exception:
            return float(number)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This function converts string-based numbers into their integer or float representations. If the input is already an integer or a float, it returns the input as it is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Text Between Substrings Function
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_text_between(full_string, start_substring, end_substring):
    start = full_string.find(start_substring) + len(start_substring)
    end = full_string.find(end_substring, start)
    return "" if start == -1 or end == -1 else full_string[start:end]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This function takes three arguments: the full string and two substrings. It finds the text located between the two substrings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Latitude and Longitude Function
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_lat_lon(response):
    selector = Selector(response)
    lat = get_text_between(selector.css("script[type='text/javascript'] ::text").get(), "_Lat = ", ";")
    lon = get_text_between(selector.css("script[type='text/javascript'] ::text").get(), "_Long = ", ";")
    return lat, lon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This function extracts the latitude and longitude values from the JavaScript included in the page's HTML.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parse Ad Function
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_ad(ad_response):
    ...
    print(f"Price: {price}")
    print(f"Title: {title}")
    print(f"Address: {address}")
    print(f"N_baths: {n_baths}")
    print(f"N_rooms: {n_rooms}")
    print(f"Area: {area}")
    print(f"Owner info: {owner_info}")
    print(f"Owner url: {owner_url}")
    print(f"Description: {description}")
    print(f"Source id: {source_id}")
    print(f"Latitude: {lat}")
    print(f"Longitude: {lon}")
    print("=============================================================")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This function parses the HTML of an ad and prints out the data about the property. It extracts the price, title, address, number of baths and rooms, area, owner info, owner url, description, source id, and coordinates (latitude, longitude) from the ad's HTML.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parse All Ads
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;all_ads = sel.css("div.ad-preview")
for ad in all_ads:
    url = ad.css("a::attr(href)").get()
    ad_response = requests.get(urljoin(home_url, url))
    parse_ad(ad_response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Finally, the script iterates over all ad preview divs, sends a request to each ad's URL, and then parses the response with the &lt;em&gt;parse_ad()&lt;/em&gt; function.&lt;/p&gt;

&lt;p&gt;​&lt;/p&gt;

&lt;p&gt;Full code -&amp;gt; &lt;a href="https://gist.github.com/VictorLG98/994874841e52213cf20e7c2a91ee781a"&gt;https://gist.github.com/VictorLG98/994874841e52213cf20e7c2a91ee781a&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Video on my Youtube -&amp;gt; &lt;a href="https://linktr.ee/victorlg"&gt;linktree&lt;/a&gt;&lt;/p&gt;

</description>
      <category>scraping</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Web Scraping CLI tool for scanning websites</title>
      <dc:creator>Vic</dc:creator>
      <pubDate>Fri, 21 Jul 2023 10:09:08 +0000</pubDate>
      <link>https://dev.to/victorlg98/web-scraping-cli-tool-for-scanning-websites-59em</link>
      <guid>https://dev.to/victorlg98/web-scraping-cli-tool-for-scanning-websites-59em</guid>
      <description>&lt;h1&gt;
  
  
  Web Scraping CLI tool for scanning websites
&lt;/h1&gt;

&lt;p&gt;This Python script is a web scraper built using several Python libraries, including click, requests, beautifulsoup4, chompjs, and rich.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Import Libraries&lt;/strong&gt; First, the script starts by importing the necessary libraries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Define Main Function:&lt;/strong&gt; &lt;strong&gt;scrape&lt;/strong&gt; The scrape function is the main function of the script, which is decorated with click annotations to allow command-line arguments and options. It accepts a URL to scrape, a flag to save the result, and a filename for saving the result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instantiate Console&lt;/strong&gt; Inside the scrape function, a rich Console object is created, which provides more aesthetically pleasing console output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time the Request&lt;/strong&gt; The time it takes to make the request to the provided URL is calculated. This can be useful for performance considerations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parse HTML&lt;/strong&gt; The response from the request is parsed using the BeautifulSoupclass from the beautifulsoup4 library. This enables easy access and manipulation of the webpage's HTML.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Print Request Information&lt;/strong&gt; Information about the request is printed to the console, including the original URL requested, the domain, the final URL after any redirects, the response time, the response size, the status code, the response headers, the request headers, and any cookies set by the server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Print HTML Code&lt;/strong&gt; The parsed HTML is then pretty-printed to the console using the rich library's Syntax class, unless the --save flag was used when calling the script.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find JavaScript Objects&lt;/strong&gt; The script then attempts to parse any JavaScript objects found in script tags on the page. This is done using the get_js_objects function, which is defined later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Save HTML&lt;/strong&gt; If the --save flag was used when calling the script, the HTML of the webpage is written to a file with the provided filename.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Define&lt;/strong&gt; &lt;strong&gt;get_js_objectsFunction&lt;/strong&gt; The get_js_objects function is used to find and parse JavaScript objects in script tags of the webpage. This function is used in the scrape function to extract any JavaScript data on the webpage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run the script&lt;/strong&gt; Finally, if the script is run as the main file, the scrape function is called with the command-line arguments and options.&lt;/p&gt;

&lt;p&gt;Here's the code in full:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import click
import requests
from bs4 import BeautifulSoup
from chompjs import parse_js_object
from rich.console import Console
from rich.syntax import Syntax
import time
from urllib.parse import urlparse

@click.command()
@click.argument('url')
@click.option('--save', is_flag=True, help='Save results to a file')
@click.option('--filename', default='results.html', help='Specify the filename')
def scrape(url, save, filename):
    console = Console()

    start_time = time.time()
    response = requests.get(url, allow_redirects=True)
    end_time = time.time()

    response_time = end_time - start_time
    response_size = len(response.content)

    soup = BeautifulSoup(response.content, 'html.parser')

    console.print(f"URL Requested: {url}", style="bold green")
    console.print(f"Domain: {urlparse(url).netloc}", style="bold green")
    console.print(f"Final URL: {response.url}", style="bold green")
    console.print(f"Response time: {response_time} seconds", style="bold green")
    console.print(f"Response size: {response_size} bytes", style="bold green")
    console.print(f"Status code: {response.status_code}", style="bold green")
    console.print(f"Response headers: {response.headers}", style="bold green")
    console.print(f"Request headers: {response.request.headers}", style="bold green")
    console.print(f"Cookies: {response.cookies}", style="bold green")

    syntax = Syntax(soup.prettify(), "html", theme="monokai", line_numbers=True)
    if not save:
        console.print("HTML code: ", style="bold red")   
        console.print(syntax)

    get_content_sources, failed = get_js_objects(response)
    console.print("JavaScript data sources found: ", style="bold red")
    console.print(get_content_sources)

    console.print("JavaScript data sources failed: ", style="bold red")
    console.print(failed)

    if save:
        with open(filename, 'w') as f:
            f.write(soup.prettify())

def get_js_objects(response: requests.models.Response) -&amp;gt; list:
    script_tags = BeautifulSoup(response.content, 'html.parser').find_all('script')
    all_data_sources = []
    failed = []
    for script in script_tags:
        if script.string:
            try:
                all_data_sources.append(parse_js_object(script.string))
            except Exception:
                failed.append(script.string)

    return all_data_sources, failed

if __name__ == '__main__':
    scrape()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you found this useful, don't forget to follow me on my social networks! &lt;a href="https://linktr.ee/victorlg"&gt;Linktree&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>scrapy</category>
      <category>webscraping</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
