<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Percival Villalva</title>
    <description>The latest articles on DEV Community by Percival Villalva (@percivalvillal3).</description>
    <link>https://dev.to/percivalvillal3</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1026685%2Fd3c5aa18-264e-4e7a-9a39-b6ee9c8fc0d4.jpg</url>
      <title>DEV Community: Percival Villalva</title>
      <link>https://dev.to/percivalvillal3</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/percivalvillal3"/>
    <language>en</language>
    <item>
      <title>Crawlee data storage types: saving files, screenshots, and JSON results</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Mon, 27 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/crawlee-data-storage-types-saving-files-screenshots-and-json-results-j9o</link>
      <guid>https://dev.to/apify/crawlee-data-storage-types-saving-files-screenshots-and-json-results-j9o</guid>
      <description>&lt;p&gt;&lt;strong&gt;We're&lt;/strong&gt; &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;, a full-stack web scraping and browser automation platform. We are the maintainers of the open-source library&lt;/strong&gt; &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;&lt;strong&gt;Crawlee&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Managing and storing the data you collect is a crucial part of any &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; and data extraction project. It's often a complex task, especially when handling large datasets and ensuring output accuracy. Fortunately, Crawlee simplifies this process with its versatile storage types.&lt;/p&gt;

&lt;p&gt;In this article, we will look at Crawlee's storage types and demonstrate how they can make our lives easier when extracting data from the web.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting up Crawlee&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Setting up a Crawlee project is straightforward, provided you &lt;a href="https://blog.apify.com/how-to-install-nodejs/" rel="noopener noreferrer"&gt;have Node&lt;/a&gt; and npm installed. To begin, create a new Crawlee project using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx crawlee create crawlee-data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the command, you will be given a few template options to choose from. We will go with the CheerioCrawler JavaScript template. Remember, Crawlee's storage types are consistent across all crawlers, so the concepts we discuss here apply to any Crawlee crawler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z38knby90ahxbr4bsu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z38knby90ahxbr4bsu3.png" alt="Crawlee template options" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Crawlee template options&lt;/p&gt;

&lt;p&gt;Once installed, you'll find your new project in the &lt;code&gt;crawlee-data&lt;/code&gt; directory, ready with template code that scrapes the &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;crawlee.dev&lt;/a&gt; website:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08o7ya1h2bylhtofqbjl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08o7ya1h2bylhtofqbjl.png" alt="CheerioCrawler template code" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To test it, simply run &lt;code&gt;npm start&lt;/code&gt; in your terminal. You'll notice a &lt;code&gt;storage&lt;/code&gt; folder appear with subfolders like &lt;code&gt;datasets&lt;/code&gt;, &lt;code&gt;key_value_stores&lt;/code&gt;, and &lt;code&gt;request_queues&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6t97892ffkx2ji0wxjw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6t97892ffkx2ji0wxjw.png" alt="Crawlee storage" width="368" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Crawlee's storage can be divided into two categories: &lt;strong&gt;Request Storage (Request Queue and Request List)&lt;/strong&gt; and &lt;strong&gt;Results Storage (Datasets and Key-Value Stores)&lt;/strong&gt;. Both are stored locally by default in the &lt;code&gt;./storage&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;Also, remember that, by default, Crawlee purges its storages before each crawler run. This prevents old data from interfering with new crawling sessions. If you need to clear the storages at some other point, Crawlee provides a handy &lt;code&gt;purgeDefaultStorages()&lt;/code&gt; helper function for this purpose.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Crawlee request queue&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-queue" rel="noopener noreferrer"&gt;request queue&lt;/a&gt; is a storage of URLs to be crawled. It's particularly useful for deep crawling, where you start with a few URLs and then recursively follow links to other pages.&lt;/p&gt;

&lt;p&gt;Each Crawlee project run is associated with a default request queue, which is typically used to store URLs for that specific crawler run.&lt;/p&gt;

&lt;p&gt;To illustrate this, let's open the &lt;code&gt;routes.js&lt;/code&gt; file in the template we just generated. There, you'll find the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createCheerioRouter } from 'crawlee';export const router = createCheerioRouter();router.addDefaultHandler(async ({ enqueueLinks, log }) =&amp;gt; { log.info(`enqueueing new URLs`); // Add links found on page to the queue await enqueueLinks({ globs: ['https://crawlee.dev/**'], label: 'detail', });});router.addHandler('detail', async ({ request, $, log, pushData }) =&amp;gt; { const title = $('title').text(); log.info(`${title}`, { url: request.loadedUrl }); await pushData({ url: request.loadedUrl, title, });});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's take a closer look at the &lt;code&gt;addDefaultHandler&lt;/code&gt; function, particularly focusing on the &lt;code&gt;enqueueLinks&lt;/code&gt; function it contains. The &lt;code&gt;enqueueLinks&lt;/code&gt; function in Crawlee is designed to automatically detect all links on a page and add them to the request queue. However, its utility extends further as it allows us to specify certain options for more precise control over which links are added.&lt;/p&gt;

&lt;p&gt;For instance, in our example, we use the &lt;a href="https://crawlee.dev/api/core/interface/EnqueueLinksOptions#globs" rel="noopener noreferrer"&gt;&lt;strong&gt;globs&lt;/strong&gt;&lt;/a&gt; option to ensure that only links starting with &lt;code&gt;https://crawlee.dev/&lt;/code&gt; are queued. Furthermore, we assign a &lt;code&gt;detail&lt;/code&gt; &lt;a href="https://crawlee.dev/api/core/interface/EnqueueLinksOptions#label" rel="noopener noreferrer"&gt;&lt;strong&gt;label&lt;/strong&gt;&lt;/a&gt; to these links. This label lets us refer to them in subsequent handler functions, where we can define specific data extraction operations for pages associated with it.&lt;/p&gt;

&lt;p&gt;💡 See all the available options for &lt;code&gt;enqueueLinks&lt;/code&gt; in the &lt;a href="https://crawlee.dev/api/core/interface/EnqueueLinksOptions#label" rel="noopener noreferrer"&gt;Crawlee documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Returning to data storage types, we can now find all the links our crawler has navigated through in the &lt;code&gt;request_queues&lt;/code&gt; storage, located in the crawler's &lt;code&gt;./storage/request_queues&lt;/code&gt; directory. Here, we can access detailed information about each request processed by the request queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwg9im75xbf1hllzmabt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwg9im75xbf1hllzmabt.png" alt="Request Queue" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Crawlee request list&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-list" rel="noopener noreferrer"&gt;request list&lt;/a&gt; differs from the request queue as it's not a form of storage in the conventional sense. Instead, it's a predefined collection of URLs for the crawler to visit.&lt;/p&gt;

&lt;p&gt;This approach is particularly suited for situations where you have a set of known URLs to crawl and don't plan to add new ones as the crawl progresses. Essentially, the request list is set in stone once created, with no option to modify it by adding or removing URLs.&lt;/p&gt;

&lt;p&gt;To demonstrate this concept, we'll modify our template to utilize a predefined set of URLs in the request list rather than the request queue. We'll begin with adjustments to the &lt;code&gt;main.js&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;main.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { CheerioCrawler, RequestList } from 'crawlee';import { router } from './routes.js';const sources = [{ url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/docs/3.0/quick-start' }, { url: 'https://crawlee.dev/api/core' },];const requestList = await RequestList.open('my-list', sources);const crawler = new CheerioCrawler({ requestList, requestHandler: router,});await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this new approach, we created a predefined list of URLs, named &lt;code&gt;sources&lt;/code&gt;, and passed it into a newly created &lt;code&gt;RequestList&lt;/code&gt;, which was then passed into our crawler object.&lt;/p&gt;

&lt;p&gt;As for the &lt;code&gt;routes.js&lt;/code&gt; file, we simplified it to include just a single request handler. This handler is now responsible for executing the data extraction logic on the URLs specified in the request list.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;routes.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createCheerioRouter } from 'crawlee';export const router = createCheerioRouter();router.addDefaultHandler(async ({ request, $, log, pushData }) =&amp;gt; { log.info(`Extracting data...`); const title = $('title').text(); log.info(`${title}`, { url: request.loadedUrl }); await pushData({ url: request.loadedUrl, title, });});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Following these modifications, when you run your code, you'll observe that only the URLs explicitly defined in our request list are being crawled.&lt;/p&gt;

&lt;p&gt;This brings us to an important distinction between the &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-list" rel="noopener noreferrer"&gt;two types of request storages&lt;/a&gt;. The request queue is dynamic, allowing for the addition and removal of URLs as needed. On the other hand, the request list is static once initialized and is not meant for dynamic changes.&lt;/p&gt;

&lt;p&gt;With request storage out of the way, let's now explore result storage in Crawlee, starting with datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Crawlee datasets&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://crawlee.dev/api/types/interface/Dataset" rel="noopener noreferrer"&gt;Datasets&lt;/a&gt; in Crawlee serve as repositories for structured data, where every entry possesses consistent attributes.&lt;/p&gt;

&lt;p&gt;Datasets are designed for append-only operations: we can only add new records to a dataset, not alter or delete existing ones. Each project run in Crawlee is linked to a default dataset, which is commonly used to store the results of that run's web crawling activities.&lt;/p&gt;

&lt;p&gt;You might have noticed that each time we ran the crawler, the folder &lt;code&gt;./storage/datasets&lt;/code&gt; was populated with a series of JSON files containing extracted data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5awwq5f9vrp9l4rlx0un.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5awwq5f9vrp9l4rlx0un.png" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Storing scraped data into a dataset is &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-list" rel="noopener noreferrer"&gt;remarkably simple&lt;/a&gt; using Crawlee's &lt;code&gt;Dataset.pushData()&lt;/code&gt; function. Each invocation of &lt;code&gt;Dataset.pushData()&lt;/code&gt; generates a new table row, with the property names of your data serving as the column headings. By default, these rows are stored as JSON files on your disk. However, Crawlee allows you to integrate other storage systems as well.&lt;/p&gt;

&lt;p&gt;For a practical example, let's take another look at the &lt;code&gt;addDefaultHandler&lt;/code&gt; function within &lt;code&gt;routes.js&lt;/code&gt;. Here, you can see how we used the &lt;code&gt;pushData()&lt;/code&gt; function to append the scraped results to the dataset.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;routes.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;router.addDefaultHandler(async ({ request, $, log, pushData }) =&amp;gt; { log.info(`Extracting data...`); const title = $('title').text(); log.info(`${title}`, { url: request.loadedUrl }); await pushData({ url: request.loadedUrl, title, });});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Key-value store&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;key-value sto&lt;/a&gt;&lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;re in Crawlee i&lt;/a&gt;s &lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;designed for st&lt;/a&gt;oring and retrieving data records or files. Each record is tagged with a unique key and linked to a specific MIME content type. This feature makes it perfect for storing various types of data, such as screenshots, PDFs, or even for maintaining the state of crawlers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Saving screenshots&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To showcase the flexibility of the &lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;key-value store&lt;/a&gt; in Crawlee, let's take a screenshot of each page we crawl and save it using Crawlee's key-value store.&lt;/p&gt;

&lt;p&gt;However, to do that, we need to switch our crawler from CheerioCrawler to PuppeteerCrawler. The good news is that adapting our code to different crawlers is quite straightforward. For this demonstration, we'll temporarily set aside the &lt;code&gt;routes.js&lt;/code&gt; file and concentrate our crawler logic in the &lt;code&gt;main.js&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;To get started with PuppeteerCrawler, the first step is to install the Puppeteer library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install puppeteer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, adapt the code in your &lt;code&gt;main.js&lt;/code&gt; file as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;main.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { PuppeteerCrawler } from 'crawlee';// Create a PuppeteerCrawlerconst crawler = new PuppeteerCrawler({ async requestHandler({ request, saveSnapshot }) { // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Capture the screenshot await saveSnapshot({ key, saveHtml: false }); },});await crawler.addRequests([{ url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/docs/3.0/quick-start' }, { url: 'https://crawlee.dev/api/core' },]);await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the code above, we should see three screenshots, one for each page crawled, appear in our crawler's &lt;code&gt;key_value_stores&lt;/code&gt; storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq2r985ohebqewar73p9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq2r985ohebqewar73p9.png" width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Saving pages as PDF files&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Suppose we want to convert the page content into a PDF file and save it in the key-value store. This is entirely feasible with Crawlee. Thanks to Crawlee's PuppeteerCrawler being built upon Puppeteer, we can fully utilize all the native features of Puppeteer. To achieve this, we simply need to tweak our code a bit. Here's how to do it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { PuppeteerCrawler } from 'crawlee';// Create a PuppeteerCrawlerconst crawler = new PuppeteerCrawler({ async requestHandler({ page, request, saveSnapshot }) { // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Save as PDF await page.pdf({ path: `./storage/key_value_stores/default/${key}.pdf`, format: 'A4', }); },});await crawler.addRequests([{ url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/docs/3.0/quick-start' }, { url: 'https://crawlee.dev/api/core' },]);await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to the earlier screenshot example, executing this code will create three PDF files, each capturing the content of one of the visited pages. These files are saved into Crawlee's key-value store.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Doing more with your Crawlee scraper&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;That's it for an introduction to Crawlee's data storage types. As a next step, I encourage you to take your scraper to the next level by &lt;a href="https://crawlee.dev/docs/introduction/deployment" rel="noopener noreferrer"&gt;deploying it on the Apify platform as an Actor.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With your scraper running on the Apify platform, you gain access to Apify's extensive list of features tailored for web scraping jobs, like cloud storage and various data export options. Not sure how to do it? Don't worry, everything you need is in the &lt;a href="https://crawlee.dev/docs/deployment/apify-platform" rel="noopener noreferrer"&gt;Crawlee documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crawlee.dev/docs/introduction/deployment" rel="noopener noreferrer"&gt;Deploy your Crawlee scrapers on the Apify platform&lt;/a&gt;&lt;/p&gt;

</description>
      <category>crawlee</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Selenium page object model: what is POM and how can you use it?</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Tue, 26 Sep 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/percivalvillal3/selenium-page-object-model-what-is-pom-and-how-can-you-use-it-5420</link>
      <guid>https://dev.to/percivalvillal3/selenium-page-object-model-what-is-pom-and-how-can-you-use-it-5420</guid>
      <description>&lt;p&gt;As Selenium projects grow in complexity, maintaining and scaling test scripts can become challenging. This is where the Page Object Model (POM) steps in as a way for Selenium users to write more scalable and readable code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hey, we're&lt;/strong&gt; &lt;a href="https://apify.it/platform-pricing"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;. The Apify platform gives you access to 1,500+ web scraping and automation tools. Or you can build your own.&lt;/strong&gt; &lt;a href="https://apify.it/platform-pricing"&gt;&lt;strong&gt;Check us out&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is the Page Object Model (POM)?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At its core, the &lt;a href="https://www.selenium.dev/documentation/test_practices/encouraged/page_object_models/"&gt;Page Object Model (POM)&lt;/a&gt; is a design pattern used in Selenium automation to represent a web application's web pages or components as objects in code. Each web page is associated with a Page Object, and this object encapsulates the page's structure, elements, interactions, and intricacies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--87pgSI6g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/selenium-page-object-model-1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--87pgSI6g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/selenium-page-object-model-1.jpg" alt="Robot parts and tools in workshop with clocktower. Visual metaphor for page object model automated testing of web applications using Selenium and Python." width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Selenium POM is as finely balanced and intricate as clockwork&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why is POM essential for Selenium automation?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Imagine a scenario where you have a sizable Selenium test suite. Web pages change, elements get updated, and your tests require frequent adjustments. Without POM, managing this can become a nightmare. Test scripts often get cluttered with web element locators and actions, making them difficult to read and maintain. POM addresses these challenges by introducing the concept of Page Objects.&lt;/p&gt;

&lt;p&gt;Think of a Page Object as a blueprint for a web page. It contains methods and properties that allow you to interact with the page's elements (e.g., buttons, text fields, links) and perform actions (e.g., clicking, typing) on them. By creating Page Objects, you achieve a clear separation of concerns: your test scripts focus on test logic, while the Page Objects handle the web page's details.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Advantages of using POM&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintainability:&lt;/strong&gt; In large-scale automation projects, web pages often change. Elements get updated, added, or removed. Without a structured approach like POM, maintaining your test scripts becomes a nightmare. POM allows you to isolate changes to Page Objects, making updates more manageable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Readability:&lt;/strong&gt; POM promotes readable and maintainable test scripts. With Page Objects, your tests become more expressive, as you interact with elements using descriptive method names. This improves the overall clarity of your test cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reusability:&lt;/strong&gt; Page Objects are reusable components. When multiple tests interact with the same page, you can use the same Page Object in each test. If the page's structure changes, you only need to update the Page Object, not every test case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; POM scales well with the size of your automation project. As you add more test cases and pages, the structured approach provided by POM keeps your codebase organized and maintainable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting up your environment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we dive into implementing the Page Object Model (POM) in Selenium, it's crucial to ensure your development environment is properly configured. In this section, we'll cover the necessary prerequisites and guide you through creating a Python project for Selenium automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To get started with Selenium and the Page Object Model, you'll need the following:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;: Make sure you have Python installed on your system. You can download the latest version from the &lt;a href="https://www.python.org/"&gt;official Python website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Selenium&lt;/strong&gt;: Install the Selenium WebDriver library using Python's package manager, pip, by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install selenium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Creating a Selenium Project&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once you have the prerequisites in place, you can create a new Python project for your Selenium automation work by following the steps below, or &lt;a href="https://github.com/PerVillalva/selenium-pom-python"&gt;clone the GitHub repository with the final code&lt;/a&gt; for this tutorial.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create a Project Directory&lt;/strong&gt;: Create a directory in your desired location to store your Selenium project, and then navigate into that directory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Initialize a Python Virtual Environment (Optional)&lt;/strong&gt;: It's a good practice to work within a virtual environment to isolate your project's dependencies. Inside the project directory we created in the previous step, create a virtual environment using the following command:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install Selenium&lt;/strong&gt;: Inside your virtual environment, install Selenium by running the following command:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WebDrivers&lt;/strong&gt;: Selenium requires WebDriver executables for different browsers (e.g., Chrome, Firefox). You'll need to download the WebDriver for your preferred browser and ensure it's accessible from your system's PATH. You can find WebDriver downloads and installation instructions on the &lt;a href="https://www.selenium.dev/documentation/en/webdriver/driver_requirements/"&gt;official Selenium website&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create Python Files and organize your project&lt;/strong&gt;: To organize our Selenium project, we will create Python files for Page Objects, test scripts, and any additional utilities we might require. We can structure our project by creating directories to categorize these components. This will help us keep our code base clean, easy to understand, and maintainable. As an example, here is the directory structure of the project we will work on during this article:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;project_root/ page_objects/ login_page.py ... test_cases/ base_test.py test_login.py ... utils/ locators.py ... ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HmWtQRNT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HmWtQRNT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-1.png" alt="Selenium POM directory" width="608" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Great, your environment is now set up and ready for Selenium automation with the Page Object Model. In the upcoming sections, we'll take a deeper look into the practical implementation of POM, starting with creating Page Objects to represent web pages.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Creating Page Objects&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;What is a Page Object?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A Page Object is a Python class that represents a specific web page or a component of a web page. It encapsulates the structure and behavior of that page, including the web elements (e.g., buttons, input fields) and the actions you can perform on them (e.g., clicking, typing). Page Objects promote code reusability and maintainability by providing a clean and organized way to interact with web elements.&lt;/p&gt;

&lt;p&gt;So let's create our first Page Object:&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Define the Page Object class&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Create a Python class for the web page you want to represent. Give it a meaningful name, typically ending with "Page," to indicate its purpose.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pages/login_page.pyclass LoginPage(object): def __init__ (self, driver): self.driver = driver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we've created a &lt;code&gt;LoginPage&lt;/code&gt; class.&lt;/p&gt;

&lt;p&gt;Our goal will be to implement tests for a &lt;a href="https://practicetestautomation.com/practice-test-login/"&gt;dummy login page&lt;/a&gt; (thanks to &lt;a href="https://www.linkedin.com/in/dmitryshyshkin/"&gt;Dmitry Shyshkin&lt;/a&gt; for the website). We will create tests for three distinct scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Login successful&lt;/strong&gt; : User entered valid credentials.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invalid username&lt;/strong&gt; : User entered an invalid username.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invalid password&lt;/strong&gt; : User entered an invalid password.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lmzFRwAR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lmzFRwAR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-2.png" alt="Selenium POM: test login page" width="800" height="695"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Define web elements and actions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now we need a way to access the web elements and actions from within the Page Object class. To keep things organized, we'll create a separate file under the &lt;code&gt;utils&lt;/code&gt; directory to house all the locators we need:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# utils/locator.pyfrom selenium.webdriver.common.by import Byclass LoginPageLocators(object): USERNAME = (By.ID, 'username') PASSWORD = (By.ID, 'password') SUBMIT = (By.ID, 'submit') ERROR_MESSAGE = (By.ID, 'error')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Here, we've defined the locators &lt;code&gt;USERNAME&lt;/code&gt;, &lt;code&gt;PASSWORD&lt;/code&gt;, &lt;code&gt;SUBMIT&lt;/code&gt;, and &lt;code&gt;ERROR_MESSAGE&lt;/code&gt; based on the element IDs found on the target website.&lt;/p&gt;

&lt;p&gt;Once this is done, we have to import &lt;code&gt;locators.py&lt;/code&gt; and its contents into the &lt;code&gt;login_page.py&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# login_page.pyfrom utils.locators import *class LoginPage(object): def __init__ (self, driver): # Initialize the LoginPage object with a WebDriver instance. self.driver = driver # Import the locators for this page. self.locator = LoginPageLocators
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Implement methods&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Still within the &lt;code&gt;login_page.py&lt;/code&gt; file, our task is to define methods that represent the interactions we want to perform on the web page.&lt;/p&gt;

&lt;p&gt;All three previously discussed test cases involve attempting to log into an account. The login process essentially consists of entering the username and password and then clicking the "Submit" button.&lt;/p&gt;

&lt;p&gt;With these requirements in mind, we can design methods that precisely execute these actions. For example, the &lt;code&gt;enter_username&lt;/code&gt; method locates the username input field and inputs the provided username using the &lt;code&gt;send_keys&lt;/code&gt; function. The other methods in this class follow the same idea:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# login_page.pyfrom utils.locators import *from selenium.webdriver.support.ui import WebDriverWaitfrom selenium.webdriver.support import expected_conditions as EC# Define a class named LoginPage.class LoginPage(object): def __init__ (self, driver): # Initialize the LoginPage instance with a WebDriver object and locators. self.driver = driver self.locator = LoginPageLocators # Define a function to wait for the presence of an element on the page. def wait_for_element(self, element): WebDriverWait(self.driver, 5).until( EC.presence_of_element_located(element) ) # Define a function to enter a username into the corresponding input field. def enter_username(self, username): # Wait for the presence of the username input element. self.wait_for_element(self.locator.USERNAME) # Find the username input element and send the username string to it. self.driver.find_element(*self.locator.USERNAME).send_keys(username) # Define a function to enter a password into the corresponding input field. def enter_password(self, password): # Wait for the presence of the password input element. self.wait_for_element(self.locator.PASSWORD) # Find the password input element and send the password string to it. self.driver.find_element(*self.locator.PASSWORD).send_keys(password) # Define a function to click the login button. def click_login_button(self): # Wait for the presence of the login button element. self.wait_for_element(self.locator.SUBMIT) # Find the login button element and click it. self.driver.find_element(*self.locator.SUBMIT).click() # Define a function to perform a complete login by entering username and password. def login(self, username, password): self.enter_username(username) self.enter_password(password) self.click_login_button() # Define a function to perform a login with valid user credentials. def login_with_valid_user(self): self.login("student", "Password123") # Return a new instance of LoginPage after the login action. 
return LoginPage(self.driver) # Define a function to perform a login with an invalid username and return the error message. def login_with_invalid_username(self): self.login("student23", "Password123") # Wait for the presence of the error message element. self.wait_for_element(self.locator.ERROR_MESSAGE) # Return the text content of the error message element. return self.driver.find_element(*self.locator.ERROR_MESSAGE).text # Define a function to perform a login with an invalid password and return the error message. def login_with_invalid_password(self): self.login("student", "Password12345") # Wait for the presence of the error message element. self.wait_for_element(self.locator.ERROR_MESSAGE) # Return the text content of the error message element. return self.driver.find_element(*self.locator.ERROR_MESSAGE).text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You might have noticed that the last three methods are a little different. They use the high-level &lt;code&gt;login&lt;/code&gt; method we defined to perform the login action with specific username and password combinations. We will soon employ these methods to evaluate our test cases.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Writing test cases with POM&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;With the Page Object in place, we can now incorporate it into our test scripts. But first, in the interest of keeping our code modular and organized, let's create a &lt;code&gt;base_test.py&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;The purpose of this file is to serve as a repository for all the shared logic used across our tests. By centralizing this logic, we establish a convenient reference point whenever we need to generate new test files.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## base_test.pyimport unittestfrom selenium import webdriver# Define a test class named BaseTest that inherits from unittest.TestCase.class BaseTest(unittest.TestCase): # This method is called before each test case. def setUp(self): # Create a Chrome WebDriver instance. self.driver = webdriver.Chrome() # Navigate to the specified URL. self.driver.get("&amp;lt;https://practicetestautomation.com/practice-test-login/&amp;gt;") # This method is called after each test case. def tearDown(self): # Close the WebDriver, terminating the browser session. self.driver.close()# Check if this script is the main module to be executed.if __name__ == " __main__": # Run the test cases defined in this module unittest.main(verbosity=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Test case 1: Logging in with valid user credentials&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that our base test is set up, we can begin developing the logic for our login test.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## test_login.pyfrom tests.base_test import BaseTestfrom pages.login_page import LoginPage# Define a test class named TestLogin that inherits from BaseTest.class TestLogin(BaseTest): # Define the first test method, which tests login with valid user credentials. def test_login_with_valid_user(self): # Initialize a LoginPage object with the self.driver attribute login_page = LoginPage(self.driver) # Call the login_with_valid_user method on the login_page object login_page.login_with_valid_user() # Use self.assertIn to check if the string "logged-in-successfully" # is present in the current URL of the driver. If present, the test passes. self.assertIn("logged-in-successfully", self.driver.current_url)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The defined method &lt;code&gt;test_login_with_valid_user&lt;/code&gt; serves as a test for our initial scenario: logging in using valid user credentials. For the test to succeed, we should see the text "logged-in-successfully" in the URL of the webpage right after submitting our credentials. If that's the case, a positive test feedback message will be printed in our terminal.&lt;/p&gt;

&lt;p&gt;To run the method, type the following command in your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -m unittest tests.test_login.TestLogin.test_login_with_valid_user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JOBVihn6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JOBVihn6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-3.png" alt="Selenium POM: successful login screen" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1MaeEMSG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1MaeEMSG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-4.png" alt="Selenium POM: log of test result" width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Test case 2: Logging in with an invalid username&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With the method for our first test case out of the way, let's move on to the second scenario: logging in with an invalid username.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## test_login.py# ...# Define the second test method, which tests login with an invalid username. def test_login_with_invalid_username(self): # Initialize a LoginPage object with the self.driver attribute. login_page = LoginPage(self.driver) # Call the login_with_invalid_username method on the login_page object. # Assign the result to the variable result (error message). result = login_page.login_with_invalid_username() # Use self.assertIn to check if the string "Your username is invalid!" is # present in the result. If present, the test passes. self.assertIn("Your username is invalid!", result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The method &lt;code&gt;test_login_with_invalid_username&lt;/code&gt; tests for the second scenario: trying to log in using an invalid username. For the test to succeed, we should see the error message "Your username is invalid!" displayed on the screen right after clicking the Submit button. If that's the case, the test passes.&lt;/p&gt;

&lt;p&gt;To run the method, type the following command in your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m unittest tests.test_login.TestLogin.test_login_with_invalid_username
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jno28tji--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jno28tji--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-5.png" alt="Selenium POM: invalid username login screen" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Test case 3: Logging in with an invalid password&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Similar to the previous method, this one checks for a particular error message that should be displayed when the user enters a valid username together with an invalid password. The logic is almost the same, except that this time we expect a different error message.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# login_test.py# ...# Define the third test method, which tests login with an invalid password. def test_login_with_invalid_password(self): # Initialize a LoginPage object with the self.driver attribute. login_page = LoginPage(self.driver) # Call the login_with_invalid_password method on the login_page object. # Assign the result (error message) to the variable result. result = login_page.login_with_invalid_password() # Use self.assertIn to check if the string "Your password is invalid!" is # present in the result. If present, the test passes. self.assertIn("Your password is invalid!", result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The method &lt;code&gt;test_login_with_invalid_password&lt;/code&gt; tests for the third scenario: trying to log in using an invalid password. For the test to be successful, we should see the error message "Your password is invalid!" displayed on the screen immediately after clicking the "Submit" button. If this message appears, it signifies a passing test.&lt;/p&gt;

&lt;p&gt;To run the method, type the following command in your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m unittest tests.test_login.TestLogin.test_login_with_invalid_password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GALC6xSc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GALC6xSc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-6.png" alt="Selenium POM: invalid password login screen" width="800" height="786"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Running all tests&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that we have all three methods ready, we may want to execute them all together to test all of our test cases simultaneously. Here is the complete code:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tests.base_test import BaseTestfrom pages.login_page import LoginPage# Define a test class named TestLogin that inherits from BaseTest.class TestLogin(BaseTest): # Define the first test method, which tests login with valid user credentials. def test_login_with_valid_user(self): # Initialize a LoginPage object with the self.driver attribute, # which is likely a WebDriver instance for interacting with web pages. login_page = LoginPage(self.driver) # Call the login_with_valid_user method on the login_page object, # which is expected to perform a login action with valid credentials. login_page.login_with_valid_user() # Use self.assertIn to check if the string "logged-in-successfully" # is present in the current URL of the driver. If present, the test passes. self.assertIn("logged-in-successfully", self.driver.current_url) # Define the second test method, which tests login with an invalid username. def test_login_with_invalid_username(self): # Initialize a LoginPage object with the self.driver attribute. login_page = LoginPage(self.driver) # Call the login_with_invalid_username method on the login_page object. # Assign the result (likely an error message) to the variable result. result = login_page.login_with_invalid_username() # Use self.assertIn to check if the string "Your username is invalid!" is # present in the result. If present, the test passes. self.assertIn("Your username is invalid!", result) # Define the third test method, which tests login with an invalid password. def test_login_with_invalid_password(self): # Initialize a LoginPage object with the self.driver attribute. login_page = LoginPage(self.driver) # Call the login_with_invalid_password method on the login_page object. # Assign the result (likely an error message) to the variable result. result = login_page.login_with_invalid_password() # Use self.assertIn to check if the string "Your password is invalid!" is # present in the result. 
If present, the test passes. self.assertIn("Your password is invalid!", result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;To run all methods in the &lt;code&gt;TestLogin&lt;/code&gt; class at once, type the following command in your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m unittest tests.test_login.TestLogin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After a few seconds, you should see a similar message displayed on your terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IkgHw5nC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IkgHw5nC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-7.png" alt="Selenium POM: terminal with test message" width="800" height="176"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Handling page navigation and dynamic elements&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In web testing and automation, it's common to encounter scenarios where web pages have dynamic elements, or your test cases require navigation between different pages. The Page Object Model (POM) provides an organized way to handle these challenges.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Handling dynamic elements&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Dynamic elements are elements on a web page that may load or change after the initial page load. Examples include elements that appear after a delay, elements generated via JavaScript, or elements with dynamic IDs or attributes.&lt;/p&gt;

&lt;p&gt;To handle dynamic elements with POM:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Include Dynamic Elements in Page Objects&lt;/strong&gt; : In your Page Object class, include dynamic elements as attributes. You can locate these elements using Selenium locators just like any other element.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Explicit Waits&lt;/strong&gt; : To ensure that dynamic elements are fully loaded before interacting with them, use Selenium's explicit waits. Explicit waits allow you to wait for specific conditions to be met before proceeding with the test.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
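&lt;p&gt;Under the hood, an explicit wait is essentially a poll-until-condition-or-timeout loop. Here is a rough, standard-library-only sketch of the idea behind &lt;code&gt;WebDriverWait.until&lt;/code&gt; (the &lt;code&gt;wait_until&lt;/code&gt; and &lt;code&gt;element_present&lt;/code&gt; names are illustrative, not Selenium API):&lt;/p&gt;

```python
import time

def wait_until(condition, timeout=5.0, poll=0.1):
    # Poll `condition` until it returns a truthy value or `timeout`
    # seconds elapse; a rough sketch of what an explicit wait does.
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(poll)

# Simulate a dynamic element that only "appears" on the third poll.
state = {'tries': 0}

def element_present():
    state['tries'] += 1
    return 'element' if state['tries'] >= 3 else None

print(wait_until(element_present, timeout=2.0, poll=0.01))
# → element
```

&lt;p&gt;Selenium's real implementation adds richer error handling and lets the condition be any of the &lt;code&gt;expected_conditions&lt;/code&gt; helpers, but the polling principle is the same.&lt;/p&gt;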

&lt;p&gt;Here's an example of how we used an explicit wait within our login Page Object to enhance the reliability of the tests we've just created:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## login_page.pyfrom utils.locators import *from selenium.webdriver.support.ui import WebDriverWaitfrom selenium.webdriver.support import expected_conditions as ECclass LoginPage(object): def __init__ (self, driver): self.driver = driver self.locator = LoginPageLocators # Define a function to wait for the presence of an element on the page. def wait_for_element(self, element): WebDriverWait(self.driver, 5).until( EC.presence_of_element_located(element) ) def enter_username(self, username): # Wait for the presence of the username input element. self.wait_for_element(self.locator.USERNAME) self.driver.find_element(*self.locator.USERNAME).send_keys(username)# ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, the &lt;code&gt;wait_for_element&lt;/code&gt; method waits for the element to be present using an explicit wait before running the rest of the code.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Handling page navigation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With POM, you can encapsulate page navigation within Page Objects, making your test scripts more modular.&lt;/p&gt;

&lt;p&gt;To handle page navigation with POM:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt;  &lt;strong&gt;Include Navigation Methods in Page Objects&lt;/strong&gt; : Create methods within your Page Objects for navigating to other pages. For example, you can have a &lt;code&gt;go_to_dashboard&lt;/code&gt; method in a &lt;code&gt;HomePage&lt;/code&gt; Page Object.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# home_page.pyclass HomePage: # ... def go_to_dashboard(self): self.driver.find_element(*self.dashboard_link).click()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt;  &lt;strong&gt;Reuse Page Objects&lt;/strong&gt; : After navigating to a new page, you can create an instance of the corresponding Page Object to continue interacting with that page. This promotes code reusability and maintains a clear structure.&lt;/p&gt;

&lt;p&gt;Here's an example of navigating from the login page to the dashboard page using Page Objects:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import necessary Page Objectsfrom login_page import LoginPagefrom home_page import HomePage# ...# Instantiate LoginPage Page Object and perform loginlogin_page = LoginPage(driver)login_page.enter_username('your_username')login_page.enter_password('your_password')login_page.click_login_button()# Instantiate HomePage Page Object after successful loginhome_page = HomePage(driver)# Navigate to the dashboard pagehome_page.go_to_dashboard()# Create a DashboardPage Page Object to interact with the dashboarddashboard_page = DashboardPage(driver)# Perform actions on the dashboard pagedashboard_page.view_orders()dashboard_page.logout()# ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;By encapsulating page navigation and dynamic element handling within Page Objects, you maintain a structured and organized approach to your Selenium automation, making your test scripts more robust and maintainable.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Running tests and reporting&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Running your Selenium tests and generating reports are essential steps in any automation project. So far, we've been running our tests with &lt;code&gt;unittest&lt;/code&gt;. While the default test runner provides basic feedback, it's helpful to generate more informative test reports. We can achieve this by integrating test reporting libraries or frameworks.&lt;/p&gt;

&lt;p&gt;For example, we can use &lt;code&gt;pytest&lt;/code&gt; and the &lt;code&gt;pytest-html&lt;/code&gt; plugin to create basic HTML test reports for better visibility into our automation results.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Generating Basic Test Reports&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt; &lt;code&gt;pytest-html&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pytest-html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Run Tests with&lt;/strong&gt; &lt;code&gt;pytest&lt;/code&gt; and Generate HTML Report:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pytest --html=report.html test_login.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This command will run your tests and generate an HTML report named &lt;code&gt;report.html&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;View the HTML report&lt;/strong&gt; :&lt;/p&gt;

&lt;p&gt;Open the generated HTML report in a web browser to see detailed test results, including passed and failed test cases, error messages, and timestamps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Mn_P-G4w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Mn_P-G4w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-8.png" alt="" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This basic reporting setup provides a visual representation of our test execution, making it easier to identify issues and share results with our team.&lt;/p&gt;
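&lt;p&gt;If you generate reports regularly, you can also bake the flag into your pytest configuration instead of typing it on every run. A possible &lt;code&gt;pytest.ini&lt;/code&gt; at the project root (an assumed layout; the &lt;code&gt;--self-contained-html&lt;/code&gt; option inlines CSS and assets so the report is a single shareable file):&lt;/p&gt;

```ini
# pytest.ini (project root) -- assumed file; applies the report flags
# to every pytest run.
[pytest]
addopts = --html=report.html --self-contained-html
```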

&lt;p&gt;Remember that there are more advanced reporting and test management tools available that you can integrate into your automation framework for more comprehensive reporting, such as Allure, TestNG, or ExtentReports. But that's a topic for another article 😉&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Read more about Selenium&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we've explored the Page Object Model (POM) and how it can make our Selenium automation projects more scalable, readable, and, overall, more professional. But there's much more to Selenium and web automation, so check out our other Selenium posts:&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/selenium-grid-what-it-is-and-how-to-set-it-up/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--zpnCr-EI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/08/what-is-selenium-grid.jpg" height="546" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/selenium-grid-what-it-is-and-how-to-set-it-up/" rel="noopener noreferrer" class="c-link"&gt;
          Selenium Grid: what it is and how to set it up
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Learn about the Selenium Grid architecture and its use in large test suites, cross-browser testing, and continuous integration.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/how-to-handle-iframes-in-selenium/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--PsJ47qF_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w1200/2023/08/handling-iframes-in-selenium-webdriver-jigsaw-illustration.jpg" height="449" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/how-to-handle-iframes-in-selenium/" rel="noopener noreferrer" class="c-link"&gt;
          Selenium WebDriver: how to handle iframes
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Learn how to tackle iframes in Selenium WebDriver. Practical tips for switching frames and interacting with elements.
        &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  &amp;lt;div class="color-secondary fs-s flex items-center"&amp;gt;
      &amp;lt;img
        alt="favicon"
        class="c-embed__favicon m-0 mr-2 radius-0"
        src="https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png"
        loading="lazy" /&amp;gt;
    blog.apify.com
  &amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;/div&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-with-selenium-and-python/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--PA1v0SAR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w1200/2023/03/web-scraping-with-selenium-and-python.jpg" height="533" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-with-selenium-and-python/" rel="noopener noreferrer" class="c-link"&gt;
          Web scraping with Selenium and Python
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          A guide to web scraping in Selenium with code examples.
        &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  &amp;lt;div class="color-secondary fs-s flex items-center"&amp;gt;
      &amp;lt;img
        alt="favicon"
        class="c-embed__favicon m-0 mr-2 radius-0"
        src="https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png"
        loading="lazy" /&amp;gt;
    blog.apify.com
  &amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;

</description>
      <category>selenium</category>
      <category>python</category>
      <category>automation</category>
    </item>
    <item>
      <title>6 things you should know before buying or building a web scraper</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Tue, 22 Aug 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/percivalvillal3/6-things-you-should-know-before-buying-or-building-a-web-scraper-epg</link>
      <guid>https://dev.to/percivalvillal3/6-things-you-should-know-before-buying-or-building-a-web-scraper-epg</guid>
      <description>&lt;p&gt;We've been &lt;a href="https://apify.com/web-scraping"&gt;scraping the web at Apify&lt;/a&gt; for almost eight years now. We've built our &lt;a href="https://apify.com/"&gt;cloud platform&lt;/a&gt;, a popular &lt;a href="https://crawlee.dev/"&gt;open-source web scraping library&lt;/a&gt;, and hundreds of web scrapers for companies large and small. Thousands of developers from all over the world use our tech to build reliable scrapers faster, and many even sell them on &lt;a href="https://apify.com/store"&gt;Apify Store&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But we also failed countless times. We lost very large customers to dumb mistakes, and we struggled to unlock value for many of our users because of misaligned expectations. Perhaps we were naive, or just busy building a startup, but we often failed to realize that our customers were not experts in web scraping and that the things we thought obvious were very new and unexpected to them.&lt;/p&gt;

&lt;p&gt;The following 6 things you should know before buying or building a web scraper are a concentrated summary of what we've learned over the years and what we should've been telling our customers from Apify's day one.&lt;/p&gt;

&lt;p&gt;If you don't have a lot of experience with web scraping, you might find some of these things unexpected, or even shocking. But trust me, it's better to be shocked now than two months later, when your expensive scraper suddenly stops working.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Every website is different&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To the human eye, one website looks much like another, but underneath the buttons, images, and tables, they're all very different. That makes it hard to estimate how long a web scraping project will take, or how expensive it will be, before taking a thorough look at the target websites.&lt;/p&gt;

&lt;p&gt;With regular web applications, the complexity of a project is determined by your requirements and the features you need. With web scrapers, it's driven mostly by the complexity of the target website, which you have no control over. To determine the features the scraper will need to have, and also to identify potential roadblocks, developers must first analyze the website.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Common factors of web scraper complexity&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some of the most important factors that influence the cost and time to completion of a web scraping project are the following:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Anti-scraping protections&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Even though web scraping is perfectly legal, websites often try to block traffic they identify as coming from a web scraper. It's therefore essential for the scraper to appear as human-like as possible. This can be achieved using headless browsers and clever obfuscation techniques, but many of them increase the price of a project by orders of magnitude. A good initial analysis will identify the protections and provide a cost-efficient plan for overcoming them. Great web scraper developers are already familiar with most of the protections out there and they can reliably &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/"&gt;avoid being blocked&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are ready-made APIs on the market that promise to overcome almost any protection and blocking. Frankly, they're often quite good at it, and they make a lot of sense when you need results immediately, or when you're looking at low volumes. But at high volumes and for recurring use cases, your total cost of ownership will skyrocket. They're a great tool, but should not be used blindly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Architecture of the website&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some websites can be scraped very quickly and cheaply with simple HTTP requests and HTML or JSON parsing. Other websites require a &lt;a href="https://blog.apify.com/headless-browsers-what-are-they-and-how-do-they-work/"&gt;headless browser&lt;/a&gt; to access their data. Headless browser scrapers need a lot of CPU power and memory to operate, which makes them 10-20 times more expensive to run than HTTP scrapers. A great web scraper developer will always try to find a way to use an HTTP scraper by reverse engineering the website's architecture, but unfortunately, it's not always possible.&lt;/p&gt;

&lt;p&gt;Website updates and redesigns are out of your control and if an update introduces a new anti-scraping protection or changes the architecture, the costs of scraping may change dramatically. Or not at all. Sadly, you never know. The good news is that large website upgrades are fairly rare.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How fast and often do you need the data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Speed and frequency of web scraping have a profound impact on complexity. The faster you scrape, the more difficult it is to appear like a human user. Not only do you need more IP addresses, and more device fingerprints, but with super &lt;a href="https://blog.apify.com/what-is-large-scale-scraping-and-how-does-it-work/"&gt;large-scale scraping&lt;/a&gt;, you also have the non-trivial engineering overhead of managing and synchronizing tens or hundreds of concurrently running web scrapers.&lt;/p&gt;

&lt;p&gt;Finally, there's the issue of overloading the target website. &lt;a href="http://Amazon.com"&gt;Amazon.com&lt;/a&gt; can handle significantly more scraping than your local dentist's website, but in many cases, it's not easy to figure out how much traffic a website can handle safely if you want to &lt;a href="https://blog.apify.com/what-is-ethical-web-scraping-and-how-do-you-do-it/"&gt;scrape ethically&lt;/a&gt;. Great web scraper developers know how to pace their scrapers. When they make a mistake, they can quickly identify it and immediately downscale the scraping operation.&lt;/p&gt;

&lt;p&gt;Apify provides web scraper developers with sophisticated tooling that helps them analyze websites quickly, overcome anti-scraping protections, and deploy changes nearly instantly at any scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Websites change without asking&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Web scrapers are different from traditional software applications because even the best web scraper can stop working at any moment, without warning. The question is not if, but when.&lt;/p&gt;

&lt;p&gt;Web scrapers break because they are programmed to understand the structure of the websites they visit, and if the structure changes, the scraper will no longer be able to find the data it's looking for. Sometimes a human can't even spot the difference, but any time the website's HTML structure, APIs, or other components change, it can cause a scraping disruption.&lt;/p&gt;

&lt;p&gt;Professional web scraper programmers can reduce the chances of a scraper breaking, but never to zero. When a brand launches a full website redesign, the web scraper will need to be programmed again, from scratch. Reliable web scraping therefore requires &lt;a href="https://blog.apify.com/why-you-need-to-monitor-long-running-large-scale-scraping-projects/"&gt;constant monitoring of the target websites&lt;/a&gt; and of the scraper's performance.&lt;/p&gt;

&lt;p&gt;Some types of websites, like e-commerce stores or news sites, can be scraped using AI-enabled web scrapers that handle website changes automatically. But just like any AI, their results are only 80-90% correct. You have to decide if that's enough for your project or not.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How can you handle website changes?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First, think about how critical the data is for you on a scale from 1 to 3, with 1 being nice-to-have and 3 mission-critical. Think both in terms of how fast you need the data as well as about its importance. Do you need the data for a one-off data analysis? A monthly review? A real-time notification system? If you can't get the data in the expected quality, can you postpone the analysis, or will your production systems and integrations fail?&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1: Nice-to-have data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;That's data you can either wait for or don't mind missing. For those projects, it's best to simply accept as a fact that the websites can change before your next scrape and that you might have to fix the scraper. Regular maintenance often doesn't make sense, because it adds extra recurring costs that could exceed the price of a new scraper. The best course of action is to check whether the scraper still works, and if it doesn't, order an update or fix it yourself. This might take a few hours, days, or weeks, depending on the complexity of the scraper and how much money you want to spend.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2: Business-critical data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is data that's important for your project, but not having it at the right time and in the right quality won't threaten the existence of the project itself. For example, data for a weekly competitive pricing analysis. You won't be able to price as efficiently without it, but your business will continue. Most web scraping projects fall into this category. Here the best practice is to set up a monitoring system around the scrapers and to have developers ready to start fixing issues right away.&lt;/p&gt;

&lt;p&gt;The monitoring system must serve two functions. First, it must notify you in real time when something is wrong. Some scrapers can run for hours or days, and you don't want to wait until the end of the scrape to learn that all your data is useless. Second, it must provide detailed information about what is wrong: which pages failed to be scraped, how many items are invalid, which data points are missing, and so on.&lt;/p&gt;
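
&lt;p&gt;As a rough illustration, a health check covering both functions can be as small as this. The thresholds and field names here are made up for the example, not part of any real monitoring API:&lt;/p&gt;

```python
# Minimal sketch of a post-run health check: returns an empty list for a
# healthy run, or human-readable problem descriptions for the alert.
def check_run(stats, min_items=1000, max_invalid_ratio=0.05):
    problems = []
    if stats["items"] < min_items:
        problems.append(
            f"only {stats['items']} items scraped (expected at least {min_items})"
        )
    invalid_ratio = stats["invalid"] / max(stats["items"], 1)
    if invalid_ratio > max_invalid_ratio:
        problems.append(f"{invalid_ratio:.0%} of items are invalid")
    return problems  # empty list means the run looks healthy

print(check_run({"items": 1200, "invalid": 12}))  # []
print(check_run({"items": 400, "invalid": 80}))   # two problems reported
```

&lt;p&gt;In practice you'd run a check like this during the scrape, not only at the end, and wire the non-empty result into whatever alerting channel your team watches.&lt;/p&gt;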

&lt;p&gt;A robust monitoring system gives developers an early warning and valuable information to fix the web scraper as soon as possible. Still, if you don't have developers at hand to start debugging right away, the monitoring itself can't save you. It doesn't matter if you source your developers in-house or from a vendor, but ideally, you should have a dedicated developer (or a team) ready to jump in within a matter of hours or a day. Any experienced developer will do, but if you want a quick and reliable fix, you need a developer who's familiar with your scraper and the website it's scraping. Otherwise, they'll spend most of their time learning how the scraper works, which dramatically increases the cost of the update.&lt;/p&gt;

&lt;p&gt;A good monitoring system and dedicated developers come at a price (internal or external), but from our experience, they are necessary to ensure the reliable operation of business-critical web scrapers. It's similar to an insurance policy: a small regular payment, instead of risking a high-cost incident down the line. Don't worry though. You definitely don't need one full-time dev per scraper. Just someone who knows the project and can jump in at short notice to do a few hours of work.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3: Mission-critical data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This category includes data that your business simply can't live without, but also high-frequency analytics data with strict timing requirements. For example, data that needs to be scraped every day and delivered the same day before 5 a.m. When scraping the data usually takes 2 hours, the scraper starts at midnight, and your monitoring system reports that the scraper is broken, who's going to fix it between 3 a.m. and 5 a.m.?&lt;/p&gt;

&lt;p&gt;The best practice for high-quality mission-critical data is regular monitoring and maintenance of the scrapers, proactive testing, and performance analytics. Many issues with web scrapers can be caught early by regular health checks. Those are small scrapes that test various features of the target website. Depending on the requirements of the project, they can run every day, hour, or minute, and will give developers the earliest possible warning about issues they need to investigate.&lt;/p&gt;

&lt;p&gt;Apify includes a robust monitoring system with all plans. You can monitor scraper run times, costs, numbers of results, and many other metrics, and get notified immediately when your thresholds are missed. For customers of our &lt;a href="https://apify.com/enterprise"&gt;professional services&lt;/a&gt;, Apify offers SLAs with monitoring, dedicated developer capacity, and guaranteed uptime.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Small changes in web scraper specifications can cause dramatic changes in cost&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This applies to all software development, but maybe even more so to web scrapers, because the architecture of the scraper is largely dictated by the architecture of the target website, and not by your software architects. Let's illustrate this with an example.&lt;/p&gt;

&lt;p&gt;You want to build a web scraper that collects product information such as name, price, description, and stock availability from an e-commerce store. Your developer then finds that there are 1,000,000 products in the store. To scrape the information you need, the scraper has to visit ~1,000,000 pages. Using a simple HTTP crawler this will cost you, let's say, $100. Assuming you want fresh data weekly, the project will cost you ~$400 a month.&lt;/p&gt;

&lt;p&gt;Then you think it would be great to also get the product reviews. But to get all of them, the scraper needs to visit a separate page for each product. This doubles the scraping cost because the scraper must visit 2 million pages instead of 1 million. That's an extra ~$400 for a total of ~$800 a month.&lt;/p&gt;

&lt;p&gt;Finally, you realize that you would also like to know the delivery cost estimate that's displayed right under the price. Unfortunately, this estimate is computed dynamically on the page using a third-party service, and your developer tells you that you will need a headless browser to do the computation. This increases the price of scraping product details ~20 times. From $100 to $2,000.&lt;/p&gt;

&lt;p&gt;In total, those two relatively small adjustments pushed the monthly price of scraping from ~$400 to ~$8,400.&lt;/p&gt;
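
&lt;p&gt;That back-of-envelope math can be sketched in a few lines of Python. The unit costs are the illustrative figures from the example above, not real pricing:&lt;/p&gt;

```python
# Back-of-envelope cost model for the example above. All constants are
# the article's illustrative assumptions, not actual Apify pricing.
HTTP_COST_PER_M_PAGES = 100   # dollars per 1 million pages with an HTTP scraper
HEADLESS_MULTIPLIER = 20      # headless browser is roughly 20x the HTTP cost
RUNS_PER_MONTH = 4            # fresh data weekly


def monthly_cost(product_pages_m, review_pages_m=0, headless_products=False):
    """Monthly scraping cost in dollars, page counts given in millions."""
    per_run = product_pages_m * HTTP_COST_PER_M_PAGES
    if headless_products:
        per_run *= HEADLESS_MULTIPLIER  # delivery estimates need a headless browser
    per_run += review_pages_m * HTTP_COST_PER_M_PAGES  # review pages stay on HTTP
    return per_run * RUNS_PER_MONTH


print(monthly_cost(1))                             # 400  -- products only
print(monthly_cost(1, review_pages_m=1))           # 800  -- plus reviews
print(monthly_cost(1, 1, headless_products=True))  # 8400 -- plus delivery estimates
```

&lt;p&gt;The jump from 400 to 8,400 comes almost entirely from the headless-browser multiplier, which is why a good developer will fight to keep a scraper on plain HTTP requests.&lt;/p&gt;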

&lt;p&gt;The best approach to avoid costly surprises like that is thoroughly clarifying your requirements, both internally and externally. Focus on the outcomes you seek, rather than the features. &lt;a href="https://apify.com/enterprise"&gt;Experienced Apify consultants&lt;/a&gt; can help you prepare a great specification that will deliver the outcomes at the most efficient price point.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. There &lt;em&gt;are&lt;/em&gt; legal limits to what you can scrape&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Even though web scraping is perfectly legal, there are rules and regulations every web scraper must follow. In short, you need to be careful when scraping:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;personal data (emails, names, photos of people, birthdates, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;copyrighted content (videos, images, news, blog posts, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;data that's only available after signing up (accepting terms of use)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Web scraper consultants and developers can give you guidance on your project, but they can't replace professional advice from your local lawyer. Laws and regulations are very different across the world. Web scraping professionals have seen a fair share of projects, and they can reasonably guess whether your own project will be regulated or not, but they're not globally certified lawyers.&lt;/p&gt;

&lt;p&gt;If you want to learn more, I have written an &lt;a href="https://blog.apify.com/is-web-scraping-legal/"&gt;extensive guide that covers the legality of web scraping&lt;/a&gt;. It includes detailed explanations of the above categories of data, up-to-date case law, and actionable tips that will help you decide whether you could benefit from talking to a lawyer.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Start with a proof of concept for your web scraper&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Even though you might want to kick off your new initiative and reap the benefits of web data at full scale right away, I strongly recommend starting with a proof of concept or an MVP.&lt;/p&gt;

&lt;p&gt;As I explained earlier, all web scraping projects venture into uncharted territory. Websites are controlled by third parties, and they don't guarantee any sort of uptime or data quality. Sometimes you'll find that they're grossly over-reporting the number of available products. Other times the website changes so often that the web scraper maintenance costs become unbearable. Remember Twitter (now X) and their shenanigans with the public availability of tweets.&lt;/p&gt;

&lt;p&gt;The inherent unpredictability of web scraping can be mitigated by approaching it as an R&amp;amp;D project. Build the minimal first version, learn, and iterate. If you're looking to scrape 100 competitors, start by validating your ideas on the first 5. Choose the most impactful ones or the ones that your developers view as the most difficult to scrape. Make sure the ROI is there, and armed with those learnings, start building the next batch of websites. You will get better results, faster, and at a more competitive price this way, I promise.&lt;/p&gt;

&lt;p&gt;The Apify team recommends starting with a PoC on all projects. Even high-profile customers are often uncertain about the outcomes web scraping can bring to their organizations. Starting small helps them get buy-in from key stakeholders and onboard their teams properly. The customers also appreciate the flexibility a PoC allows, because they can start seeing results in a matter of days or weeks, and if something doesn't add up on their end, they can quickly pivot the project or request changes in the specification.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Prepare for turning data into insights&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Mining companies are experts in mining ore. Web scraping companies and developers are experts in mining data. And just like mining companies aren't the best vendors for building bridges or car engines using the mined ore, web scraping companies often have limited experience with banking, automotive, fashion, or any other complex business domain.&lt;/p&gt;

&lt;p&gt;Before you buy or build a web scraper, you must ask yourself whether your vendor or your team has the skills and the capacity to turn the raw data into actionable insights. Web scraping is only the first part of the process that unlocks new business value.&lt;/p&gt;

&lt;p&gt;From time to time, our customers simply weren't ready to process the vast amount of data web scraping offered. This led to sunk costs and the downscaling of their projects over time. They had the data, but they could not find the insights, which led to poor ROI.&lt;/p&gt;

&lt;p&gt;At Apify we actively ask our customers about their domain expertise expectations before starting any custom project. In situations where we miss the relevant skills in our team, we transparently leverage our partners. They specialize in specific domains like competitive intelligence, natural language processing, or web application development.&lt;/p&gt;

&lt;p&gt;We also require the customers of our professional services to dedicate internal resources to the project. Without that, it's unlikely that a web scraping project will succeed in the long term.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Anything else?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Yes, there are about a million things that can go wrong in a web scraping project. But that's true in any field of human activity. Whether you choose to develop with &lt;a href="https://crawlee.dev/"&gt;open-source tools&lt;/a&gt;, use &lt;a href="https://apify.com/store"&gt;ready-made web scrapers&lt;/a&gt; or buy a &lt;a href="https://apify.com/enterprise"&gt;fully-managed service&lt;/a&gt;, a little due diligence will go a long way. And if you understand and follow the 6 recommendations above, I'm confident that your web scrapers will be set up for long-term success.&lt;/p&gt;

</description>
      <category>webscraping</category>
    </item>
    <item>
      <title>Selenium Grid: what it is and how to set it up</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Wed, 02 Aug 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/apify/selenium-grid-what-it-is-and-how-to-set-it-up-29mg</link>
      <guid>https://dev.to/apify/selenium-grid-what-it-is-and-how-to-set-it-up-29mg</guid>
      <description>&lt;p&gt;Explore Selenium Grid use cases in large test suites, cross-browser testing, and continuous integration. Check the steps for setting up Selenium Grid and practical tips for efficient parallel test execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Selenium Grid?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Selenium Grid is a powerful tool that enhances the efficiency of Selenium test automation by allowing tests to be executed in parallel across multiple machines and web browsers. It acts as a test execution environment where tests can be distributed and run on various Selenium Grid Nodes simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What are the benefits of using Selenium Grid?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Its distributed testing capability makes Selenium Grid an invaluable resource for reducing test execution time and achieving faster feedback in the development cycle.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduced test execution time:&lt;/strong&gt; with parallel test execution, Selenium Grid significantly reduces the time required to execute test suites, as multiple tests run concurrently on different Nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improved test coverage:&lt;/strong&gt; Selenium Grid enables testing on various browser and operating system combinations, ensuring better test coverage and identifying cross-browser compatibility issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost-effective:&lt;/strong&gt; by leveraging existing infrastructure and reusing test scripts, Selenium Grid helps optimize resource utilization, making it cost-effective for large-scale test automation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficient test feedback:&lt;/strong&gt; faster test execution and parallelization provide quicker feedback to developers, enabling them to identify and fix issues promptly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; Selenium Grid's distributed architecture allows easy scaling by adding more Nodes, accommodating increased testing demands as projects grow.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;When to use Selenium Grid?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Selenium Grid becomes particularly advantageous in scenarios where its distributed testing capabilities can significantly enhance test automation efficiency and effectiveness.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Large test suites and parallel execution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When dealing with extensive test suites that take a long time to execute sequentially, Selenium Grid can parallelize test execution across multiple Nodes. This dramatically reduces the overall test execution time and provides faster feedback to the development team.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cross-browser and cross-platform testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To ensure your web application functions correctly across different browsers and operating systems, Selenium Grid allows you to execute tests on a variety of browser configurations concurrently. This helps identify compatibility issues early in the development process.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Scaling test infrastructure&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As your test automation requirements grow, Selenium Grid facilitates horizontal scaling by adding more nodes to the grid. This scalability ensures that your test infrastructure can accommodate increased testing demands without sacrificing execution speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Continuous integration (CI) pipelines&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In CI/CD pipelines, where frequent code changes trigger automated testing, Selenium Grid's parallel execution capability becomes indispensable. It allows you to execute multiple tests simultaneously on various Nodes, speeding up the testing process and ensuring rapid feedback.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Geographically distributed testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When your application caters to users from different regions, Selenium Grid can set up Nodes on geographically distributed machines. This approach allows you to verify the application's functionality and performance in different network conditions and locations.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example scenario: e-commerce website testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Consider an e-commerce website with a large number of test cases to validate its functionalities. Without Selenium Grid, running these tests sequentially could take hours. However, by leveraging Selenium Grid, we can divide the test suite across multiple Nodes, each capable of testing on different browser and OS combinations. As a result, we can significantly reduce test execution time, enabling faster feedback for developers and testers.&lt;/p&gt;
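
&lt;p&gt;The speedup comes purely from running cases concurrently. Here's a minimal Python sketch of the idea, with stub tests standing in for real browser sessions; on an actual grid, each worker would drive its own remote WebDriver session instead of sleeping:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor
import time


def run_test(case):
    # Stand-in for a real Selenium test case; sleeping simulates the time
    # a browser session would spend loading pages and asserting on them.
    time.sleep(0.01)
    return f"{case}: passed"


cases = [f"test_checkout_{i}" for i in range(8)]

# Four workers play the role of four Grid Nodes executing in parallel,
# so the wall-clock time is roughly a quarter of the sequential run.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_test, cases))

print(len(results))  # 8
```

&lt;p&gt;The Hub's job is essentially this dispatching, done across machines and browsers rather than threads.&lt;/p&gt;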

&lt;h2&gt;
  
  
  &lt;strong&gt;Selenium Grid architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Selenium Grid's architecture is designed to facilitate parallel test execution across multiple Nodes, enabling efficient distribution of test cases and providing faster results. The architecture consists of two main components: the Hub and the Nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.selenium.dev/documentation/legacy/selenium_3/grid_components/#hub" rel="noopener noreferrer"&gt;&lt;strong&gt;The Hub&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The Selenium Grid Hub serves as the central control point for test execution. It receives test requests from clients (test scripts) and manages the distribution of these tests to available Nodes. The Hub acts as a mediator between clients and Nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.selenium.dev/documentation/legacy/selenium_3/grid_components/#nodes" rel="noopener noreferrer"&gt;&lt;strong&gt;Nodes&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Nodes are the execution environments where tests run. Each Node registers itself with the Hub, indicating its availability for test execution. Nodes can be configured with various browser and OS combinations, offering a diverse testing environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Communication flow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A test script (client) requests a new session from the Hub by specifying the desired browser and platform (e.g., Chrome on Windows).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Hub examines its registry of available Nodes and forwards the test request to an appropriate Node capable of fulfilling the desired capabilities.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The selected Node launches the specified browser with the desired configuration and establishes a new WebDriver session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The test script communicates with the browser via the WebDriver session on the Node for test execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Test results and status are reported back to the Hub, which forwards them to the client.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
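
&lt;p&gt;From the client side, the flow above boils down to pointing a remote WebDriver at the Hub. Here's a minimal Python sketch; it assumes a Hub on the default local address, and &lt;code&gt;grid_url&lt;/code&gt; is just a helper written for this example:&lt;/p&gt;

```python
def grid_url(host="localhost", port=4444):
    """Address of the Selenium Grid Hub (4444 is the Selenium 4 default)."""
    return f"http://{host}:{port}"


def run_smoke_test(url="https://example.com"):
    # Imported here so the sketch stays importable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    # Step 1: request a Chrome session from the Hub, which forwards it
    # to a matching Node (steps 2-3).
    driver = webdriver.Remote(command_executor=grid_url(), options=Options())
    try:
        driver.get(url)          # step 4: drive the browser on the Node
        return driver.title      # step 5: result flows back through the Hub
    finally:
        driver.quit()


print(grid_url())  # http://localhost:4444
```

&lt;p&gt;Calling &lt;code&gt;run_smoke_test()&lt;/code&gt; with a Hub and at least one Chrome-capable Node running should return the page title of the target URL.&lt;/p&gt;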

&lt;h3&gt;
  
  
  &lt;strong&gt;Load balancing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Selenium Grid employs a load balancing mechanism to ensure efficient utilization of available Nodes. When multiple Nodes with similar desired capabilities are present, the Hub distributes test requests across these Nodes, optimizing resource usage and reducing test execution time.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling failures&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Selenium Grid provides robust mechanisms to handle Node failures gracefully. If a Node becomes unresponsive during test execution, the Hub reassigns the affected test cases to other available Nodes, ensuring that the overall test suite continues running smoothly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Web browser drivers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Each Node in Selenium Grid must have the corresponding web browser driver installed. For example, if a Node is configured to run tests on Chrome, it should have the ChromeDriver installed and properly configured.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to set up Selenium Grid&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Java 11 or higher installed (&lt;a href="https://www.java.com/en/download/" rel="noopener noreferrer"&gt;download link&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Browser(s) installed (e.g., Chromium, Firefox, Safari)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Browser drivers (e.g., ChromeDriver, GeckoDriver)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add the browser drivers' location to the system &lt;strong&gt;PATH&lt;/strong&gt; or place them in a directory accessible to the Python scripts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download the Selenium Server &lt;code&gt;jar&lt;/code&gt; file from the &lt;a href="https://github.com/SeleniumHQ/selenium/releases/latest" rel="noopener noreferrer"&gt;latest release&lt;/a&gt; (at the time of writing of this article, the latest release was &lt;a href="https://github.com/SeleniumHQ/selenium/releases/download/selenium-4.10.0/selenium-server-4.10.0.jar" rel="noopener noreferrer"&gt;Selenium Server version 4.10.0&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Start Selenium Grid Hub&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Open a terminal or command prompt and run the following command to start the Selenium Grid Hub, making sure to start the grid from the same folder where the &lt;code&gt;jar&lt;/code&gt; file is located:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java -jar selenium-server-&amp;lt;version&amp;gt;.jar hub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;👉 If you have downloaded a jar file from a different version of Selenium Server, replace the &lt;code&gt;&amp;lt;version&amp;gt;&lt;/code&gt; placeholder in the command with your version number.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once the command finishes running, we will receive a message indicating that the Hub has been successfully started:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkv5cw7s4urzuwm7y3zp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkv5cw7s4urzuwm7y3zp.png" alt="Selenium Grid Hub Local Host" width="800" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example of the message shown when the Hub has started&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To ensure that everything worked as intended, visit the local URL where the Selenium Grid Hub started. Since we have not registered any Nodes yet, you should see a screen similar to the one below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftdsmh6mf7rtzlu9ouh5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftdsmh6mf7rtzlu9ouh5.png" alt="Selenium Grid Hub: visiting local URL" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Make sure that everything works correctly by visiting the local URL&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Add Nodes to the Hub&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To register a Node, open a new terminal and run the command below. At startup, the Node scans the system PATH to identify and make use of the available browser drivers for test execution. Note that the command assumes that both the Node and the Hub are running on the same machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java -jar selenium-server-&amp;lt;version&amp;gt;.jar node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can register multiple Nodes with different desired capabilities to test on various browser configurations. For example, we can register an additional Node on port &lt;code&gt;6666&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java -jar selenium-server-4.10.0.jar node --port 6666
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Verify Hub and Nodes&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To verify which Nodes are registered with our Hub, we can open a web browser and navigate to &lt;a href="https://www.selenium.dev/documentation/legacy/selenium_3/grid_components/#hub" rel="noopener noreferrer"&gt;&lt;code&gt;http://localhost:4444/ui&lt;/code&gt;&lt;/a&gt;. This will display the Selenium Grid console, showing the registered Hub and Nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhyjyxhqi1idr5vcwhb6p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhyjyxhqi1idr5vcwhb6p.png" alt="Selenium Grid Nodes" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Verify what Nodes are open on the Hub&lt;/em&gt;&lt;/p&gt;
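&lt;p&gt;You can also check the Grid's readiness programmatically. Selenium Grid 4 exposes a &lt;code&gt;/status&lt;/code&gt; endpoint (for example, &lt;code&gt;http://localhost:4444/status&lt;/code&gt;) that returns a JSON payload. The sketch below parses that payload; the sample response is a trimmed, illustrative shape rather than a full Grid response:&lt;/p&gt;

```python
import json

def grid_is_ready(status_payload):
    """Return True if the Grid reports itself ready and has at least one Node.

    Expects the JSON shape returned by Selenium Grid 4's /status endpoint.
    """
    value = status_payload.get("value", {})
    return bool(value.get("ready")) and bool(value.get("nodes"))

# Trimmed, illustrative payload of the kind /status returns:
sample = json.loads('{"value": {"ready": true, "message": "UP", "nodes": [{"id": "n1"}]}}')
print(grid_is_ready(sample))
```

&lt;p&gt;In a real setup, you would fetch the payload with &lt;code&gt;requests.get("http://localhost:4444/status").json()&lt;/code&gt; before passing it to the helper.&lt;/p&gt;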

&lt;h2&gt;
  
  
  &lt;strong&gt;Running tests using Selenium Grid&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now, let's create a simple test script in Python to demonstrate how to run tests using Selenium Grid. In this example, we will use the Selenium WebDriver with Python to open &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt; and extract the text content of its title and description in multiple browsers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium import webdriver
from selenium.webdriver.common.by import By

# Define the URL for the Selenium Grid hub
hub_url = '&amp;lt;http://192.168.1.221:4444&amp;gt;'

# Create browser options for Chrome
chrome_options = webdriver.ChromeOptions()

# Create browser options for Firefox
firefox_options = webdriver.FirefoxOptions()

# Connect to the Selenium Grid hub and create a remote WebDriver instance

# Chrome
driver_chrome = webdriver.Remote(
    command_executor=hub_url,
    options=chrome_options
)

driver_chrome.get("&amp;lt;https://apify.com/store&amp;gt;")

chrome_data = {
    "page_title": driver.find_element(By.CSS_SELECTOR, "header &amp;gt; div &amp;gt; h1").text,
    "page_description": driver.find_element(By.CSS_SELECTOR, "header &amp;gt; div &amp;gt; p").text
}

print(chrome_data)

# Firefox
driver_firefox = webdriver.Remote(
    command_executor=hub_url,
    options=firefox_options
)

driver_firefox.get("&amp;lt;https://apify.com/store&amp;gt;")

firefox_data = {
    "page_title": driver.find_element(By.CSS_SELECTOR, "header &amp;gt; div &amp;gt; h1").text,
    "page_description": driver.find_element(By.CSS_SELECTOR, "header &amp;gt; div &amp;gt; p").text
}

print(firefox_data)

driver.quit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code snippet above demonstrates a test scenario where the page title and description are extracted from the Apify Store website using both Chrome and Firefox browsers.&lt;/p&gt;

&lt;p&gt;With this code, we can run the same test on different browsers by creating a separate WebDriver instance for each browser and connecting it to our Selenium Grid Hub. As written, the sessions run one after the other; to run them concurrently, launch each session from its own thread or process.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling test failures&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In distributed testing scenarios, it is crucial to handle test failures efficiently. If a Node becomes unresponsive during test execution, the Hub stops routing new sessions to it and assigns them to other available Nodes, keeping the rest of the test suite running smoothly. Re-running the tests that were interrupted is the responsibility of your test runner or client code.&lt;/p&gt;
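&lt;p&gt;On the client side, a lightweight retry wrapper is a common complement for flaky sessions. The sketch below is illustrative: &lt;code&gt;flaky_test&lt;/code&gt; simulates a session failing twice before succeeding, and real code would catch &lt;code&gt;selenium.common.exceptions.WebDriverException&lt;/code&gt; rather than &lt;code&gt;RuntimeError&lt;/code&gt;:&lt;/p&gt;

```python
import time

def run_with_retry(test_fn, retries=2, delay=0.0, retriable=(RuntimeError,)):
    """Run test_fn, retrying up to `retries` extra times on retriable errors."""
    last_error = None
    for _attempt in range(retries + 1):
        try:
            return test_fn()
        except retriable as exc:
            last_error = exc
            time.sleep(delay)  # back off before requesting a new session from the Hub
    raise last_error

# Stand-in for a Grid test: fails twice, then succeeds.
calls = {"n": 0}
def flaky_test():
    calls["n"] += 1
    if calls["n"] != 3:
        raise RuntimeError("node became unresponsive")
    return "page title"

print(run_with_retry(flaky_test, retries=3))  # "page title" after two retries
```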

&lt;h3&gt;
  
  
  &lt;strong&gt;Parallel test execution tips&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Organize test cases in a way that allows for easy parallelization and avoids dependencies between tests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Divide your test suite into smaller chunks to distribute the load evenly across Nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consider setting up a dedicated test infrastructure for Selenium Grid to ensure stability and optimal performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
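&lt;p&gt;The pattern behind those tips can be sketched with Python's &lt;code&gt;concurrent.futures&lt;/code&gt;: one worker per browser configuration, so the Hub can farm each session out to a Node. The &lt;code&gt;run_test&lt;/code&gt; function here is a placeholder for creating a &lt;code&gt;webdriver.Remote&lt;/code&gt; session and running the actual assertions:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Browser configurations to cover; in a real suite these would be
# webdriver.ChromeOptions() / webdriver.FirefoxOptions() instances.
CONFIGS = ["chrome", "firefox", "edge"]

def run_test(browser_name):
    # Placeholder for: webdriver.Remote(command_executor=hub_url, options=...),
    # the test assertions, and driver.quit() in a finally block.
    return browser_name + ": ok"

# One worker per configuration; pool.map preserves input order.
with ThreadPoolExecutor(max_workers=len(CONFIGS)) as pool:
    results = list(pool.map(run_test, CONFIGS))

print(results)
```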

&lt;p&gt;If you're not sure that Selenium is the right testing framework for you, check out this detailed post on &lt;a href="https://blog.apify.com/cypress-vs-selenium/" rel="noopener noreferrer"&gt;Cypress vs. Selenium&lt;/a&gt;. Or find out whether Selenium is the best choice for &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; in &lt;a href="https://blog.apify.com/playwright-vs-selenium-webscraping/" rel="noopener noreferrer"&gt;Playwright vs. Selenium&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>selenium</category>
      <category>testing</category>
    </item>
    <item>
      <title>Python and machine learning</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Sun, 30 Jul 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/apify/python-and-machine-learning-174b</link>
      <guid>https://dev.to/apify/python-and-machine-learning-174b</guid>
      <description>&lt;p&gt;Learn how Python and machine learning intersect to solve complex problems that defeat traditional programming methods. Find out about Pandas, TensorFlow, Scikit-learn, and how they can transform data.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is machine learning?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Machine learning, a subset of artificial intelligence (AI), is a rapidly evolving field with numerous practical applications in various domains. Recently, the popularity and impact of AI, exemplified by advancements like &lt;a href="https://blog.apify.com/gpt-scraper-chatgpt-access-internet/" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt;, have boosted interest in the field and its potential to enhance our daily lives. But what exactly is machine learning and when would we want to use it? And how does Python fit in with machine learning?&lt;/p&gt;

&lt;p&gt;To answer these questions, let's consider an example to understand its significance. Imagine you're tasked with developing a program to analyze an image and determine whether it contains a cat, a dog, or another animal. To accomplish such a broad task, traditional programming techniques would quickly lead to overwhelming and time-consuming complexity. Devising multiple rules to detect curves, edges, and colors in the image would be prone to flaws. For example, black-and-white photos would require rule revisions, and unanticipated angles of cats or dogs would make any rules we create ineffective. In other words, attempting to solve this problem through traditional programming methods would prove excessively complicated or even impossible.&lt;/p&gt;

&lt;p&gt;And this is where &lt;a href="https://blog.apify.com/what-is-machine-learning-doing-for-us/" rel="noopener noreferrer"&gt;machine learning&lt;/a&gt; comes into play. It offers a technique for us to address such problems effectively. Instead of relying on explicit programming rules, we can construct a model or an engine and provide it with an abundance of data. For instance, to solve our dogs and cats problem, we could supply thousands or even tens of thousands of pictures of cats and dogs to a model that would then analyze this input data and learn its patterns autonomously.&lt;/p&gt;

&lt;p&gt;Now, suppose we present the model with a new, unseen picture of a cat and inquire whether the picture depicts a cat, a dog, or a horse. The model, based on its learned patterns, will provide us with a response, accompanied by a certain level of accuracy. The &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;more data we feed into the model&lt;/a&gt;, the better its accuracy becomes, especially if the data is relevant and high quality.&lt;/p&gt;

&lt;p&gt;Although this example is simplistic, machine learning has extensive applications, including self-driving cars, robotics, natural language processing, image recognition, and forecasting, such as predicting stock market trends or weather patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How Python and machine learning come together&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;That all sounds great, but what can we use to build those models? While there is no single best programming language for machine learning, Python has emerged as the de facto language for machine learning due to its simplicity, flexibility, and vibrant ecosystem of libraries and tools.&lt;/p&gt;

&lt;p&gt;In this article, we will explore the &lt;a href="https://blog.apify.com/what-are-the-best-python-web-scraping-libraries/" rel="noopener noreferrer"&gt;best Python libraries&lt;/a&gt; for developing machine-learning models, such as Pandas, TensorFlow, Scikit-learn, and more, to understand their role in the various stages of the machine-learning process.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5 steps in developing a machine learning model with Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Developing a machine learning model involves several essential steps that collectively form a pipeline from data preparation to model deployment. Understanding these steps is crucial for building effective and accurate machine-learning models. Let's take a quick look at each step and what popular Python libraries we could use to fulfill the requirements of each step:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Data preparation and exploration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Data preparation and exploration lay the foundation for any successful machine-learning project. This step involves tasks such as &lt;strong&gt;data cleaning&lt;/strong&gt;, &lt;strong&gt;handling missing values&lt;/strong&gt;, &lt;strong&gt;feature scaling&lt;/strong&gt;, and &lt;strong&gt;data visualization&lt;/strong&gt;. Properly preparing and exploring the data can help identify patterns, outliers, and relationships that will influence the model's performance.&lt;/p&gt;

&lt;p&gt;To accomplish this step, we can leverage libraries such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://pandas.pydata.org/docs/" rel="noopener noreferrer"&gt;&lt;strong&gt;Pandas&lt;/strong&gt;&lt;/a&gt;: In the context of machine learning, Pandas is a crucial tool for handling and analyzing structured data. By leveraging its powerful data structures, such as DataFrames, we can efficiently manipulate and transform datasets. To that end, Pandas provides an extensive range of functions for data cleaning, handling missing values, and performing descriptive statistics. These capabilities are crucial in the data preparation phase of machine learning, enabling us to preprocess the data, remove outliers, impute missing values, and extract meaningful insights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://matplotlib.org/stable/index.html" rel="noopener noreferrer"&gt;&lt;strong&gt;Matplotlib&lt;/strong&gt;&lt;/a&gt;: As a widely-used plotting library, Matplotlib offers a versatile set of visualization techniques, including line plots, scatter plots, and histograms. These visualizations are invaluable in the, help researchers identify patterns, trends, and anomalies in the dataset in the data exploration phase. By visualizing the data, machine learning practitioners can make informed decisions about feature engineering, data preprocessing, and model selection.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;❗ The code examples provided in this article are for demonstration and educational purposes only and should not be considered production-ready.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To get an idea of how we would go about this step, let's consider a situation where we use Pandas to explore and visualize data retrieved from a CSV file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Load the dataset
data = pd.read_csv('sample_data.csv')

# Explore the data
print(data.head()) # Display the first few rows
print(data.describe()) # Get statistical summary
print(data.info()) # Get information about the columns

# Handle missing values
data = data.fillna(0) # Replace missing values with 0

# Visualize the data
data['age'].plot.hist() # Plot a histogram of the age column
data.plot.scatter(x='income', y='purchase') # Create a scatter plot of income vs. purchase

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To obtain high-quality datasets for machine learning, there are several options available. One approach is to download existing datasets from machine learning communities like &lt;a href="https://www.kaggle.com/" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;, where you can find a wide range of &lt;a href="https://www.kaggle.com/datasets" rel="noopener noreferrer"&gt;datasets for free&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Alternatively, if you require a dataset tailored to your specific project, web scraping can be an effective solution. &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;Web scraping&lt;/a&gt; platforms like &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify&lt;/a&gt; offer access to numerous pre-built scrapers in &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;, allowing you to extract data from data-rich websites such as &lt;a href="https://apify.com/compass/crawler-google-places" rel="noopener noreferrer"&gt;Google Maps&lt;/a&gt;, &lt;a href="https://apify.com/bernardo/youtube-scraper" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt;, and &lt;a href="https://apify.com/lexis-solutions/meta-threads-replies-scraper" rel="noopener noreferrer"&gt;Meta's Threads&lt;/a&gt;. Additionally, for those interested in flexing their web scraping skills, &lt;a href="https://youtu.be/8QJetr-BYdQ" rel="noopener noreferrer"&gt;building and deploying custom scrapers&lt;/a&gt; is an option.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Feature engineering and selection&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Feature engineering involves transforming raw data into meaningful features that capture the underlying patterns and relationships. This step often requires domain expertise and creativity. Feature selection aims to identify the most relevant features for the model, reducing complexity and improving efficiency.&lt;/p&gt;

&lt;p&gt;To assist with feature engineering and selection, we can utilize libraries such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/" rel="noopener noreferrer"&gt;&lt;strong&gt;Scikit-learn&lt;/strong&gt;&lt;/a&gt;: Scikit-learn offers a wide range of feature extraction and transformation techniques. It helps us handle different data types, encode categorical variables for numerical representation, scale numerical features, generate new informative features, and perform feature selection to improve model performance. In short, Scikit-learn streamlines feature engineering, making data preprocessing and transformation easier, resulting in more effective machine learning models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://featuretools.alteryx.com/en/stable/" rel="noopener noreferrer"&gt;&lt;strong&gt;Featuretools&lt;/strong&gt;&lt;/a&gt;: Featuretools is a library designed for automated feature engineering in machine learning. It enables us to create new features by combining existing ones, making it easier to capture complex relationships and patterns in the data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To illustrate this step, let's consider a text classification task where we want to classify news articles into different categories. We can use &lt;a href="https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing" rel="noopener noreferrer"&gt;Scikit-learn to preprocess the text data&lt;/a&gt;, convert it into numerical features, and select the most important features using the TF-IDF (Term Frequency-Inverse Document Frequency) method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# `text_data` (a list of article strings) and `labels` (their categories)
# are assumed to be loaded beforehand.

# Preprocess the text data
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_data)

# Select the most important features
selector = SelectKBest(chi2, k=1000)
X_selected = selector.fit_transform(X, labels)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;3. Model building and training&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Model building involves selecting an appropriate algorithm or model architecture to solve the problem at hand. Python offers a wide range of algorithms and models, each suited for different types of problems. Once the model is chosen, it needs to be trained on labeled data to learn the patterns and make accurate predictions.&lt;/p&gt;

&lt;p&gt;To build and train machine learning models, we can rely on libraries such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scikit-learn&lt;/strong&gt;: Scikit-learn can not only help us with step 2 (feature engineering and selection), but it also offers a consistent API that facilitates the training process with functions for model fitting, hyperparameter tuning, and model serialization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;TensorFlow&lt;/strong&gt;&lt;/a&gt;: TensorFlow is a popular deep-learning framework that allows us to build and train neural networks for various tasks. It offers a wide range of pre-built neural network architectures and supports custom model creation. TensorFlow provides efficient computation on GPUs and TPUs, enabling faster training for large-scale models.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To illustrate this, let's take a look at how we would implement this step in a real project using &lt;strong&gt;Scikit-learn&lt;/strong&gt; and &lt;strong&gt;TensorFlow&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let's take a classification problem as an example. We can use logistic regression from &lt;a href="https://scikit-learn.org/stable/supervised_learning.html#supervised-learning" rel="noopener noreferrer"&gt;Scikit-learn to train a model&lt;/a&gt; on labeled data and make predictions on new, unseen data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's take it a step further and see how we can use &lt;strong&gt;TensorFlow&lt;/strong&gt; not only to build and train the model but also to make predictions and, finally, deploy it.&lt;/p&gt;

&lt;p&gt;For example, imagine we are building a handwritten digit recognition system. The neural network architecture defined in the code below could be trained on a dataset of handwritten digit images along with their corresponding labels. Once trained, the model can make predictions on new, unseen digit images, accurately classifying them into their respective digits (0 to 9).&lt;/p&gt;

&lt;p&gt;Then, the trained model can be saved and deployed in a production environment, where it can be integrated into a larger application or used as an &lt;a href="https://blog.apify.com/what-is-an-api/" rel="noopener noreferrer"&gt;API&lt;/a&gt; to provide digit recognition functionality to end users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tensorflow as tf

# Creating a simple neural network
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compiling the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Training the model
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_val, y_val))

# Making predictions
predictions = model.predict(x_test)

# Deploying the model
model.save('model.h5')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;4. Model evaluation and validation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After training the model, it is essential to assess its performance and validate its ability to generalize well on unseen data. Evaluation metrics such as &lt;strong&gt;accuracy&lt;/strong&gt;, &lt;strong&gt;precision&lt;/strong&gt;, &lt;strong&gt;recall&lt;/strong&gt;, and &lt;strong&gt;F1 score&lt;/strong&gt; provide insights into the model's effectiveness. Validation techniques like &lt;strong&gt;cross-validation&lt;/strong&gt; help estimate how well the model will perform in the real world.&lt;/p&gt;

&lt;p&gt;Before we get to the libraries we use for model evaluation and validation, let's understand what exactly the metrics and techniques mentioned above measure and why they are important for building reliable machine-learning models.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Evaluation metrics&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;: Measures the &lt;em&gt;proportion of correctly classified instances&lt;/em&gt; out of the total instances. It is calculated as the number of correct predictions divided by the total number of predictions. Accuracy provides a general measure of how well the model performs overall. For example, in email spam detection, accuracy measures the percentage of emails correctly classified as spam or non-spam.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Precision&lt;/strong&gt;: The &lt;em&gt;proportion of correctly predicted positive instances&lt;/em&gt; out of all instances predicted as positive. It represents the model's ability to avoid false positive errors, indicating how precise the positive predictions are. Precision is important in scenarios where false positives are costly. For instance, in medical diagnosis, precision is crucial to accurately identify patients with a specific disease to avoid unnecessary treatments or interventions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recall&lt;/strong&gt;: Also known as sensitivity or true positive rate, &lt;em&gt;measures the proportion of correctly predicted positive instances out of all actual positive instances&lt;/em&gt;. It captures the model's ability to find all positive instances, avoiding false negatives. Recall is particularly important when the cost of false negatives is high. For example, in fraud detection, recall is essential to identify as many fraudulent transactions as possible, even if it means a higher number of false positives.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;F1 score&lt;/strong&gt;: The F1 score is the harmonic mean of precision and recall. It provides a &lt;em&gt;balanced measure of the model's performance, considering both precision and recall simultaneously&lt;/em&gt;. The F1 score is useful when there is an uneven class distribution or when both precision and recall are equally important. For example, in information retrieval systems, the F1 score is commonly used to evaluate search algorithms, where both precision and recall are crucial in providing accurate and comprehensive search results.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
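&lt;p&gt;To make these definitions concrete, here is a small, self-contained sketch that computes all four metrics by hand for an invented binary classification result (the &lt;code&gt;y_true&lt;/code&gt; and &lt;code&gt;y_pred&lt;/code&gt; lists are purely illustrative):&lt;/p&gt;

```python
y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions (illustrative)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # all 0.75 for this toy example
```

&lt;p&gt;Scikit-learn's &lt;code&gt;accuracy_score&lt;/code&gt;, &lt;code&gt;precision_score&lt;/code&gt;, &lt;code&gt;recall_score&lt;/code&gt;, and &lt;code&gt;f1_score&lt;/code&gt; compute the same quantities for you.&lt;/p&gt;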

&lt;h4&gt;
  
  
  &lt;strong&gt;Validation techniques (cross-validation)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Cross-validation helps assess a model's generalization performance and mitigate the risk of overfitting. It plays a crucial role in machine learning for the following reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance estimation:&lt;/strong&gt; Cross-validation provides a more reliable estimate of how well a model will perform on unseen data by evaluating it on multiple validation sets. This helps determine if the model has learned meaningful patterns or is simply memorizing the training data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hyperparameter tuning&lt;/strong&gt;: Cross-validation aids in selecting the best set of hyperparameters for a model. By comparing performance across different parameter configurations, it helps identify the optimal combination that maximizes performance on unseen data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model selection&lt;/strong&gt;: Cross-validation allows for a fair comparison between different models or algorithms. By evaluating their performance on multiple validation sets, it assists in choosing the most suitable model for the given problem, considering accuracy, precision, recall, or specific requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data leakage prevention&lt;/strong&gt;: Cross-validation mitigates data leakage by creating separate validation sets that are not used during model training. This ensures a fair evaluation and avoids unintentional over-optimization based on the test set.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In real-life applications, cross-validation is particularly valuable in tasks such as credit risk assessment, where accurate predictions on unseen data are essential for decision-making.&lt;/p&gt;

&lt;p&gt;In summary, cross-validation is essential for the development of robust models that generalize well to new instances and provides confidence in their performance outside the training data.&lt;/p&gt;
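&lt;p&gt;The core mechanics of k-fold cross-validation are simple enough to sketch in pure Python: the data is split into &lt;em&gt;k&lt;/em&gt; folds, and each fold serves as the validation set exactly once. This is a simplified sketch; libraries like Scikit-learn also handle shuffling and stratification:&lt;/p&gt;

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    base, extra = divmod(n_samples, k)
    start = 0
    for i in range(k):
        # The first `extra` folds get one additional sample when k does not
        # divide n_samples evenly.
        size = base + (1 if i in range(extra) else 0)
        val = list(range(start, start + size))
        held_out = set(val)
        train = [j for j in range(n_samples) if j not in held_out]
        yield train, val
        start += size

# With 10 samples and 5 folds, every sample appears in validation exactly once.
folds = list(kfold_indices(10, 5))
print([val for _train, val in folds])  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```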

&lt;p&gt;To evaluate and validate machine learning models, we can utilize libraries such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scikit-learn&lt;/strong&gt;: Scikit-learn offers a wide range of evaluation metrics for classification, regression, and clustering tasks. It provides functions for calculating accuracy, precision, recall, F1 score, and more. &lt;a href="https://scikit-learn.org/stable/modules/cross_validation.html" rel="noopener noreferrer"&gt;Scikit-learn also includes techniques for cross-validation&lt;/a&gt;, which allows for robust performance estimation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.scikit-yb.org/en/latest/" rel="noopener noreferrer"&gt;&lt;strong&gt;Yellowbrick&lt;/strong&gt;&lt;/a&gt;: Yellowbrick is a visualization library that integrates with Scikit-learn and provides visual tools for model evaluation and diagnostics. It offers visualizations for classification reports, learning curves, confusion matrices, and feature importances, aiding in the analysis of model performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, let's take a look at how we can use some of &lt;a href="https://scikit-learn.org/stable/model_selection.html#model-selection" rel="noopener noreferrer"&gt;Scikit-learn's evaluation metrics and validation techniques&lt;/a&gt;. Remember our previous example of a classification model? We can use Scikit-learn to evaluate the model's performance by calculating accuracy, precision, recall, and F1 score, and while we are at it, we can also use cross-validation to estimate the model's performance on unseen data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score

# `y_true`/`y_pred` and the fitted `model` with data `X`, `y` are assumed
# to come from the previous classification example.

# Evaluate the model
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Perform cross-validation
scores = cross_val_score(model, X, y, cv=5)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;5. Model deployment and monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once a satisfactory model is obtained, the exciting part begins: deploying it to production environments for real-world usage. This step involves integrating the model into an application or system and ensuring its performance is continuously monitored and optimized over time.&lt;/p&gt;

&lt;p&gt;To deploy and monitor machine learning models, we can rely on libraries such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://flask.palletsprojects.com/en/2.3.x/" rel="noopener noreferrer"&gt;&lt;strong&gt;Flask&lt;/strong&gt;&lt;/a&gt;: Flask is a lightweight web framework that allows us to build APIs for serving machine learning models. It provides a simple and scalable way to expose our models as web services, enabling seamless integration into applications or systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.tensorflow.org/tensorboard/get_started" rel="noopener noreferrer"&gt;&lt;strong&gt;TensorBoard&lt;/strong&gt;&lt;/a&gt;: TensorBoard is a powerful visualization tool that comes bundled with TensorFlow. It helps monitor and analyze the performance of deep learning models by providing interactive visualizations of metrics, model architectures, and training progress.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://grafana.com/docs/grafana/latest/getting-started/get-started-grafana-prometheus/" rel="noopener noreferrer"&gt;&lt;strong&gt;Prometheus and&lt;/strong&gt;  &lt;strong&gt;Grafana&lt;/strong&gt;&lt;/a&gt;: Prometheus is a monitoring and alerting toolkit, while Grafana is a visualization tool. Together, they offer a robust solution for monitoring the performance and health of machine learning models in real time, providing valuable insights and enabling proactive optimization.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The choice of deployment and monitoring tools for machine learning models depends on the project and the libraries you are comfortable with. For example, if you are building TensorFlow models, using TensorBoard to monitor them would be a natural option.&lt;/p&gt;

&lt;p&gt;But we are also not restricted to choosing a single library. To deploy and monitor machine learning models, we can use a combination of libraries. For instance, we can use Flask to create an API to serve the model predictions, while using Prometheus to access its monitoring and alerting capabilities, and Grafana for visualization of performance metrics. Together, they provide a robust solution for deploying and monitoring machine learning models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from flask import Flask, request, jsonify
import prometheus_client
from prometheus_flask_exporter import PrometheusMetrics
import json

app = Flask( __name__ )
metrics = PrometheusMetrics(app)

@app.route('/predict', methods=['POST'])
def predict():
    data = json.loads(request.data)
    # Process the data and make predictions
    # (assumes a trained `model` object was loaded at startup)
    predictions = model.predict(data)
    return jsonify(predictions)

if __name__ == '__main__':
    app.run()

# Monitor the model using Prometheus and Grafana...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;What's next in machine learning and Python?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this article, we have explored the world of machine learning with Python and discussed some of the best libraries available for developing machine learning models. Python's simplicity, flexibility, and extensive library ecosystem make it an ideal choice for both beginners and experienced developers venturing into the field of machine learning.&lt;/p&gt;

&lt;p&gt;As you embark on your machine-learning journey with Python, we encourage you to explore these libraries further. Dive into their documentation, experiment with different algorithms and techniques, and leverage the vast online resources and communities available to you.&lt;/p&gt;

&lt;p&gt;Remember, machine learning is a rapidly evolving field, and staying up to date with the latest advancements and techniques is crucial. If you're interested in continuing, why not try training your own language model to create a personalized ChatGPT using &lt;a href="https://blog.apify.com/how-to-use-langchain/" rel="noopener noreferrer"&gt;LangChain, OpenAI, Pinecone, and Apify&lt;/a&gt;?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>ScrapingBee review: top web scraping API?</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Sun, 23 Jul 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/percivalvillal3/scrapingbee-review-top-web-scraping-api-50a</link>
      <guid>https://dev.to/percivalvillal3/scrapingbee-review-top-web-scraping-api-50a</guid>
      <description>&lt;h3&gt;
  
  
  There are lots of web scraping services out there, but which is the right choice for you? We look at ScrapingBee to see what it offers the dev looking to get data.
&lt;/h3&gt;

&lt;p&gt;Whether you're building an application, conducting market research, or analyzing trends, accessing timely and accurate data is essential. However, identifying the most efficient and reliable methods for obtaining this data can be a daunting task. Should you build your own web scrapers? Use an existing web scraping API? Or go for something in between?&lt;/p&gt;

&lt;p&gt;If you've spent some time googling around for an answer to those questions, then you've probably come across &lt;a href="https://www.scrapingbee.com/"&gt;ScrapingBee&lt;/a&gt;. But now a different question emerges: how do I know if this service is right for my use case? Well, that's precisely what we will try to answer in this article. We will review ScrapingBee's service, analyze the different kinds of tools it provides, and weigh the pros and cons of using the service.&lt;/p&gt;

&lt;p&gt;So, let's get started and see if ScrapingBee is worth using for your web scraping project.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;ScrapingBee: what are the pros and cons?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Benefits: user-friendly web scraping API&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ScrapingBee provides a user-friendly web scraping API that offers various features required for large-scale &lt;a href="https://apify.com/web-scraping"&gt;web scraping&lt;/a&gt; and to prevent getting blocked, including proxies and JavaScript rendering. It is recommended for developers seeking a simple solution for extracting data, which can be seamlessly integrated with their existing code for data processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Limitations: limited control and no integrated cloud solution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ScrapingBee's straightforward approach may be limiting for developers with advanced web scraping knowledge, as they are required to follow the rules set by ScrapingBee's API and have restricted control over the entire data extraction process.&lt;/p&gt;

&lt;p&gt;Additionally, ScrapingBee lacks an integrated solution for managing data extraction flows in the cloud. This can be inconvenient since you would need to find a separate cloud provider or set up your own infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;ScrapingBee Proxy and API credit consumption&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When it comes to large-scale data extraction, proxies are essential for circumventing anti-bot systems used by modern websites. However, utilizing proxies can significantly increase the cost of your web scraping activities. ScrapingBee's API provides several proxy options: Rotating Proxy (default), Premium Proxy, Stealth Proxy, or the ability to use your own proxy. Here is an overview of how the usage of these proxies impacts your API Credit consumption within their system:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Feature used&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;API credit cost/request&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rotating Proxy without JavaScript rendering&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rotating Proxy with JavaScript rendering (default)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Premium Proxy without JavaScript rendering&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Premium Proxy with JavaScript rendering&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stealth Proxy with JavaScript rendering (only option available)&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;ScrapingBee pricing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The pricing of a service often plays a crucial role in our decision-making process. Fortunately, ScrapingBee provides a freemium model that allows users to try their service for free with 1,000 API credits. Their paid plans range from $49/month to $599+/month for the business plan. The key distinction between these plans is the allocation of API credits, with the base plan offering 150,000 credits and the business plans providing 8,000,000+ credits, depending on your needs. Additionally, the more expensive plans offer higher limits for concurrent requests and improved support.&lt;/p&gt;
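&lt;p&gt;To make these numbers concrete, here is a rough, hypothetical cost calculator combining the credit table with the base plan's pricing ($49 for 150,000 credits, per the figures above; the plan name used below is illustrative, and prices may have changed since publication):&lt;/p&gt;

```python
# Back-of-the-envelope cost estimate based on the figures quoted in this
# article. The "base" plan label and all numbers are illustrative, not
# an authoritative ScrapingBee price list.
PLANS = {
    "base": {"price_usd": 49, "credits": 150_000},
}

CREDITS_PER_REQUEST = {
    "rotating": 1,        # Rotating Proxy, no JS rendering
    "rotating_js": 5,     # Rotating Proxy + JS rendering (default)
    "premium": 10,        # Premium Proxy, no JS rendering
    "premium_js": 25,     # Premium Proxy + JS rendering
    "stealth_js": 75,     # Stealth Proxy (JS rendering only)
}

def cost_per_1000_requests(plan: str, feature: str) -> float:
    """Approximate USD cost of 1,000 requests for a plan/feature combo."""
    p = PLANS[plan]
    price_per_credit = p["price_usd"] / p["credits"]
    return round(1000 * CREDITS_PER_REQUEST[feature] * price_per_credit, 2)

print(cost_per_1000_requests("base", "rotating_js"))  # default configuration
print(cost_per_1000_requests("base", "stealth_js"))   # most expensive option
```

&lt;p&gt;On those figures, 1,000 default requests (Rotating Proxy with JavaScript rendering) cost roughly $1.63, while 1,000 Stealth Proxy requests cost about $24.50, which makes the credit multipliers in the table easy to feel in dollar terms.&lt;/p&gt;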

&lt;h2&gt;
  
  
  &lt;strong&gt;ScrapingBee scraping test&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;ScrapingBee offers a versatile data extraction API as one of its primary services, allowing users to extract data from a wide range of web pages. To evaluate its capabilities, I decided to scrape &lt;a href="http://Amazon.com"&gt;Amazon.com&lt;/a&gt;, a well-known website notorious for implementing sophisticated anti-bot systems.&lt;/p&gt;

&lt;p&gt;Navigating through ScrapingBee's API was straightforward, and the ScrapingBee documentation provided clear and updated information. With just a few lines of code, as shown in the example below, I successfully extracted the titles, prices, and links of the iPhones listed on the first page of &lt;a href="http://Amazon.com"&gt;Amazon.com&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from scrapingbee import ScrapingBeeClient # Importing SPB's client
client = ScrapingBeeClient(api_key='YOUR_API_KEY')

response = client.get("https://www.amazon.com/s?k=iphone&amp;amp;crid=1BIGRK4NGFLDS&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;ref=nb_sb_noss_2", params={
'extract_rules':{
                 "product-titles": {
                     "selector": "div.a-section.a-spacing-none.puis-padding-right-small.s-title-instructions-style &amp;gt; h2 &amp;gt; a &amp;gt; span",
                     "type": "list",
                 },
                  "product-prices": {
                      "selector": "div.a-section.a-spacing-none.a-spacing-top-micro.puis-price-instructions-style &amp;gt; div &amp;gt; a &amp;gt; span &amp;gt; span.a-offscreen",
                      "type": "list",
                  },
                  "product-links": {
                     "selector": "div.a-section.a-spacing-none.puis-padding-right-small.s-title-instructions-style &amp;gt; h2 &amp;gt; a",
                     "type": "list",
                     "output": "@href"
                 },

                }
})

if response.ok:
    print(response.json())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you want to test the provided code yourself, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create a ScrapingBee account.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replace the placeholder text in the code with your own ScrapingBee API key.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you have completed these steps and run the code, you can expect to see results similar to the example below printed to your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
   "product-titles":[
      "Apple iPhone 11, 64GB, Black - Unlocked (Renewed)",
      "Apple iPhone SE (2nd Generation), 64GB, Red - Unlocked (Renewed)",
      "Apple iPhone 12, 64GB, White - Fully Unlocked (Renewed)",
      "Apple iPhone 8, 64GB, Gold - Unlocked (Renewed)",
      "Apple iPhone 12 Mini, 64GB, Black - Unlocked (Renewed)",
      "Apple iPhone X, US Version, 64GB, Silver - Unlocked (Renewed)",
      "Apple iPhone XR, 64GB, Black - Unlocked (Renewed)",
      "Apple iPhone XS, US Version, 64GB, Space Gray - Unlocked (Renewed)",
      "Apple iPhone 8 Plus, US Version, 64GB, Gold - Unlocked (Renewed)",
      "Apple iPhone 14 Pro Max, 128GB, Space Black - Unlocked (Renewed)",
      "Apple iPhone 13, 256GB, Midnight - Unlocked (Renewed)",
      "Apple iPhone 11 Pro, 64GB, Midnight Green - Unlocked (Renewed)",
      "iPhone 13 Mini, 128GB, Pink - Unlocked (Renewed)",
      "Apple iPhone 12 Pro, 256GB, Gold - Fully Unlocked (Renewed)",
      "Apple iPhone SE 3rd Gen, 64GB, Midnight - Unlocked (Renewed)",
      "Apple iPhone 14, 512GB, Purple - Unlocked (Renewed Premium)"
   ],
   "product-prices":[
      "$305.55",
      "$147.00",
      "$394.95",
      "$137.99",
      "$308.99",
      "$223.00",
      "$214.75",
      "$232.00",
      "$189.99",
      "$1,019.99",
      "$629.99",
      "$388.00",
      "$494.99",
      "$584.99",
      "$257.99",
      "$875.00"
   ],
   "product-links":[
      "/Apple-iPhone-11-64GB-Black/dp/B07ZPKN6YR/ref=sr_1_1?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-1",
      "/Apple-iPhone-SE-64GB-Red/dp/B088N8TF64/ref=sr_1_2?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-2",
      "/Apple-iPhone-12-64GB-White/dp/B08PPBQM23/ref=sr_1_3?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-3",
      "/Apple-iPhone-Fully-Unlocked-64GB/dp/B0775717ZP/ref=sr_1_4?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-4",
      "/Apple-iPhone-12-Mini-Black/dp/B08PPDJWC8/ref=sr_1_5?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-5",
      "/Apple-iPhone-Fully-Unlocked-64GB/dp/B07C357FSJ/ref=sr_1_6?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-6",
      "/Apple-iPhone-XR-Fully-Unlocked/dp/B07P6Y7954/ref=sr_1_7?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-7",
      "/Apple-iPhone-64GB-Space-Gray/dp/B07SC58QBW/ref=sr_1_8?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-8",
      "/Apple-iPhone-Plus-Fully-Unlocked/dp/B07757LZ1J/ref=sr_1_9?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-9",
      "/Apple-iPhone-14-Pro-Max/dp/B0BN94DL3R/ref=sr_1_10?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-10",
      "/Apple-iPhone-13-256GB-Midnight/dp/B09LNCVCKW/ref=sr_1_11?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-11",
      "/Apple-iPhone-64GB-Midnight-Green/dp/B07ZQRMWVB/ref=sr_1_12?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-12",
      "/Apple-iPhone-13-Mini-128GB/dp/B09LKF2RPP/ref=sr_1_13?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-13",
      "/Apple-iPhone-Pro-256GB-Gold/dp/B08PN7R2MZ/ref=sr_1_14?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-14",
      "/Apple-iPhone-SE-3rd-Midnight/dp/B0BDY71GRG/ref=sr_1_15?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-15",
      "/Apple-iPhone-14-512GB-Purple/dp/B0BYKX35NT/ref=sr_1_16?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-16"
   ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this specific request, using ScrapingBee's API with the default configurations (Rotating Proxy and JavaScript rendering), I was charged 5 API credits. Despite making multiple requests to &lt;a href="http://Amazon.com"&gt;Amazon.com&lt;/a&gt;, I did not encounter any blocking issues when using the API's default settings, which is a good sign for the service's reliability.&lt;/p&gt;

&lt;p&gt;However, as our operation scales up, it is reasonable to assume that we would require more reliable and costly proxies to sustain this level of performance. So, let's see how we can enable different proxy options using ScrapingBee's API.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Using proxies in ScrapingBee&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Enabling proxies in ScrapingBee is straightforward. To use a specific proxy type, you just need to include the corresponding parameter and set it to "True". For instance, to utilize the Premium Proxy, you would add &lt;code&gt;"premium_proxy=True"&lt;/code&gt; to your response parameters, as shown below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from scrapingbee import ScrapingBeeClient # Importing SPB's client
client = ScrapingBeeClient(api_key='YOUR_API_KEY')

response = client.get("https://www.amazon.com/s?k=iphone&amp;amp;crid=1BIGRK4NGFLDS&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;ref=nb_sb_noss_2", params={
# Choose the proxy type you want by adding the premium_proxy, stealth_proxy or own_proxy parameters
'premium_proxy': 'True',
'extract_rules':{
                 "product-titles": {
                     "selector": "div.a-section.a-spacing-none.puis-padding-right-small.s-title-instructions-style &amp;gt; h2 &amp;gt; a &amp;gt; span",
                     "type": "list",
                 },
                  "product-prices": {
                      "selector": "div.a-section.a-spacing-none.a-spacing-top-micro.puis-price-instructions-style &amp;gt; div &amp;gt; a &amp;gt; span &amp;gt; span.a-offscreen",
                      "type": "list",
                  },
                  "product-links": {
                     "selector": "div.a-section.a-spacing-none.puis-padding-right-small.s-title-instructions-style &amp;gt; h2 &amp;gt; a",
                     "type": "list",
                     "output": "@href"
                 },

                }
})

if response.ok:
    print(response.json())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It's worth mentioning that enabling this option can enhance the reliability of our data extraction process by reducing the risk of our bot being blocked. However, it's important to note that this improvement comes at a higher cost per request.&lt;/p&gt;

&lt;p&gt;For instance, in my case, using the Premium Proxy and JavaScript rendering for this request consumed 25 credits, which is a fivefold increase compared to the 5 credits spent when using the default Proxy rotation configuration.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Limitations of the ScrapingBee web scraping API&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Although I was pleasantly surprised by the ease of extracting the desired data and the low incidence of blocked requests, I found it frustrating that the API had limitations when it came to more complex operations. For instance, if I were building my own scraper, I could easily handle Amazon's pagination and extract data from all the search results while maintaining complete control over the scraper's behavior. However, achieving a similar outcome using ScrapingBee's API was not immediately apparent, and their documentation lacked information on this matter.&lt;/p&gt;
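&lt;p&gt;One workaround is to drive pagination yourself: generate one search URL per results page and issue a separate API request for each. The sketch below assumes Amazon exposes the results page number through a &lt;code&gt;page&lt;/code&gt; query parameter (an assumption about Amazon's URL structure, not something ScrapingBee's API documents or guarantees):&lt;/p&gt;

```python
# Hypothetical pagination helper: ScrapingBee's API scrapes one URL per
# request, so we build the paginated search URLs ourselves and would then
# call client.get() on each one with the same extract_rules as before.
from urllib.parse import urlencode

def amazon_search_urls(keyword: str, pages: int) -> list[str]:
    """Build one Amazon search URL per results page (URL shape assumed)."""
    base = "https://www.amazon.com/s"
    return [f"{base}?{urlencode({'k': keyword, 'page': n})}"
            for n in range(1, pages + 1)]

urls = amazon_search_urls("iphone", 3)
for url in urls:
    print(url)
    # response = client.get(url, params={'extract_rules': ...})
    # collect response.json() into your dataset here
```

&lt;p&gt;Keep in mind that each page still consumes API credits according to the table above, so paginating multiplies your credit consumption accordingly.&lt;/p&gt;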

&lt;p&gt;Furthermore, the simplicity of ScrapingBee's pricing system has both positive and negative aspects. It is reassuring to know the exact number of credits each request will cost based on the chosen parameters. However, I would have appreciated a more detailed breakdown of my usage and charges within ScrapingBee's dashboard for better transparency.&lt;/p&gt;

&lt;p&gt;Lastly, I missed having convenient access to an integrated cloud infrastructure like &lt;a href="https://apify.com/"&gt;Apify&lt;/a&gt; or Zyte. While I understand this is not ScrapingBee's primary focus, having an all-in-one solution for my web scraping needs would save considerable time and effort compared with searching for and paying for separate services to host my data extraction workflows.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion and final considerations&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In conclusion, the ScrapingBee Data Extraction API offers a reliable solution for developers seeking a straightforward method to extract data from websites without the complexities of building a scraper from scratch. However, if you require a more comprehensive solution with a wider range of pre-built features and greater control over your applications and data extraction process, relying solely on ScrapingBee may not fully meet your needs.&lt;/p&gt;

&lt;p&gt;Finally, I want to emphasize that this post serves as an introductory analysis and guide to ScrapingBee's service, assisting developers in determining if it is the right choice for them. It is important to note that not all features provided by their API have been explored in this article.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is the first in a series of articles we commissioned from an external developer (although Percival is a former Apifier). We want to create unbiased reviews of other web scraping platforms and companies as part of our continued evaluation of the web scraping industry.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you find yourself intrigued by ScrapingBee, I encourage you to further &lt;a href="https://www.scrapingbee.com/documentation/"&gt;explore the ScrapingBee documentation&lt;/a&gt; for a more in-depth understanding of the platform's capabilities.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/best-web-scraping-api/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--1CtZqB0G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/05/neon-datascape-illustrating-web-scraping-api.jpg" height="449" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/best-web-scraping-api/" rel="noopener noreferrer" class="c-link"&gt;
          Best web scraping APIs in 2023
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          We explore 10 top-notch web scraping API options.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>webscraping</category>
    </item>
    <item>
      <title>How to automate forms with JavaScript and Playwright</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Sun, 16 Jul 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/apify/how-to-automate-forms-with-javascript-and-playwright-4077</link>
      <guid>https://dev.to/apify/how-to-automate-forms-with-javascript-and-playwright-4077</guid>
      <description>&lt;p&gt;Whether we need to collect data from multiple sources, perform form testing, or automate mundane form submissions, learning how to submit forms with Playwright and JavaScript can help us automate these tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0glji67qv9s6sj9zv3rl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0glji67qv9s6sj9zv3rl.png" alt="Automating forms is easy with Playwright: follow our guide to learn why and how" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Automating forms is easy with Playwright: follow our guide to learn why and how&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why automate forms?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we get into the technical details, let's consider a few scenarios where form automation can be beneficial:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data collection&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Imagine we need to gather data from various online marketplaces, including product details, prices, and more. The traditional manual approach is time-consuming and error-prone, requiring navigation to each product page and form filling. However, automation tools like Playwright enable us to automatically navigate these marketplaces, extract the necessary information, and populate a database, saving time and reducing errors.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;a href="https://blog.apify.com/how-to-scrape-the-web-with-playwright-ece1ced75f73/" rel="noopener noreferrer"&gt;&lt;strong&gt;How to scrape the web with Playwright in 2023&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Form testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When &lt;a href="https://citrusbug.com/outsourcing-services/web-application-development-company" rel="noopener noreferrer"&gt;developing web applications&lt;/a&gt;, testing the functionality and behavior of forms is crucial. Manually testing each scenario can be repetitive and inefficient. By automating testing with form submissions, input validation testing, and error handling, we ensure the smooth operation of our applications.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔍&lt;/strong&gt; &lt;a href="https://blog.apify.com/11-best-automated-browser-testing-tools-for-developers/" rel="noopener noreferrer"&gt;&lt;strong&gt;11 best automated browser testing tools for developers&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/11-best-automated-browser-testing-tools-for-developers/" rel="noopener noreferrer"&gt;Read about automated browser testing and the best tools for testing your web apps.&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Repetitive tasks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Many online services require filling in forms repeatedly, such as submitting support requests, job applications, or entering sweepstakes. Again, automating these repetitive tasks will save time and effort. This can be done for individuals, such as if you want to repeatedly enter your details, or at scale, for companies or &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; projects.&lt;/p&gt;

&lt;p&gt;By the end of this article, you will have a solid understanding of how to automate forms using Playwright. You can then apply the techniques we will learn here to your own projects. So, let's get coding!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What you'll need to start automating forms&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before getting started, there are a few prerequisites you should have in place:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Basic understanding of HTML forms and browser DevTools:&lt;/strong&gt; It's helpful to have a fundamental understanding of HTML forms and their structure as well as being able to use your browser DevTools to inspect elements on a webpage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JavaScript knowledge:&lt;/strong&gt; Since we'll be using JavaScript and Node.js as the programming language for our project, it's essential to have at least a basic understanding of concepts such as variables, functions, and asynchronous programming in JavaScript.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Node.js installation:&lt;/strong&gt; Ensure that Node.js is installed on your local machine. You can download and install the latest version of Node.js from the official &lt;a href="https://nodejs.org/" rel="noopener noreferrer"&gt;Node.js website&lt;/a&gt; and use &lt;a href="https://blog.apify.com/how-to-install-nodejs/" rel="noopener noreferrer"&gt;our guide on how to install it correctly&lt;/a&gt;. Node.js will allow us to run JavaScript code outside of the browser environment.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need to have any experience with Playwright, but you might like to &lt;a href="https://blog.apify.com/what-is-playwright/" rel="noopener noreferrer"&gt;find out more about it&lt;/a&gt; before you start.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting up your form automation project&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Create a new project directory&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Choose a suitable location on your computer and create a new directory for your project. You can name it anything you like, for example &lt;code&gt;form-automation-project&lt;/code&gt;. So, let's open the terminal, navigate to the desired location, and use the following command to create the directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir form-automation-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Initialize a new Node.js project&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Change into the newly created project directory and initialize a new Node.js project by running the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd form-automation-project
npm init -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command generates a &lt;code&gt;package.json&lt;/code&gt; file that will keep track of the project's dependencies and configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Install Playwright&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With the project initialized, we can now install the Playwright library. In your terminal or command prompt, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install playwright
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Update package.json to use module syntax&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To enable the use of the JavaScript module syntax when building our project, add &lt;code&gt;"type": "module"&lt;/code&gt; to your &lt;code&gt;package.json&lt;/code&gt; file. This syntax allows us to take advantage of the ES module system, which provides a more standardized and modern approach to organizing and importing JavaScript code.&lt;/p&gt;
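&lt;p&gt;With that change applied, a minimal &lt;code&gt;package.json&lt;/code&gt; would look something like this (the version numbers shown are illustrative):&lt;/p&gt;

```json
{
  "name": "form-automation-project",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "playwright": "^1.35.0"
  }
}
```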

&lt;h2&gt;
  
  
  &lt;strong&gt;Launching the browser and navigating to the form&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that our project is set up, we can begin automating the form by launching a browser instance and navigating to the target webpage containing the form.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a new JavaScript file:&lt;/strong&gt; In the project directory, let's create a new JavaScript file to hold the logic for our bot. You can name it &lt;code&gt;bot.js&lt;/code&gt; or choose any other suitable name.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Import Playwright modules&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Open the newly created &lt;code&gt;bot.js&lt;/code&gt; file in your preferred code editor. At the top of the file, let's import the necessary Playwright modules using the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { chromium } from 'playwright';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This imports the &lt;code&gt;chromium&lt;/code&gt; module from Playwright, which allows us to automate Chromium-based browsers like Google Chrome.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Launch a browser instance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Below the import statement, add the following code to launch a new browser instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async function submitForm() =&amp;gt; {
  const browser = await chromium.launch({ headless: false });
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code uses an asynchronous function to launch a browser instance with Playwright's &lt;code&gt;launch()&lt;/code&gt; method. The &lt;code&gt;browser&lt;/code&gt; variable will hold the browser instance for further interactions. Note that we are also passing the parameter &lt;code&gt;headless: false&lt;/code&gt; so we can see our bot in action.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Navigate to the target web page&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To navigate to the web page containing the form, add the following code after launching the browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async function submitForm() =&amp;gt; {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto('https://www.example.com/form'); // Replace with the actual URL of the form
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this code, we create a new context and a new page within that context. Then, we use the &lt;code&gt;goto()&lt;/code&gt; method to navigate to the URL of the webpage containing the form. Make sure to replace &lt;code&gt;'https://www.example.com/form'&lt;/code&gt; with the actual URL of the form you want to automate.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📄 If you would like to test your code on an actual form, you can follow along with this &lt;a href="https://docs.google.com/forms/d/e/1FAIpQLScMXkPc5uVQaPTFrHNY1Sb4C6n0WCxd2R6gk5ZVhh4KOvvt-Q/viewform" rel="noopener noreferrer"&gt;example Google Form&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Locating and filling form fields&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that we have successfully navigated to the webpage containing the form, we can proceed to locate the form fields and fill them in with our desired values.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Locate form fields&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To interact with form fields, we need to locate them on the web page. Inspect the HTML structure of the form to identify the attributes or selectors we can use to locate each field. Common attributes include &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;id&lt;/code&gt;, and &lt;code&gt;class&lt;/code&gt;. For example, let's assume we have an input field with the name attribute "firstName". We can locate it using Playwright's &lt;code&gt;page.locator()&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const firstNameField = page.locator('input[name="firstName"]');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, in our Google form, we first have to inspect the input field we want to fill in order to find the selectors we can then use to target it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uz2sowrwp2its8naixl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uz2sowrwp2its8naixl.png" alt="Form Field Selector" width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And now we can target it by using the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const firstNameField = page.locator('input[aria-labelledby="i1"]')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Repeat this approach to target all the form fields you need to fill in. For example, these are the selectors for each of the fields present in our dummy Google form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const firstNameField = page.locator('input[aria-labelledby="i1"]');
const emailField = page.locator('input[aria-labelledby="i5"]');
const addressField = page.locator('textarea[aria-labelledby="i9"]');
const phoneField = page.locator('input[aria-labelledby="i13"]');
const commentsField = page.locator('textarea[aria-labelledby="i17"]');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Fill form fields&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once we have located a form field, we can fill it with the desired value using Playwright's &lt;code&gt;fill()&lt;/code&gt; method. Add the following code after locating the field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await firstNameField.fill('John');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;'John'&lt;/code&gt; with the value you want to fill in the field.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handle different field types&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Form fields can vary in type, such as text inputs, checkboxes, radio buttons, dropdown menus, and file upload fields. Use Playwright's appropriate methods to interact with each field type. For example, to check a checkbox, use the &lt;code&gt;check()&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const checkboxField = page.locator('input[name="acceptTerms"]');
await checkboxField.check();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember to adjust the code based on the specific field types and actions you want to perform.&lt;/p&gt;
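&lt;p&gt;As a sketch of a few other common field types (the selectors below are hypothetical, not from our dummy form), helpers might look like this:&lt;/p&gt;

```javascript
// Hypothetical selectors for illustration; adjust them to your form's markup.
async function fillOtherFieldTypes(page) {
  // <select> dropdowns are filled with selectOption() rather than fill()
  await page.locator('select[name="country"]').selectOption('US');

  // Radio buttons use check(), just like checkboxes
  await page.locator('input[type="radio"][value="premium"]').check();
}
```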

&lt;p&gt;For instance, let's fill in the details for a fictitious John in our dummy form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Fill the form fields
await firstNameField.fill('John');
await emailField.fill('john@gmail.com');
await addressField.fill("John's Street");
await phoneField.fill('11111111');
await commentsField.fill('This form was submitted automatically.');
await checkboxField.check();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Submit the form&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once we have filled in all the necessary form fields, it's time to submit the form. Locate the submit button and click it using Playwright's &lt;code&gt;click()&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const submitButton = page.locator('button[type="submit"]');
await submitButton.click();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Close the browser&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After we successfully submit our form, it is important to explicitly tell Playwright to close the browser instance; otherwise, it would remain open even after the form is submitted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Close the browser
await browser.close();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Final code for automating the form&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;And here is what the final code for automating our dummy form looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { chromium } from 'playwright';

async function submitForm() {
    const browser = await chromium.launch({ headless: false });
    const context = await browser.newContext();
    const page = await context.newPage();

    await page.goto('https://forms.gle/7rhchFiZF2faMQxz7'); // Replace with the actual URL of the form

    await page.waitForSelector('div.lRwqcd &amp;gt; div', {
        state: 'visible',
    }); // Wait for form element to be visible on the page before proceeding

    // Select the form fields we want to target
    const firstNameField = page.locator('input[aria-labelledby="i1"]');
    const emailField = page.locator('input[aria-labelledby="i5"]');
    const addressField = page.locator('textarea[aria-labelledby="i9"]');
    const phoneField = page.locator('input[aria-labelledby="i13"]');
    const commentsField = page.locator('textarea[aria-labelledby="i17"]');
    const checkboxField = page.locator('div#i26');
    const submitButton = page.locator('div.lRwqcd &amp;gt; div');

    // Fill the form fields
    await firstNameField.fill('John');
    await emailField.fill('john@gmail.com');
    await addressField.fill("John's Street");
    await phoneField.fill('11111111');
    await commentsField.fill('This form was submitted automatically.');
    await checkboxField.check();

    // Submit the form
    await submitButton.click();

    // Close the browser
    await browser.close();
}

submitForm();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that in the final version of our bot, I added an extra line of code to explicitly wait for a specific element on the page to load before proceeding with the rest of the code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await page.waitForSelector('div.lRwqcd &amp;gt; div', {
        state: 'visible',
    });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step is not always necessary, but it is possible for the bot to attempt to "act" before the page is fully loaded. This can lead to an error due to the bot's inability to interact with the target element.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Automating multiple form submissions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In some cases, you may need to automate multiple form submissions, such as when you have a batch of data to process or want to simulate user interactions. Here's how you can automate multiple form submissions using Playwright:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Encapsulate form submission logic&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To automate the submission of multiple forms, it is beneficial to encapsulate the form submission logic into a reusable function. We have already done this in the previous section, so we can continue using the form function we have created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async function submitForm() {
  // Locate and fill form fields
  // Submit the form
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Use a loop&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Determine the number of times you want to submit the form and use a loop to automate the process. For example, if we want to submit the form five times, we can use a &lt;code&gt;for&lt;/code&gt; loop as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (let i = 0; i &amp;lt; 5; i++) {
  await submitForm();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adjust the loop conditions and the number of iterations based on your requirements.&lt;/p&gt;
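&lt;p&gt;If each submission should carry different values, one approach (a sketch; &lt;code&gt;submitForm&lt;/code&gt; is assumed here to accept the data to fill in, and the field names are hypothetical) is to loop over an array of records:&lt;/p&gt;

```javascript
// Sketch: one submission per record. Assumes a submitForm(record) variant
// that fills the form with the given values; field names are hypothetical.
const records = [
  { firstName: 'John', email: 'john@gmail.com' },
  { firstName: 'Jane', email: 'jane@gmail.com' },
];

async function submitAll(records, submitForm) {
  for (const record of records) {
    await submitForm(record); // submissions run one after another
  }
}
```

&lt;p&gt;Calling &lt;code&gt;await submitAll(records, submitForm)&lt;/code&gt; then submits the form once for every record.&lt;/p&gt;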

&lt;h3&gt;
  
  
  &lt;strong&gt;Optional: add delays between submissions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In some cases, it may be necessary to introduce delays between form submissions to mimic user behavior or account for server response times. We can use Playwright's &lt;code&gt;waitForTimeout&lt;/code&gt; method to add delays between submissions within the loop.&lt;/p&gt;

&lt;p&gt;For example, we could add a 2-second delay right before we tell Playwright to close:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Wait before closing the browser
await page.waitForTimeout(2000); // Wait for 2 seconds before the next submission

// Close the browser
await browser.close();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adjust the delay duration as needed.&lt;/p&gt;

&lt;p&gt;By encapsulating the form submission logic in a function and using a loop, we can automate as many form submissions as we want!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Error handling and debugging&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When automating forms, it's essential to handle potential errors and have mechanisms in place for debugging. Here are some techniques we can use to do that:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Try-catch blocks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We can wrap our code inside try-catch blocks to catch and handle any errors that may occur during form automation. This allows us to gracefully handle exceptions and prevent our automation script from crashing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try {
  // Form automation code
} catch (error) {
  console.error('An error occurred:', error);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By logging the error to the console or reporting it in some other way, you can quickly identify and diagnose issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Logging and debugging statements&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;console.log()&lt;/code&gt; statements strategically throughout the code to output useful information. These statements can help us track the execution flow, inspect variable values, and identify potential issues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;console.log('Filling in the first name field...');
// Your code for filling in the first name field
console.log('First name field filled successfully.');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By logging key steps or variables, you can gain insights into what's happening during the automation process.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Taking screenshots&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Playwright allows us to take screenshots of the browser at any point during the automation process. Capture screenshots to visualize the state of the page and potentially identify issues or unexpected behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await page.screenshot({ path: 'screenshot.png' });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the screenshot to a file for later examination.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Inspecting network traffic&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Playwright provides tools for inspecting network traffic, which can be useful for debugging. We can intercept network requests, analyze responses, and verify data being sent and received.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;page.on('response', (response) =&amp;gt; {
  console.log('Received response:', response.url());
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the &lt;code&gt;page.on()&lt;/code&gt; method to listen for specific events related to network traffic.&lt;/p&gt;
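&lt;p&gt;As a minimal sketch, a helper that logs every request and response a page makes could look like this (&lt;code&gt;page&lt;/code&gt; is assumed to be a Playwright &lt;code&gt;Page&lt;/code&gt;, or anything exposing the same event API):&lt;/p&gt;

```javascript
// Attach request/response logging to a Playwright page.
// The handlers fire as the page makes network calls.
function attachNetworkLogging(page) {
  page.on('request', (request) => {
    console.log('>>', request.method(), request.url());
  });
  page.on('response', (response) => {
    console.log('<<', response.status(), response.url());
  });
}
```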

&lt;h3&gt;
  
  
  &lt;strong&gt;Code including error handling and debugging&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, let's update our previous code to include the error handling and debugging techniques we discussed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { chromium } from 'playwright';

async function submitForm() {
    const browser = await chromium.launch({ headless: false });
    const context = await browser.newContext();
    const page = await context.newPage();

    await page.goto('https://forms.gle/7rhchFiZF2faMQxz7'); // Replace with the actual URL of the form

    await page.waitForSelector('div.lRwqcd &amp;gt; div', {
        state: 'visible',
    });

    try {
        // Select the form fields we want to target
        const firstNameField = page.locator('input[aria-labelledby="i1"]');
        const emailField = page.locator('input[aria-labelledby="i5"]');
        const addressField = page.locator('textarea[aria-labelledby="i9"]');
        const phoneField = page.locator('input[aria-labelledby="i13"]');
        const commentsField = page.locator('textarea[aria-labelledby="i17"]');
        const checkboxField = page.locator('div#i26');
        const submitButton = page.locator('div.lRwqcd &amp;gt; div');

        // Fill the form fields
        console.log('Filling in the first name field...');
        await firstNameField.fill('John');
        console.log('Filling in the email field...');
        await emailField.fill('john@gmail.com');
        console.log('Filling in the address field...');
        await addressField.fill("John's Street");
        console.log('Filling in the phone number field...');
        await phoneField.fill('11111111');
        console.log('Filling in the comments field...');
        await commentsField.fill('This form was submitted automatically.');
        console.log('Checking the box ...');
        await checkboxField.check();

        // Take screenshot of the completed form
        await page.screenshot({ path: 'screenshot.png' });

        // Submit the form
        await submitButton.click();

        // Wait before closing the browser
        await page.waitForTimeout(2000); // Wait for 2 seconds before the next submission

        // Close the browser
        await browser.close();
    } catch (error) {
        console.error('Oops, something went wrong:', error);
    }
}

for (let i = 0; i &amp;lt; 5; i++) {
    await submitForm();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Advanced techniques&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this section, we'll explore some advanced techniques for form automation with Playwright. These techniques can help us handle more complex scenarios and overcome common challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling dynamic forms&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some forms may have fields that appear or disappear dynamically based on user interactions or other factors. We can handle such forms by employing techniques like waiting for specific elements to appear or disappear using Playwright's &lt;a href="https://playwright.dev/docs/api/class-elementhandle#element-handle-wait-for-selector" rel="noopener noreferrer"&gt;&lt;code&gt;waitForSelector()&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://playwright.dev/docs/api/class-frame#frame-wait-for-function" rel="noopener noreferrer"&gt;&lt;code&gt;waitForFunction()&lt;/code&gt;&lt;/a&gt; methods.&lt;/p&gt;
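&lt;p&gt;For instance, a sketch of waiting for a field that is only revealed after some other interaction (the &lt;code&gt;#extra-field&lt;/code&gt; selector and the condition are hypothetical examples):&lt;/p&gt;

```javascript
// Wait for a dynamically revealed field before interacting with it.
// The selector and condition below are hypothetical examples.
async function fillDynamicField(page, value) {
  // Wait until the element is attached and visible on the page
  await page.waitForSelector('#extra-field', { state: 'visible' });

  // Alternatively, wait for an arbitrary condition evaluated in the page
  await page.waitForFunction(() => document.querySelector('#extra-field') !== null);

  await page.locator('#extra-field').fill(value);
}
```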

&lt;h3&gt;
  
  
  &lt;strong&gt;Working with CAPTCHAs and anti-bot measures&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Websites often employ &lt;a href="https://blog.apify.com/why-captchas-are-bad/" rel="noopener noreferrer"&gt;CAPTCHAs&lt;/a&gt; or other anti-bot measures to prevent automated interactions. Automating forms that include CAPTCHAs can be challenging. Consider using third-party services or libraries specifically designed to bypass CAPTCHAs or explore browser automation techniques like mouse movements or human-like delays to &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;mimic user behavior&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling file uploads&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If the form includes file upload fields, we can automate file uploads using Playwright's &lt;a href="https://playwright.dev/docs/api/class-elementhandle#element-handle-set-input-files" rel="noopener noreferrer"&gt;&lt;code&gt;setInputFiles()&lt;/code&gt;&lt;/a&gt; method. Specify the path to the file we want to upload, and Playwright will handle the file selection process for us.&lt;/p&gt;
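&lt;p&gt;A sketch of what that looks like (the selector and file paths are hypothetical):&lt;/p&gt;

```javascript
// Upload files through an <input type="file"> element.
// The selector and paths are hypothetical; point them at your own form and files.
async function uploadFiles(page) {
  const fileInput = page.locator('input[type="file"]');

  await fileInput.setInputFiles('documents/resume.pdf');        // single file
  await fileInput.setInputFiles(['photo1.png', 'photo2.png']);  // multiple files
  await fileInput.setInputFiles([]);                            // clear the selection
}
```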

&lt;h3&gt;
  
  
  &lt;strong&gt;Navigating between pages&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Sometimes, form automation may require navigating between multiple pages or steps. Use Playwright's page navigation methods like &lt;a href="https://playwright.dev/docs/api/class-frame#frame-goto" rel="noopener noreferrer"&gt;&lt;code&gt;goto()&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://playwright.dev/docs/input#mouse-click" rel="noopener noreferrer"&gt;&lt;code&gt;click()&lt;/code&gt;&lt;/a&gt; to move between pages and perform interactions on each page as needed.&lt;/p&gt;
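&lt;p&gt;For example, a two-step form might be handled like this (the URL and selectors are hypothetical):&lt;/p&gt;

```javascript
// Sketch of a multi-step form: fill step one, click "Next", then fill step two.
// The URL and selectors are hypothetical examples.
async function fillTwoStepForm(page) {
  await page.goto('https://www.example.com/form');

  await page.locator('input[name="firstName"]').fill('John');
  await page.locator('button#next').click();

  // Wait for the second step to render before interacting with it
  await page.waitForSelector('input[name="phone"]');
  await page.locator('input[name="phone"]').fill('11111111');
}
```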

&lt;h3&gt;
  
  
  &lt;strong&gt;Parallelization and performance optimization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To improve performance and reduce execution time, we can employ parallelization techniques. For example, we can use multiple browser contexts or instances of Playwright to run form automation in parallel, which is especially useful when dealing with a large number of forms or with submissions that take a long time to process.&lt;/p&gt;
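&lt;p&gt;A sketch of this idea, assuming a &lt;code&gt;fillAndSubmit(page, record)&lt;/code&gt; helper along the lines of what we built earlier:&lt;/p&gt;

```javascript
// Run one submission per record concurrently, each in its own browser context.
// fillAndSubmit is an assumed helper that fills and submits the form on a page.
async function submitInParallel(browser, records, fillAndSubmit) {
  await Promise.all(records.map(async (record) => {
    const context = await browser.newContext(); // isolated cookies/session per record
    const page = await context.newPage();
    try {
      await fillAndSubmit(page, record);
    } finally {
      await context.close(); // always clean up, even if a submission fails
    }
  }));
}
```

&lt;p&gt;Each context behaves like a separate incognito profile, so parallel submissions don't share cookies or state.&lt;/p&gt;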

&lt;p&gt;Remember, advanced techniques depend on the specific requirements and challenges of the forms you're automating. It's essential to understand the unique aspects of each form and apply the appropriate techniques accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;That's a wrap! Now you know how to automate forms with Playwright 🦾&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we explored how to automate forms using Playwright and JavaScript. We covered the essential steps to build a form-filling bot and provided explanations and code examples along the way. By now, you should have a solid understanding of automating forms using Playwright!&lt;/p&gt;

&lt;p&gt;We also discussed the importance of form automation in various real-world scenarios, including data collection, form testing, and automating repetitive tasks. By automating form submissions, you can save time, reduce errors, and increase efficiency.&lt;/p&gt;

&lt;p&gt;Now that you have learned how to automate forms using Playwright and JavaScript, feel free to apply these techniques to your own projects and explore further possibilities for automation.&lt;/p&gt;

&lt;p&gt;And remember to always respect the terms of service and usage policies of the websites you are automating. Use form automation responsibly and ethically, ensuring that your actions comply with legal and ethical standards. In other words, take Uncle Ben's advice to heart: " &lt;strong&gt;&lt;em&gt;With great power comes great responsibility&lt;/em&gt;&lt;/strong&gt;" 🕷&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔍&lt;/strong&gt; &lt;a href="https://blog.apify.com/playwright-vs-puppeteer-which-is-better/" rel="noopener noreferrer"&gt;&lt;strong&gt;Playwright vs. Puppeteer: which is better?&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/playwright-vs-puppeteer-which-is-better/" rel="noopener noreferrer"&gt;Two powerful Node.js libraries: described and&lt;/a&gt;&lt;a href="https://blog.apify.com/playwright-vs-puppeteer-which-is-better/" rel="noopener noreferrer"&gt;&lt;strong&gt;Playwright vs. Puppeteer: which is better?&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>automation</category>
      <category>playwright</category>
      <category>node</category>
    </item>
    <item>
      <title>What are the best Python web scraping libraries?</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Mon, 22 May 2023 12:15:58 +0000</pubDate>
      <link>https://dev.to/apify/what-are-the-best-python-web-scraping-libraries-2g4l</link>
      <guid>https://dev.to/apify/what-are-the-best-python-web-scraping-libraries-2g4l</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-web-scraping-and-web-scraping-tools/"&gt;Web scraping&lt;/a&gt; is essentially a way to automate the process of extracting data from the web, and as a Python developer, you have access to some of the best libraries and frameworks available to help you get the job done.&lt;/p&gt;

&lt;p&gt;We're going to take a look at some of the most popular Python libraries and frameworks for web scraping and compare their pros and cons, so you know exactly what tool to use to tackle any web scraping project you might come across.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/r_n5_8NtHVc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;HTTP Libraries - Requests and HTTPX&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;First up, let's talk about HTTP libraries. These are the foundation of web scraping since every scraping job starts by making a request to a website and retrieving its contents, usually as HTML.&lt;/p&gt;

&lt;p&gt;Two popular HTTP libraries in Python are Requests and HTTPX.&lt;/p&gt;

&lt;p&gt;Requests is easy to use and great for simple scraping tasks, while HTTPX offers some advanced features like async and HTTP/2 support.&lt;/p&gt;

&lt;p&gt;Their core functionality and syntax are very similar, so I would recommend HTTPX even for smaller projects since you can easily scale up in the future without compromising performance.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;HTTPX&lt;/th&gt;
&lt;th&gt;Requests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Asynchronous&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTTP/2 support&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timeout support&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Proxy support&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TLS verification&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom exceptions&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Parsing HTML with Beautiful Soup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once you have the HTML content, you need a way to parse it and extract the data you're interested in.&lt;/p&gt;

&lt;p&gt;Beautiful Soup is the most popular HTML parser in Python, allowing you to easily navigate and search through the HTML tree structure. Its straightforward syntax and easy setup also make Beautiful Soup a great option for small to medium web scraping projects as well as web scraping beginners.&lt;/p&gt;

&lt;p&gt;The two major drawbacks of Beautiful Soup are its inability to scrape JavaScript-heavy websites and its limited scalability, which results in low performance in large-scale projects. For large projects, you would be better off using Scrapy, but more about that later.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-with-beautiful-soup/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--onzJfXsH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/03/6502423.jpg" height="800" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-with-beautiful-soup/" rel="noopener noreferrer" class="c-link"&gt;
          Web scraping with Beautiful Soup and Requests
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Detailed tutorial with code examples. And some handy tricks.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Next, let's take a look at how Beautiful Soup works in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;httpx&lt;/span&gt;

&lt;span class="c1"&gt;# Send an HTTP GET request to the specified URL using the httpx library
&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;https://news.ycombinator.com/news&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save the content of the response
&lt;/span&gt;
&lt;span class="n"&gt;yc_web_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c1"&gt;# Use the BeautifulSoup library to parse the HTML content of the webpage
&lt;/span&gt;
&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yc_web_page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Find all elements with the class "athing" (which represent articles on Hacker News) using the parsed HTML
&lt;/span&gt;
&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"athing"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Loop through each article and extract relevant data, such as the URL, title, and rank
&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="s"&gt;"URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"titleline"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"a"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'href'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# Find the URL of the article by finding the first "a" tag within the element with class "titleline"
&lt;/span&gt;
&lt;span class="s"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"titleline"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;# Find the title of the article by getting the text content of the element with class "titleline"
&lt;/span&gt;
&lt;span class="s"&gt;"rank"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"rank"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Find the rank of the article by getting the text content of the element with class "rank" and removing the period character
&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Print the extracted data for the current article
&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Explaining the code:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1 - We start by sending an HTTP GET request to the specified URL using the HTTPX library. Then, we save the retrieved content to a variable.&lt;/p&gt;

&lt;p&gt;2 - Now, we use the Beautiful Soup library to parse the HTML content of the webpage.&lt;/p&gt;

&lt;p&gt;3 - This enables us to manipulate the parsed content using Beautiful Soup methods, such as &lt;code&gt;find_all&lt;/code&gt; to find the content we need. In this particular case, we are finding all elements with the class &lt;code&gt;athing&lt;/code&gt;, which represents articles on Hacker News.&lt;/p&gt;

&lt;p&gt;4 - Next, we loop through all the articles on the page and use Beautiful Soup's &lt;code&gt;find&lt;/code&gt; methods with class selectors to pinpoint the data we want to extract from each article. Finally, we print the scraped data to the console.&lt;/p&gt;
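&lt;p&gt;The four steps above can be condensed into a single runnable sketch. To keep it self-contained, it parses an inline HTML snippet that mimics Hacker News markup (an assumption for this sketch) instead of fetching the live page; to run it against the real site, swap the snippet for the text of an &lt;code&gt;httpx&lt;/code&gt; response:&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# Inline stand-in for httpx.get("https://news.ycombinator.com").text
html = """
<table>
  <tr class="athing">
    <td><span class="rank">1.</span></td>
    <td class="titleline"><a href="https://example.com/post">Example article</a></td>
  </tr>
</table>
"""

# Parse the HTML content with Beautiful Soup
soup = BeautifulSoup(html, "html.parser")

# Every article row on Hacker News has the class "athing"
articles = soup.find_all(class_="athing")

results = []
for article in articles:
    results.append({
        "URL": article.find(class_="titleline").find("a").get("href"),
        "title": article.find(class_="titleline").get_text(),
        "rank": article.find(class_="rank").get_text().replace(".", ""),
    })

print(results)
```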
&lt;h2&gt;
  
  
  &lt;strong&gt;Browser automation libraries - Selenium and Playwright&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What if the website you're scraping relies on JavaScript to load its content? In that case, an HTML parser won't be enough: you'll need to launch a browser instance to execute the page's JavaScript using a browser automation tool like &lt;a href="https://blog.apify.com/playwright-vs-selenium-webscraping/"&gt;Selenium or Playwright&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These are primarily testing and automation tools that allow you to control a web browser programmatically, including clicking buttons, filling out forms, and more. However, they are also often used in web scraping as a means to access dynamically generated data on a webpage.&lt;/p&gt;

&lt;p&gt;While Selenium and Playwright are very similar in their core functionality, Playwright is more modern and complete than Selenium.&lt;/p&gt;

&lt;p&gt;For example, Playwright offers some unique built-in features, such as automatically waiting for elements to be visible before performing actions, and an asynchronous version of its API built on &lt;code&gt;asyncio&lt;/code&gt;.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/what-is-playwright/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--dCMcBjS3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2022/10/Playwright-automation.jpg" height="600" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/what-is-playwright/" rel="noopener noreferrer" class="c-link"&gt;
          What is Playwright automation?
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Learn why Playwright is ideal for web scraping and automation.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;To illustrate how we can use Playwright for web scraping, let's quickly walk through a code snippet where we use Playwright to extract data from an Amazon product page and save a screenshot of it along the way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a dictionary with the scraped data
&lt;/span&gt;
&lt;span class="n"&gt;selectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'#productTitle'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'span.author a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'#productSubtitle'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'.a-size-base.a-color-price.a-color-price'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;book_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sel&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;selectors&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;book&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"book_title"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"author"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"edition"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;elem&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;book_data&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"book.png"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Explaining the code:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Import the necessary modules: &lt;code&gt;asyncio&lt;/code&gt; and &lt;code&gt;async_playwright&lt;/code&gt; from Playwright's async API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After importing the necessary modules, we define an async function called &lt;code&gt;main&lt;/code&gt; that launches a Firefox browser instance with &lt;code&gt;headless&lt;/code&gt; mode set to &lt;code&gt;False&lt;/code&gt; so we can actually see the browser working. It then creates a new page in the browser using the &lt;code&gt;new_page&lt;/code&gt; method and finally navigates to the Amazon website using the &lt;code&gt;goto&lt;/code&gt; method.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Next, we define a list of CSS selectors for the data we want to scrape. Then we use &lt;code&gt;asyncio.gather&lt;/code&gt; to execute the &lt;code&gt;page.query_selector&lt;/code&gt; method concurrently for all the selectors in the list and store the results in a &lt;code&gt;book_data&lt;/code&gt; variable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Now we can iterate over &lt;code&gt;book_data&lt;/code&gt; to populate the &lt;code&gt;book&lt;/code&gt; dictionary with the scraped data. Note that we filter out elements that are &lt;code&gt;None&lt;/code&gt; before unpacking, so if a selector stops matching, the unpacking fails loudly instead of silently storing bad data. This matters in practice because websites make small changes that can break your scraper, and you could expand on this example with more thorough checks to ensure the extracted data is not missing any values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, we print the &lt;code&gt;book&lt;/code&gt; dictionary contents to the console and take a screenshot of the scraped page, saving it as a file called &lt;code&gt;book.png&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As a last step, we make sure to close the browser instance.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
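&lt;p&gt;One caveat worth sketching: the tuple unpacking in the example assumes all four selectors matched, so if any element is filtered out as &lt;code&gt;None&lt;/code&gt;, the unpacking raises a &lt;code&gt;ValueError&lt;/code&gt;. A plain-Python alternative (a hypothetical helper, not part of the original script) pairs field names with elements and simply skips missing fields:&lt;/p&gt;

```python
def build_record(field_names, elements):
    """Pair field names with scraped elements, skipping fields whose
    selector matched nothing (element is None) instead of crashing."""
    return {name: elem for name, elem in zip(field_names, elements) if elem is not None}

# Hypothetical values standing in for the awaited inner_text() results;
# None simulates a selector that found no element on the page.
fields = ["book_title", "author", "edition", "price"]
texts = ["The Hitchhiker's Guide to the Galaxy", "Douglas Adams", None, "$5.99"]

book = build_record(fields, texts)
print(book)  # "edition" is simply absent rather than raising an error
```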


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/how-to-scrape-the-web-with-playwright-ece1ced75f73/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--w_BPB1GY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/03/How_to_scrape_web_with_Playwright.png" height="450" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/how-to-scrape-the-web-with-playwright-ece1ced75f73/" rel="noopener noreferrer" class="c-link"&gt;
          How to scrape the web with Playwright in 2023
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Complete Playwright web scraping and crawling tutorial.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;But wait! If browser automation tools can scrape virtually any webpage and, on top of that, make it easier for you to automate tasks and test and visualize your code working, why don't we just always use Playwright or Selenium for web scraping?&lt;/p&gt;

&lt;p&gt;Well, despite being powerful scraping tools, these libraries and frameworks have a noticeable drawback: &lt;strong&gt;launching a browser instance is a very resource-heavy operation compared to simply retrieving a page's HTML&lt;/strong&gt;. This can easily become a major performance bottleneck for large scraping jobs, which will not only take longer to complete but also become considerably more expensive. For that reason, we usually want to limit these tools to the tasks that truly need them and, when possible, &lt;strong&gt;use them together with&lt;/strong&gt; &lt;a href="https://blog.apify.com/beautiful-soup-vs-scrapy-web-scraping/"&gt;&lt;strong&gt;Beautiful Soup or Scrapy&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
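&lt;p&gt;One common hybrid pattern (a sketch, not the article's code) is to use the browser only to render the JavaScript and hand the resulting HTML to Beautiful Soup for parsing. The parsing half below is pure Python; the fetching half assumes Playwright is installed and is only needed for JavaScript-heavy pages:&lt;/p&gt;

```python
from bs4 import BeautifulSoup

def extract_titles(html):
    # Pure parsing step: works on any HTML string, no browser needed.
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text() for a in soup.select("h2 a")]

def fetch_rendered_html(url):
    # Browser step: assumes `pip install playwright` and `playwright install`.
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()  # fully rendered DOM, including JS-generated markup
        browser.close()
    return html

# Demo with a static snippet; for a real page you would call
# extract_titles(fetch_rendered_html(url)) instead.
sample = "<h2><a href='/a'>First</a></h2><h2><a href='/b'>Second</a></h2>"
print(extract_titles(sample))
```

&lt;p&gt;This way, the expensive browser work is isolated in one function and the parsing logic stays cheap and easy to test.&lt;/p&gt;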


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-with-scrapy/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--hcYQnnyB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/04/SCRAPING....png" height="450" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-with-scrapy/" rel="noopener noreferrer" class="c-link"&gt;
          Web Scraping with Scrapy
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          A hands-on guide for web scraping with Scrapy.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Scrapy&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Next up, we have the most popular and arguably the most powerful web scraping framework for Python.&lt;/p&gt;

&lt;p&gt;If you find yourself needing to scrape large amounts of data regularly, then Scrapy could be a great option.&lt;/p&gt;

&lt;p&gt;The Scrapy framework offers a full-fledged suite of tools to aid you even in the most complex scraping jobs.&lt;/p&gt;

&lt;p&gt;On top of its superior performance when compared to Beautiful Soup, Scrapy can also be easily integrated into other data-processing Python tools and even other libraries, such as Playwright.&lt;/p&gt;

&lt;p&gt;Not only that, but it comes with a handy collection of built-in features catered specifically to web scraping, such as:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Powerful and flexible spidering framework&lt;/td&gt;
&lt;td&gt;Scrapy provides a built-in spidering framework that allows you to easily define and customize web crawlers to extract the data you need.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast and efficient&lt;/td&gt;
&lt;td&gt;Scrapy is designed to be fast and efficient, allowing you to extract data from large websites quickly and with minimal resource usage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support for handling common web data formats&lt;/td&gt;
&lt;td&gt;Export data in multiple formats such as HTML, XML, and JSON.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extensible architecture&lt;/td&gt;
&lt;td&gt;Easily add custom functionality through middleware, pipelines, and extensions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distributed scraping&lt;/td&gt;
&lt;td&gt;Scrapy supports distributed scraping, allowing you to scale up your web scraping operation across multiple machines.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error handling&lt;/td&gt;
&lt;td&gt;Scrapy has robust error-handling capabilities, allowing you to handle common errors and exceptions that may occur during web scraping.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support for authentication and cookies&lt;/td&gt;
&lt;td&gt;Supports handling authentication and cookies to scrape websites that require login credentials.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration with other Python tools&lt;/td&gt;
&lt;td&gt;Scrapy can be easily integrated with other Python tools, such as data processing and storage libraries, making it a powerful tool for end-to-end data processing pipelines.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's an example of how to use a Scrapy Spider to scrape data from a website:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsSpiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'hackernews_spider'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'news.ycombinator.com'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'&amp;lt;http://news.ycombinator.com/&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'tr.athing'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="s"&gt;"URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".titleline a::attr(href)"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="s"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".titleline a::text"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="s"&gt;"rank"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".rank::text"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We can use the following command to run this script and save the resulting data to a JSON file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt; &lt;span class="n"&gt;crawl&lt;/span&gt; &lt;span class="n"&gt;hackernews&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="n"&gt;hackernews&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Explaining the code:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The code example uses Scrapy to scrape data from the Hacker News website (&lt;a href="http://news.ycombinator.com"&gt;news.ycombinator.com&lt;/a&gt;). Let's break down the code step by step:&lt;/p&gt;

&lt;p&gt;After importing the necessary modules, we define the Spider class we want to use:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsSpiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next, we set the Spider properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;name&lt;/code&gt;: The name of the spider (used to identify it).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;allowed_domains&lt;/code&gt;: A list of domains that the spider is allowed to crawl&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;start_urls&lt;/code&gt;: A list of URLs to start crawling from.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'hackernews_spider'&lt;/span&gt;
&lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'news.ycombinator.com'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'&amp;lt;http://news.ycombinator.com/&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then, we define the &lt;code&gt;parse&lt;/code&gt; method: This method is the entry point for the spider and is called with the response of the URLs specified in &lt;code&gt;start_urls&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In the &lt;code&gt;parse&lt;/code&gt; method, we extract data from the HTML response. The &lt;code&gt;response&lt;/code&gt; object represents the HTML page received from the website, and the spider uses CSS selectors to extract the relevant data from the HTML structure.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'tr.athing'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now we use a for loop to iterate over each article found on the page.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, for each article, the spider extracts the URL, title, and rank information using CSS selectors and yields a Python dictionary containing this data.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".titleline a::attr(href)"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="s"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".titleline a::text"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="s"&gt;"rank"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".rank::text"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
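&lt;p&gt;A small caveat about the &lt;code&gt;rank&lt;/code&gt; field: &lt;code&gt;.get()&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt; when the selector matches nothing, which would make the chained &lt;code&gt;.replace&lt;/code&gt; raise an &lt;code&gt;AttributeError&lt;/code&gt;, and the cleaned value is still a string. A hypothetical helper (not part of the original spider) makes that step more robust:&lt;/p&gt;

```python
def clean_rank(raw):
    """Turn the raw '.rank::text' value (e.g. "1.") into an int.
    Returns None instead of crashing when the selector matched nothing."""
    if raw is None:
        return None
    return int(raw.rstrip("."))

print(clean_rank("1."))   # 1
print(clean_rank(None))   # None
```

&lt;p&gt;Inside the spider, you would call it as &lt;code&gt;clean_rank(article.css(".rank::text").get())&lt;/code&gt;.&lt;/p&gt;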



&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/alternatives-scrapy-web-scraping/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--V1ydlS9C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/02/Scrapy-alternatives-for-web-scraping-2.png" height="450" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/alternatives-scrapy-web-scraping/" rel="noopener noreferrer" class="c-link"&gt;
          Scrapy alternatives: other web scraping libraries to try
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          5 Scrapy alternatives for web scraping you need to try.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Which Python scraping library is right for you?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So, which library should you use for your web scraping project? The answer depends on the specific needs and requirements of your project. Each web scraping library and framework presented here has a unique purpose in an expert scraper's toolkit. Learning to use each one will give you the flexibility to select the best tool for each job, so don't be afraid to try each of them before deciding!&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/C8DmvJQS3jk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Whether you are scraping with BeautifulSoup, Scrapy, Selenium, or Playwright, the Apify Python SDK helps you run your project in the cloud at any scale.&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>selenium</category>
      <category>playwright</category>
    </item>
    <item>
      <title>How to parse JSON with Python</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Thu, 18 May 2023 14:04:51 +0000</pubDate>
      <link>https://dev.to/apify/how-to-parse-json-with-python-412a</link>
      <guid>https://dev.to/apify/how-to-parse-json-with-python-412a</guid>
      <description>&lt;p&gt;Understand JSON structure and syntax, and learn how to parse JSON strings and files using Python's built-in json module and convert JSON files using Pandas.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is JSON?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON?ref=blog.apify.com" rel="noopener noreferrer"&gt;JSON (JavaScript Object Notation)&lt;/a&gt; is a lightweight data-interchange format that is easy for humans to read and write while also being easy for machines to parse and generate. It is widely used for transmitting data between a client and a server, as an alternative to XML.&lt;/p&gt;

&lt;p&gt;JSON data is represented as a collection of key-value pairs, where the keys are strings and the values can be any valid JSON data type, such as a &lt;code&gt;string&lt;/code&gt;, &lt;code&gt;number&lt;/code&gt;, &lt;code&gt;boolean&lt;/code&gt;, &lt;code&gt;null&lt;/code&gt;, &lt;code&gt;array&lt;/code&gt;, or &lt;code&gt;object&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"John Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"city"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"New York"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;age&lt;/code&gt;, and &lt;code&gt;city&lt;/code&gt; are the keys, and "John Doe", 30, and "New York" are the corresponding values.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;How to parse JSON strings in Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To parse a JSON string in Python, we can use the built-in &lt;code&gt;json&lt;/code&gt; module. This module provides two methods for working with JSON data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;json.loads()&lt;/code&gt; parses a JSON string and returns a Python object.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;json.dumps()&lt;/code&gt; takes a Python object and returns a JSON string.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is an example of how to use &lt;code&gt;json.loads()&lt;/code&gt; to parse a JSON string:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# JSON string
&lt;/span&gt;&lt;span class="n"&gt;json_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 30, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# parse JSON string
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# print Python object
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we import the &lt;code&gt;json&lt;/code&gt; module, define a JSON string, and use &lt;code&gt;json.loads()&lt;/code&gt; to parse it into a Python object. We then print the resulting Python object.&lt;/p&gt;

&lt;p&gt;Note that &lt;code&gt;json.loads()&lt;/code&gt; will raise a &lt;code&gt;json.decoder.JSONDecodeError&lt;/code&gt; exception if the input string is not valid JSON.&lt;/p&gt;

&lt;p&gt;After running the script above, we can expect the following output printed to the console:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;'name':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'John'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'age':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'city':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'New&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;York'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
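&lt;p&gt;To see that error handling in practice, here is a minimal sketch (the invalid string below is just an illustration) that catches the exception instead of letting it crash the script:&lt;/p&gt;

```python
import json

# Single quotes are not valid JSON, so this string will fail to parse
invalid_json = "{'name': 'John'}"

try:
    data = json.loads(invalid_json)
except json.JSONDecodeError as e:
    # The exception message includes the position of the problem
    print(f"Invalid JSON: {e}")
```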

&lt;h2&gt;
  
  
  &lt;strong&gt;How to read and parse JSON files in Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To parse a JSON file in Python, we can use the same &lt;code&gt;json&lt;/code&gt; module we used in the previous section. The only difference is that instead of passing a JSON string to &lt;code&gt;json.loads()&lt;/code&gt;, we open the file and pass the file object to &lt;code&gt;json.load()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For example, assume we have a file named &lt;code&gt;data.json&lt;/code&gt; that we would like to parse and read. Here's how we would do it:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# open JSON file
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# parse JSON data
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# print Python object
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we use the &lt;code&gt;open()&lt;/code&gt; function to open a file called &lt;code&gt;data.json&lt;/code&gt; in read mode. We then pass the file object to &lt;code&gt;json.load()&lt;/code&gt;, which parses the JSON data and returns a Python object. Finally, we print the resulting Python object.&lt;/p&gt;

&lt;p&gt;Note that if the file does not contain valid JSON, &lt;code&gt;json.load()&lt;/code&gt; will raise a &lt;code&gt;json.decoder.JSONDecodeError&lt;/code&gt; exception.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;How to pretty print JSON data in Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When working with JSON data in Python, it can often be helpful to &lt;em&gt;pretty print&lt;/em&gt; the data, which means to format it in a more human-readable way. The &lt;code&gt;json&lt;/code&gt; module provides a method called &lt;code&gt;json.dumps()&lt;/code&gt; that can be used to pretty print JSON data.&lt;/p&gt;

&lt;p&gt;Here is an example of how to pretty print JSON data in Python:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# define JSON data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hobbies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reading&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traveling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cooking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# pretty print JSON data
&lt;/span&gt;&lt;span class="n"&gt;pretty_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# print pretty JSON
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pretty_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hobbies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reading&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traveling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cooking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we define a Python dictionary representing JSON data, and then use &lt;code&gt;json.dumps()&lt;/code&gt; with the &lt;code&gt;indent&lt;/code&gt; argument set to 4 to pretty print the data. We then print the resulting pretty-printed JSON string.&lt;/p&gt;

&lt;p&gt;Note that &lt;code&gt;indent&lt;/code&gt; is an optional argument to &lt;code&gt;json.dumps()&lt;/code&gt; that specifies the number of spaces to use for indentation. If &lt;code&gt;indent&lt;/code&gt; is not specified, the JSON data will be printed without any indentation.&lt;/p&gt;
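&lt;p&gt;If we want to save the pretty-printed output to a file instead of printing it, the related &lt;code&gt;json.dump()&lt;/code&gt; method serializes a Python object directly to a file object and accepts the same &lt;code&gt;indent&lt;/code&gt; argument. A minimal sketch (the filename is illustrative):&lt;/p&gt;

```python
import json

data = {"name": "John", "age": 30, "city": "New York"}

# json.dump() writes straight to a file object; indent works
# the same way as in json.dumps()
with open("pretty_data.json", "w") as f:
    json.dump(data, f, indent=4)

# Reading it back with json.load() recovers the original object
with open("pretty_data.json") as f:
    print(json.load(f))
```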
&lt;h2&gt;
  
  
  &lt;strong&gt;How to parse JSON with Python Pandas&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In addition to the built-in &lt;code&gt;json&lt;/code&gt; package, we can also use &lt;code&gt;pandas&lt;/code&gt; to parse and work with JSON data in Python. &lt;code&gt;pandas&lt;/code&gt; provides a method called &lt;code&gt;pandas.read_json()&lt;/code&gt; that can read JSON data into a DataFrame.&lt;/p&gt;

&lt;p&gt;Compared to using the built-in &lt;code&gt;json&lt;/code&gt; package, working with &lt;code&gt;pandas&lt;/code&gt; can be easier and more convenient when we want to analyze and manipulate the data further, as it allows us to use the powerful and flexible &lt;code&gt;DataFrame&lt;/code&gt; object.&lt;/p&gt;

&lt;p&gt;Here is an example of how to parse JSON data with &lt;code&gt;pandas&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# define JSON data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jane&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# convert JSON to DataFrame using pandas
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# print DataFrame
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
   name  age      city
0  John   30  New York
1  Jane   25    London
2   Bob   35     Paris

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we define a Python dictionary representing JSON data, and use &lt;code&gt;json.dumps()&lt;/code&gt; to convert it to a JSON string. We then use &lt;code&gt;pandas.read_json()&lt;/code&gt; to read the JSON string into a DataFrame. Finally, we print the resulting DataFrame.&lt;/p&gt;

&lt;p&gt;One benefit of using &lt;code&gt;pandas&lt;/code&gt; to parse JSON data is that we can easily manipulate the resulting DataFrame, for example by selecting columns, filtering rows, or grouping data.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# define JSON data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jane&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# convert JSON to DataFrame using pandas
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# select columns
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="c1"&gt;# filter rows
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# print resulting DataFrame
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  name age
2 Bob 35

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we select only the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;age&lt;/code&gt; columns from the DataFrame, and filter out any rows where the age is less than or equal to 30.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;pandas&lt;/code&gt; to parse and work with JSON data in Python can be a convenient and powerful alternative to using the built-in &lt;code&gt;json&lt;/code&gt; package. It allows us to easily manipulate and analyze the data using the &lt;code&gt;DataFrame&lt;/code&gt; object, which offers a rich set of functionality for working with tabular data.&lt;/p&gt;
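&lt;p&gt;One caveat: &lt;code&gt;pandas.read_json()&lt;/code&gt; expects fairly flat, table-like JSON. For nested JSON, &lt;code&gt;pandas&lt;/code&gt; also offers &lt;code&gt;pandas.json_normalize()&lt;/code&gt;, which flattens nested objects into columns. A small sketch with made-up sample data:&lt;/p&gt;

```python
import pandas as pd

# A list of records with a nested "address" object
records = [
    {"name": "John", "address": {"city": "New York", "zip": "10001"}},
    {"name": "Jane", "address": {"city": "London", "zip": "SW1A"}},
]

# json_normalize flattens nested keys into dotted column names,
# e.g. "address.city"
df = pd.json_normalize(records)
print(df)
```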
&lt;h2&gt;
  
  
  &lt;strong&gt;How to convert JSON to CSV in Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Sometimes we might want to convert JSON data into a CSV format. Luckily, the &lt;code&gt;pandas&lt;/code&gt; library can also help us with that.&lt;/p&gt;

&lt;p&gt;We can use &lt;code&gt;pandas.read_json()&lt;/code&gt; to read JSON data into a DataFrame, followed by the &lt;code&gt;DataFrame.to_csv()&lt;/code&gt; method to write the DataFrame to a CSV file.&lt;/p&gt;

&lt;p&gt;Here is an example of how to convert JSON data to CSV in Python using &lt;code&gt;pandas&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# define JSON data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jane&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# convert JSON to DataFrame
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# write DataFrame to CSV file
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# read CSV file
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# print DataFrame
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   name age city
0 John 30 New York
1 Jane 25 London
2 Bob 35 Paris

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we define a Python dictionary representing JSON data, and use &lt;code&gt;json.dumps()&lt;/code&gt; to convert it to a JSON string. We then use &lt;code&gt;pandas.read_json()&lt;/code&gt; to read the JSON string into a DataFrame, and &lt;code&gt;DataFrame.to_csv()&lt;/code&gt; to write the DataFrame to a CSV file. Finally, we use &lt;code&gt;pandas.read_csv()&lt;/code&gt; to read the CSV file back into a DataFrame and print the result.&lt;/p&gt;

&lt;p&gt;Note that when calling &lt;code&gt;to_csv()&lt;/code&gt;, we pass &lt;code&gt;index=False&lt;/code&gt; to exclude the row index from the output CSV file.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-python/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw1200%2F2024%2F02%2FPython-web-scraping_-a-comprehensive-guide.png" height="450" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-python/" rel="noopener noreferrer" class="c-link"&gt;
          Python web scraping tutorial
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          How to scrape &amp;amp; parse data with Python (with code examples)
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw256h256%2F2025%2F07%2Ffavicon.png" width="48" height="48"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>python</category>
      <category>pandas</category>
      <category>json</category>
    </item>
    <item>
      <title>Web Scraping with Scrapy</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Mon, 17 Apr 2023 18:57:21 +0000</pubDate>
      <link>https://dev.to/apify/web-scraping-with-scrapy-40em</link>
      <guid>https://dev.to/apify/web-scraping-with-scrapy-40em</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;👋 Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is Scrapy?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/scrapy/scrapy?ref=blog.apify.com" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; is an open-source web scraping framework written in Python that provides an easy-to-use API for web scraping, as well as built-in functionality for handling large-scale web scraping projects, support for different types of data extraction, and the ability to work with different web protocols.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why use Scrapy?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Scrapy is the preferred tool for large-scale scraping projects due to its &lt;a href="https://blog.apify.com/beautiful-soup-vs-scrapy-web-scraping" rel="noopener noreferrer"&gt;advantages over other popular Python web scraping libraries&lt;/a&gt; such as BeautifulSoup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/?ref=blog.apify.com" rel="noopener noreferrer"&gt;BeautifulSoup&lt;/a&gt; is primarily a parser library, whereas Scrapy is a complete web scraping framework with handy built-in functionalities such as dedicated spider types for different scraping tasks and the ability to extend Scrapys functionality by using middleware and exporting data to different formats.&lt;/p&gt;

&lt;p&gt;Some real-world examples where Scrapy can be useful include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;E-commerce websites:&lt;/strong&gt; Scrapy can be used to extract product information such as prices, descriptions, and reviews from e-commerce websites such as Amazon, Walmart, and Target.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Social media:&lt;/strong&gt; Scrapy can be used to extract data such as public user information and posts from popular social media websites like Twitter, Facebook, and Instagram.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Job boards:&lt;/strong&gt; Scrapy can be used to monitor job board websites such as Indeed, Glassdoor, and LinkedIn for relevant job postings.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's important to note that Scrapy has some limitations. For example, it cannot scrape JavaScript-heavy websites. However, we can easily overcome this limitation by using Scrapy alongside other tools like &lt;a href="https://blog.apify.com/playwright-vs-selenium-webscraping/" rel="noopener noreferrer"&gt;Selenium or Playwright&lt;/a&gt; to tackle those sites.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/beautiful-soup-vs-scrapy-web-scraping/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw1200%2F2024%2F02%2FScrapy-vs.-Beautiful-Soup--which-one-to-choose-for-web-scraping.png" height="449" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/beautiful-soup-vs-scrapy-web-scraping/" rel="noopener noreferrer" class="c-link"&gt;
          Scrapy vs. Beautiful Soup for web scraping
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Learn the differences between these Python scraping libraries.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw256h256%2F2025%2F07%2Ffavicon.png" width="48" height="48"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Alright, now that we have a good idea of what Scrapy is and why it's useful, let's dive deeper into Scrapy's main features.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🎁 Exploring Scrapy Features&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Types of Spiders 🕷&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One of the key features of Scrapy is the ability to create different &lt;a href="https://docs.scrapy.org/en/latest/topics/spiders.html?ref=blog.apify.com" rel="noopener noreferrer"&gt;types of spiders&lt;/a&gt;. Spiders are essentially the backbone of Scrapy and are responsible for parsing websites and extracting data. There are three main types of spiders in Scrapy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spider:&lt;/strong&gt; The base class for all spiders. This is the simplest type of spider and is used for extracting data from a single page or a small set of pages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CrawlSpider:&lt;/strong&gt; A more advanced type of spider that is used for extracting data from multiple pages or entire websites. CrawlSpider automatically follows links and extracts data from each page it visits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SitemapSpider:&lt;/strong&gt; A specialized type of spider that is used for extracting data from websites that have a sitemap.xml file. SitemapSpider automatically visits each URL in the sitemap and extracts data from it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is an example of how to create a &lt;strong&gt;basic Spider&lt;/strong&gt; in Scrapy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myspider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;http://example.com&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# extract data from response
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This spider, named &lt;code&gt;myspider&lt;/code&gt;, will start by requesting the URL &lt;a href="http://example.com" rel="noopener noreferrer"&gt;&lt;code&gt;http://example.com&lt;/code&gt;&lt;/a&gt;. The &lt;code&gt;parse&lt;/code&gt; method is where you would write code to extract data from the response.&lt;/p&gt;

&lt;p&gt;Here is an example of how to create a &lt;strong&gt;CrawlSpider&lt;/strong&gt; in Scrapy:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.linkextractors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinkExtractor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.spiders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rule&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyCrawlSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mycrawlspider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;http://example.com&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_item&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# extract data from response
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This spider, named &lt;code&gt;mycrawlspider&lt;/code&gt;, will start by requesting the URL &lt;a href="http://example.com" rel="noopener noreferrer"&gt;&lt;code&gt;http://example.com&lt;/code&gt;&lt;/a&gt;. The &lt;code&gt;rules&lt;/code&gt; list contains one &lt;code&gt;Rule&lt;/code&gt; object that tells the spider to follow all links and call the &lt;code&gt;parse_item&lt;/code&gt; method on each response.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Extending Scrapy with Middlewares 🔗&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Middlewares allow us to &lt;a href="https://docs.scrapy.org/en/latest/topics/architecture.html?ref=blog.apify.com" rel="noopener noreferrer"&gt;extend Scrapy's functionality&lt;/a&gt;. Scrapy comes with several built-in middlewares that can be used out of the box.&lt;/p&gt;

&lt;p&gt;Additionally, we can write our own custom middleware to perform tasks like modifying request headers, logging, or handling exceptions. Let's take a look at some of the most commonly used Scrapy middlewares:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;UserAgentMiddleware:&lt;/strong&gt; This middleware allows you to set a custom User-Agent header for each request. This is useful for avoiding detection by websites that may block scraping bots based on the User-Agent header. To use this middleware, we can set it up in our Scrapy settings file like this:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;DOWNLOADER_MIDDLEWARES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy.downloadermiddlewares.useragent.UserAgentMiddleware&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy.downloadermiddlewares.useragent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we register our replacement user-agent middleware with a priority of &lt;code&gt;500&lt;/code&gt;; the priority value determines where the middleware runs relative to the others in the downloader middleware chain.&lt;/p&gt;

&lt;p&gt;By default, &lt;code&gt;UserAgentMiddleware&lt;/code&gt; sets the &lt;code&gt;User-Agent&lt;/code&gt; header of every request to the value of the &lt;code&gt;USER_AGENT&lt;/code&gt; setting in your Scrapy settings; it does not rotate user agents on its own, which is why a custom middleware is often used instead.&lt;/p&gt;

&lt;p&gt;Note that we first set &lt;code&gt;UserAgentMiddleware&lt;/code&gt; to &lt;code&gt;None&lt;/code&gt; before adding it to the &lt;code&gt;DOWNLOADER_MIDDLEWARES&lt;/code&gt; setting with a different priority.&lt;/p&gt;

&lt;p&gt;This is because the default &lt;code&gt;UserAgentMiddleware&lt;/code&gt; in Scrapy sets a generic user agent string for all requests, which may not be ideal for some scraping scenarios. If we need to use a custom user agent string, we'll need to customize the &lt;code&gt;UserAgentMiddleware&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Therefore, by setting &lt;code&gt;UserAgentMiddleware&lt;/code&gt; to &lt;code&gt;None&lt;/code&gt; first, we're telling Scrapy to remove the default &lt;code&gt;UserAgentMiddleware&lt;/code&gt; from the &lt;code&gt;DOWNLOADER_MIDDLEWARES&lt;/code&gt; setting before adding our own custom instance of the middleware with a different priority.&lt;/p&gt;
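&lt;p&gt;As a sketch of what that custom instance might look like (the class name and the user-agent strings below are hypothetical; you would register the class in &lt;code&gt;DOWNLOADER_MIDDLEWARES&lt;/code&gt; under your project's module path):&lt;/p&gt;

```python
import random

# Hypothetical pool of user-agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to keep processing the request normally
        return None
```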

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RetryMiddleware:&lt;/strong&gt; Scrapy comes with a &lt;code&gt;RetryMiddleware&lt;/code&gt; that can be used to retry failed requests. By default, it retries requests with HTTP status codes 500, 502, 503, 504, 408, and when an exception is raised. You can customize the behavior of this middleware by specifying the &lt;code&gt;RETRY_TIMES&lt;/code&gt; and &lt;code&gt;RETRY_HTTP_CODES&lt;/code&gt; settings. To use this middleware in its default configuration, you can simply add it to your Scrapy settings:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;DOWNLOADER_MIDDLEWARES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy.downloadermiddlewares.retry.RetryMiddleware&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;550&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
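&lt;p&gt;As an illustrative sketch (the specific values here are ours, not Scrapy's defaults), the retry behavior could then be tuned in &lt;code&gt;settings.py&lt;/code&gt;:&lt;/p&gt;

```python
# settings.py (sketch): retry failed requests up to 5 times instead of
# the default 2, and include 429 (Too Many Requests) in the retried codes
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
```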


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HttpProxyMiddleware:&lt;/strong&gt; This middleware allows you to use proxies to send requests. This is useful for avoiding detection and bypassing IP rate limits. To use this middleware, we can add it to our Scrapy settings file like this:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;DOWNLOADER_MIDDLEWARES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;110&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;myproject.middlewares.ProxyMiddleware&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;PROXY_POOL_ENABLED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will enable the &lt;code&gt;HttpProxyMiddleware&lt;/code&gt; and also enable the &lt;code&gt;ProxyMiddleware&lt;/code&gt; that we define. This middleware will select a random proxy for each request from a pool of proxies provided by the user.&lt;/p&gt;
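&lt;p&gt;Scrapy does not ship a &lt;code&gt;ProxyMiddleware&lt;/code&gt;; the one referenced above is something we write ourselves. A minimal sketch, with placeholder proxy URLs, might look like this:&lt;/p&gt;

```python
import random

# Placeholder proxy pool; replace these with your own proxy URLs
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]


class ProxyMiddleware:
    """Attaches a random proxy from the pool to each outgoing request.

    Scrapy's built-in HttpProxyMiddleware then reads request.meta['proxy']
    and routes the request through that proxy.
    """

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXIES)
```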

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CookiesMiddleware:&lt;/strong&gt; This middleware allows you to handle cookies sent by websites. By default, Scrapy stores cookies in memory, though they can be persisted elsewhere with a custom cookies middleware or a third-party extension. To add &lt;code&gt;CookiesMiddleware&lt;/code&gt; to the &lt;code&gt;DOWNLOADER_MIDDLEWARES&lt;/code&gt; setting, we simply specify the middleware class and its priority. In this case, we're using a priority of &lt;code&gt;700&lt;/code&gt;, which places it after the default &lt;code&gt;UserAgentMiddleware&lt;/code&gt; and &lt;code&gt;RetryMiddleware&lt;/code&gt; but before any custom middleware.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;DOWNLOADER_MIDDLEWARES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy.downloadermiddlewares.cookies.CookiesMiddleware&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;700&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now we can use &lt;code&gt;CookiesMiddleware&lt;/code&gt; to handle cookies sent by the website:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;myspider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://www.example.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Send an initial request without cookies
&lt;/span&gt;            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract cookies from the response headers
&lt;/span&gt;        &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Set-Cookie&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Send a new request with the cookies received
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.example.com/protected&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_protected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_protected&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Process the protected page here
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;When the spider sends an initial request to &lt;a href="https://www.example.com/" rel="noopener noreferrer"&gt;&lt;code&gt;https://www.example.com/&lt;/code&gt;&lt;/a&gt;, we're not sending any cookies yet. When we receive the response, we extract the cookies from the response headers and send a new request to a protected page with the received cookies.&lt;/p&gt;

&lt;p&gt;These are just a few of the uses for middlewares in Scrapy. The beauty of middlewares is that we can write our own custom middleware to keep extending Scrapy's features and perform additional tasks that fit our specific use cases.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Exporting Scraped Data 📤&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Scrapy provides built-in support for &lt;a href="https://docs.scrapy.org/en/latest/topics/feed-exports.html?ref=blog.apify.com" rel="noopener noreferrer"&gt;exporting scraped data&lt;/a&gt; in different formats, such as CSV, JSON, and XML. You can also create your own custom exporters to store data in different formats.&lt;/p&gt;

&lt;p&gt;Here's an example of how to store scraped data in a CSV file in Scrapy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note that this is a very basic example, and the&lt;/strong&gt; &lt;code&gt;closed&lt;/code&gt; method could be modified to handle errors and ensure that the file is closed properly. Also, the code is merely explanatory, and you will have to adapt it to make it work for your use case.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.exporters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CsvItemExporter&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;example&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://www.example.com&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//div[@class=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.//h2/text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.//p/text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;closed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w+b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CsvItemExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fields_to_export&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we define a spider that starts by scraping the "&lt;a href="https://www.example.com" rel="noopener noreferrer"&gt;&lt;strong&gt;https://www.example.com&lt;/strong&gt;&lt;/a&gt;" URL. We then define a &lt;code&gt;parse&lt;/code&gt; method that extracts the title and description of each item on the page. Finally, in the &lt;code&gt;closed&lt;/code&gt; method, which Scrapy calls when the spider finishes, we define a filename for the CSV file and export the scraped data using the &lt;code&gt;CsvItemExporter&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Another way of exporting extracted data in different formats using Scrapy is to use the &lt;code&gt;scrapy crawl&lt;/code&gt; command and specify the desired file format of our output. This can be done by appending the &lt;code&gt;-o&lt;/code&gt; flag followed by the filename and extension of the output file.&lt;/p&gt;

&lt;p&gt;For example, if we want to output our scraped data in JSON format, we would use the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl myspider &lt;span class="nt"&gt;-o&lt;/span&gt; output.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will store the scraped data in a file named &lt;code&gt;output.json&lt;/code&gt; in the same directory where the command was executed. Similarly, if we want to output the data in CSV format, we would use the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl myspider &lt;span class="nt"&gt;-o&lt;/span&gt; output.csv

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will store the scraped data in a file named &lt;code&gt;output.csv&lt;/code&gt; in the same directory where the command was executed.&lt;/p&gt;
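&lt;p&gt;Alternatively, exports can be configured declaratively through the &lt;code&gt;FEEDS&lt;/code&gt; setting (available since Scrapy 2.1); the filenames below are illustrative:&lt;/p&gt;

```python
# settings.py (sketch): write the scraped items to two feeds at once,
# equivalent to passing -o twice on the command line
FEEDS = {
    "output.json": {"format": "json"},
    "output.csv": {"format": "csv"},
}
```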

&lt;p&gt;Overall, Scrapy provides multiple ways to store and export scraped data, giving us the flexibility to choose the most appropriate method for our particular situation.&lt;/p&gt;

&lt;p&gt;Now that we have a better understanding of what is possible with Scrapy, let's explore how we can use this framework to extract data from real websites. We'll do this by building a few small projects, each showcasing a different Scrapy feature.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;🛠 Project: Building a Hacker News Scraper using a basic Spider&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this section, we will learn how to set up a Scrapy project and create a &lt;a href="https://docs.scrapy.org/en/latest/topics/spiders.html?ref=blog.apify.com#scrapy-spider" rel="noopener noreferrer"&gt;basic Spider&lt;/a&gt; to scrape the title, author, URL, and points of all articles displayed on the first page of the Hacker News website.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Creating a Scrapy Project&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before we can generate a Spider, we need to create a new Scrapy project. To do this, we'll use the terminal. Open a terminal window and navigate to the directory where you want to create your project. Start by installing Scrapy:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then run the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject hackernews

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This command will create a new directory called "hackernews" with the basic structure of a Scrapy project.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Creating a Spider&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that we have a Scrapy project set up, we can create a spider to scrape the data we want. In the same terminal window, navigate to the project directory using &lt;code&gt;cd hackernews&lt;/code&gt; and run the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy genspider hackernews_spider news.ycombinator.com

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This command will create a new spider in the &lt;code&gt;spiders&lt;/code&gt; directory of our project. We named the spider &lt;code&gt;hackernews_spider&lt;/code&gt; and set the start URL to &lt;a href="http://news.ycombinator.com" rel="noopener noreferrer"&gt;&lt;code&gt;news.ycombinator.com&lt;/code&gt;&lt;/a&gt;, which is our target website.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Writing the Spider Code&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Next, let's open the &lt;code&gt;hackernews_spider.py&lt;/code&gt; file in the &lt;code&gt;spiders&lt;/code&gt; directory of our project. We'll see a basic template for a Scrapy Spider.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsSpiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hackernews_spider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news.ycombinator.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://news.ycombinator.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Before we move on, let's quickly break down what we're seeing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;name&lt;/code&gt; attribute is the name of the Spider.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;allowed_domains&lt;/code&gt; attribute is a list of domains that the Spider is allowed to scrape.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;start_urls&lt;/code&gt; attribute is a list of URLs that the Spider should start scraping from.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;parse&lt;/code&gt; method is the method that Scrapy calls to handle the response from each URL in the &lt;code&gt;start_urls&lt;/code&gt; list.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cool, now for the fun part. Let's add some code to the &lt;code&gt;parse&lt;/code&gt; method to scrape the data we want.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsSpiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hackernews_spider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news.ycombinator.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://news.ycombinator.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tr.athing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rank::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this code, we use the &lt;code&gt;css&lt;/code&gt; method to extract data from the response. We select all the articles on the page using the CSS selector &lt;code&gt;tr.athing&lt;/code&gt;, and then we extract the &lt;strong&gt;title&lt;/strong&gt; , &lt;strong&gt;URL&lt;/strong&gt; , and &lt;strong&gt;rank&lt;/strong&gt; for each article using more specific selectors. Finally, we use the &lt;code&gt;yield&lt;/code&gt; keyword to return a Python dictionary with the scraped data.&lt;/p&gt;
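Since the real output depends on the live page, here is the same yield-a-dict pattern sketched in plain Python with made-up rows. The helper function and sample values are illustrative, not part of the project code; they just show how the rank cleanup and the generator behave:

```python
def parse_rows(rows):
    # Mimics the spider's parse method: yield one dict per article,
    # stripping the trailing dot from rank text such as "1."
    for rank_text, title, url in rows:
        yield {
            "URL": url,
            "title": title,
            "rank": rank_text.replace(".", ""),
        }

# Hypothetical values, shaped like what response.css("...").get() returns
sample = [("1.", "Example post", "https://example.com")]
items = list(parse_rows(sample))
# items[0] is {"URL": "https://example.com", "title": "Example post", "rank": "1"}
```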
&lt;h3&gt;
  
  
  &lt;strong&gt;Running the Hacker News Spider&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that our Spider is ready, let's run it and see it in action.&lt;/p&gt;

&lt;p&gt;By default, the data is output to the console, but we can also export it to other formats, such as JSON, CSV, or XML, by specifying the output format when running the scraper. To demonstrate that, let's run our Spider and export the extracted data to a JSON file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl hackernews &lt;span class="nt"&gt;-o&lt;/span&gt; hackernews.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will save the data to a file named &lt;code&gt;hackernews.json&lt;/code&gt; in the root directory of the project. You can use the same command to export the data to other formats by replacing the file extension with the desired format (e.g., &lt;code&gt;-o hackernews.csv&lt;/code&gt; for CSV format).&lt;/p&gt;

&lt;p&gt;That's it for running the spider. In the next section, we'll take a look at how we can use Scrapy's CrawlSpider to extract data from all pages on the Hacker News website.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;🛠 Project: Building a Hacker News Scraper using the CrawlSpider&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The previous section demonstrated how to scrape data from a single page using a basic Spider. While it is possible to write code to paginate through the remaining pages and scrape all the articles on HN using the basic Spider, Scrapy offers us a better solution: the &lt;a href="https://docs.scrapy.org/en/latest/topics/spiders.html?ref=blog.apify.com#crawlspider" rel="noopener noreferrer"&gt;CrawlSpider&lt;/a&gt;. So, without further ado, let's jump straight into the code.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Project Setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To start, let's create a new Scrapy project called &lt;code&gt;hackernews_crawlspider&lt;/code&gt; using the following command in your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject hackernews_crawlspider

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next, let's create a new spider using the CrawlSpider template. The CrawlSpider is a subclass of the Spider class and is designed for recursively following links and scraping data from multiple pages.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy genspider &lt;span class="nt"&gt;-t&lt;/span&gt; crawl hackernews_spider https://news.ycombinator.com/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This command generates a new spider called "hackernews_spider" in the "spiders" directory of your Scrapy project. It also specifies that the spider should use the CrawlSpider template and start by scraping the homepage of Hacker News.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Code&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our goal with this scraper is to extract the same data from each article that we scraped in the previous section: URL, title, and rank. The difference is that now we will define a set of rules for the scraper to follow when crawling through the website. For example, we will define a rule to tell the scraper where it can find the correct links to paginate through the HN content.&lt;/p&gt;

&lt;p&gt;With this in mind, here's what the final code for our use case looks like:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add imports CrawlSpider, Rule and LinkExtractor 👇
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.spiders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rule&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.linkextractors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinkExtractor&lt;/span&gt;

&lt;span class="c1"&gt;# Change the spider from "scrapy.Spider" to "CrawlSpider"
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hackernews&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news.ycombinator.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/news&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;custom_settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DOWNLOAD_DELAY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# Add a 1-second delay between requests
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Define a rule that should be followed by the link extractor. 
&lt;/span&gt;    &lt;span class="c1"&gt;# In this case, Scrapy will follow all the links with the "morelink" class
&lt;/span&gt;    &lt;span class="c1"&gt;# And call the "parse_article" function on every crawled page
&lt;/span&gt;    &lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news\\.ycombinator\\.com/news$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;restrict_css&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.morelink&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# When using the CrawlSpider we cannot use a parse function called "parse".
&lt;/span&gt;    &lt;span class="c1"&gt;# Otherwise, it will override the default function.
&lt;/span&gt;    &lt;span class="c1"&gt;# So, just rename it to something else, for example, "parse_article"
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tr.athing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rank::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now let's break down the code to understand what the CrawlSpider is doing for us in this scenario.&lt;/p&gt;

&lt;p&gt;You may notice that some parts of this code were already generated by the CrawlSpider, while other parts are very similar to what we did when writing the basic Spider.&lt;/p&gt;

&lt;p&gt;The first distinctive piece of code that may catch your attention is the &lt;code&gt;custom_settings&lt;/code&gt; attribute we have included. This adds a 1-second delay between requests. Since we are now sending multiple requests to access different pages on the website, having this additional delay between the requests can be useful in preventing the target website from being overwhelmed with too many requests at once.&lt;/p&gt;
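Beyond a fixed delay, Scrapy also ships an AutoThrottle extension that adapts the delay to how quickly the server responds. As a sketch of a politer settings dict (the setting names come from Scrapy's settings reference; the specific values are illustrative, not from the project above):

```python
# Illustrative per-spider settings: a fixed minimum delay plus Scrapy's
# AutoThrottle extension, which adjusts the delay based on server load.
custom_settings = {
    "DOWNLOAD_DELAY": 1,           # at least 1 second between requests
    "AUTOTHROTTLE_ENABLED": True,  # let Scrapy tune the delay dynamically
    "AUTOTHROTTLE_START_DELAY": 1,
    "AUTOTHROTTLE_MAX_DELAY": 10,
}
```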

&lt;p&gt;Next, we defined a set of rules to follow when crawling the website using the &lt;code&gt;rules&lt;/code&gt; attribute:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news\\.ycombinator\\.com/news$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;restrict_css&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.morelink&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each rule is defined using the &lt;code&gt;Rule&lt;/code&gt; class. Here, each &lt;code&gt;Rule&lt;/code&gt; receives a &lt;code&gt;LinkExtractor&lt;/code&gt; instance that defines which links to follow, along with a callback function that will be called to process the response from each crawled page. In this case, we have two rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The first rule&lt;/strong&gt; uses a &lt;code&gt;LinkExtractor&lt;/code&gt; instance with an &lt;code&gt;allow&lt;/code&gt; parameter that matches URLs that end with "&lt;a href="http://news.ycombinator.com/news" rel="noopener noreferrer"&gt;news.ycombinator.com/news&lt;/a&gt;". This will match the first page of news articles on Hacker News. We set the &lt;code&gt;callback&lt;/code&gt; parameter to &lt;code&gt;parse_article&lt;/code&gt;, which is the function that will be called to process the response from each page that matches this rule.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The second rule&lt;/strong&gt; uses a &lt;code&gt;LinkExtractor&lt;/code&gt; instance with a &lt;code&gt;restrict_css&lt;/code&gt; parameter that matches the &lt;code&gt;morelink&lt;/code&gt; class. This will match the "More" link at the bottom of each page of news articles on Hacker News. Again, we set the &lt;code&gt;callback&lt;/code&gt; parameter to &lt;code&gt;parse_article&lt;/code&gt; and the &lt;code&gt;follow&lt;/code&gt; parameter to &lt;code&gt;True&lt;/code&gt;, which tells Scrapy to follow links on this page that match the provided selector.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
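Before running the spider, you can sanity-check the intended &lt;code&gt;allow&lt;/code&gt; pattern of the first rule with Python's &lt;code&gt;re&lt;/code&gt; module (note the single backslashes: in a raw string, &lt;code&gt;\.&lt;/code&gt; matches a literal dot):

```python
import re

# The pattern the first Rule's LinkExtractor is meant to use:
# match URLs ending in "news.ycombinator.com/news"
pattern = re.compile(r"news\.ycombinator\.com/news$")

front_page = pattern.search("https://news.ycombinator.com/news")  # matches
other_page = pattern.search("https://news.ycombinator.com/newest")  # no match
```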

&lt;p&gt;Finally, we defined the &lt;code&gt;parse_article&lt;/code&gt; function, which takes a &lt;code&gt;response&lt;/code&gt; object as its argument. This function is called to process the response from each page that matches one of the rules defined in the &lt;code&gt;rules&lt;/code&gt; attribute.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tr.athing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rank::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this function, we use the &lt;code&gt;response.css&lt;/code&gt; method to extract data from the HTML of the page. Specifically, we look for all &lt;code&gt;tr&lt;/code&gt; elements with the &lt;code&gt;athing&lt;/code&gt; class and extract the URL, title, and rank of each article. We then use the &lt;code&gt;yield&lt;/code&gt; keyword to return a Python dictionary with this data.&lt;/p&gt;

&lt;p&gt;Remember that &lt;code&gt;yield&lt;/code&gt; is used instead of &lt;code&gt;return&lt;/code&gt; because &lt;code&gt;parse_article&lt;/code&gt; is a generator: it can produce many items from a single response, and Scrapy consumes and exports each item as soon as it is produced.&lt;/p&gt;
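The practical difference between returning a list and yielding items can be sketched in plain Python (a toy example, not project code):

```python
def as_list(n):
    # Builds the whole result up front; nothing is available until it finishes
    return [{"rank": str(i)} for i in range(1, n + 1)]

def as_generator(n):
    # Yields each item immediately, which is how Scrapy callbacks work:
    # every scraped item can flow into the export pipeline right away
    for i in range(1, n + 1):
        yield {"rank": str(i)}

gen = as_generator(3)
first = next(gen)  # the first item is available before the rest are computed
```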

&lt;p&gt;It's also worth noting that we've named the function &lt;code&gt;parse_article&lt;/code&gt; instead of the default &lt;code&gt;parse&lt;/code&gt; function that's used in Scrapy Spiders. This is because when you use the &lt;code&gt;CrawlSpider&lt;/code&gt; class, the default &lt;code&gt;parse&lt;/code&gt; function is used to parse the response from the first page that's crawled. If you define your own &lt;code&gt;parse&lt;/code&gt; function in a &lt;code&gt;CrawlSpider&lt;/code&gt;, it will override the default function, and your spider will not work as expected.&lt;/p&gt;

&lt;p&gt;To avoid this problem, it's considered good practice to always name our custom parsing functions something other than &lt;code&gt;parse&lt;/code&gt;. In this case, we've named our function &lt;code&gt;parse_article&lt;/code&gt;, but you could choose any other name that makes sense for your Spider.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Running the CrawlSpider&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Great, now that we understand what's happening in our code, it's time to put our spider to the test by running it with the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl hackernews &lt;span class="nt"&gt;-o&lt;/span&gt; hackernews.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will start the spider and scrape data from all the news items on all pages of the Hacker News website. The &lt;code&gt;-o&lt;/code&gt; flag again tells Scrapy to write the scraped data to a JSON file, which makes it easier to inspect the results.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;🕸 How to scrape JavaScript-heavy websites&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Scraping JavaScript-heavy websites can be a challenge with Scrapy alone since Scrapy is primarily designed to scrape static HTML pages. However, we can work around this limitation by using a headless browser like Playwright in conjunction with Scrapy to scrape dynamic web pages.&lt;/p&gt;

&lt;p&gt;Playwright is a library that provides a high-level API to control headless Chromium, Firefox, and WebKit. By using Playwright, we can programmatically interact with our target web page to simulate user actions and extract data from dynamically loaded elements.&lt;/p&gt;

&lt;p&gt;To use Playwright with Scrapy, we have to create a custom middleware that initializes a Playwright browser instance and retrieves the HTML content of a web page using Playwright. The middleware can then pass the HTML content to Scrapy for parsing and extraction of data.&lt;/p&gt;

&lt;p&gt;Luckily, the &lt;a href="https://github.com/scrapy-plugins/scrapy-playwright?ref=blog.apify.com" rel="noopener noreferrer"&gt;scrapy-playwright&lt;/a&gt; library lets us easily integrate Playwright with Scrapy. In the next section, we will build a small project using this combo to extract data from a JavaScript-heavy website, Mint Mobile. But before we move on, let's first take a quick look at the target webpage and understand why we wouldn't be able to extract the data we want with Scrapy alone.&lt;/p&gt;

&lt;p&gt;Mint Mobile requires JavaScript to load a considerable part of the content displayed on its product page, which makes it an ideal scenario for using Playwright in the context of web scraping:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mint Mobile product page with JavaScript &lt;em&gt;disabled&lt;/em&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmfjzybwhx7m1stiuzo9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmfjzybwhx7m1stiuzo9.png" alt="https://blog.apify.com/content/images/2022/12/Google_Pixel_7_Pro_Bundle___Mint_Mobile.png" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mint Mobile product page with JavaScript &lt;em&gt;enabled&lt;/em&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5ksc64u8se30m9leetw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5ksc64u8se30m9leetw.png" alt="https://blog.apify.com/content/images/2022/12/Google_Pixel_7_Pro_Bundle___Mint_Mobile-1.png" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, without JavaScript enabled, we would lose a significant portion of the data we want to extract. Since Scrapy cannot load JavaScript, you could think of the first image, with JavaScript disabled, as the "Scrapy view," while the second image, with JavaScript enabled, would be the "Playwright view."&lt;/p&gt;

&lt;p&gt;Cool, now that we know why we need a browser automation library like Playwright to scrape this page, it is time to translate this knowledge into code by building our next project: the Mint Mobile scraper.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;🛠 Project: Building a web scraper using Scrapy and Playwright&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this project, we'll scrape a specific product page from the Mint Mobile website: &lt;a href="https://www.mintmobile.com/product/google-pixel-7-pro-bundle/" rel="noopener noreferrer"&gt;https://www.mintmobile.com/product/google-pixel-7-pro-bundle/&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Project setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We start by creating a directory to house our project and installing the necessary dependencies:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create new directory and move into it&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;scrapy-playwright
&lt;span class="nb"&gt;cd &lt;/span&gt;scrapy-playwright

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Scrapy and scrapy-playwright&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy scrapy-playwright

&lt;span class="c"&gt;# Install the required browsers if you are running Playwright for the first time&lt;/span&gt;
playwright &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next, we start the Scrapy project and generate a spider:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject scrapy_playwright_project
scrapy genspider mintmobile https://www.mintmobile.com/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now let's activate &lt;code&gt;scrapy-playwright&lt;/code&gt; by adding a few lines of configuration to the &lt;code&gt;DOWNLOAD_HANDLERS&lt;/code&gt; setting in our project's &lt;code&gt;settings.py&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scrapy-playwright configuration
&lt;/span&gt;
&lt;span class="n"&gt;DOWNLOAD_HANDLERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
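&lt;p&gt;Depending on your scrapy-playwright version, the library also expects Scrapy to run on the asyncio-based Twisted reactor. If you hit reactor-related errors, adding the following setting should resolve them (check the scrapy-playwright README for the exact requirements of your version):&lt;/p&gt;

```python
# settings.py
# scrapy-playwright needs the asyncio reactor so Playwright's async API
# can run inside Scrapy's Twisted event loop.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```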


&lt;p&gt;Great! We're now ready to write some code to scrape our target website.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Code&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy_playwright.page&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PageMethod&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MintmobileSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mintmobile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://www.mintmobile.com/product/google-pixel-7-pro-bundle/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="c1"&gt;# Use Playwright
&lt;/span&gt;            &lt;span class="n"&gt;playwright&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Keep the page object so we can work with it later on
&lt;/span&gt;            &lt;span class="n"&gt;playwright_include_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Use PageMethods to wait for the content we want to scrape to be properly loaded before extracting the data
&lt;/span&gt;            &lt;span class="n"&gt;playwright_page_methods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="nc"&gt;PageMethod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wait_for_selector&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;div.m-productCard--device&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.m-productCard__heading h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.composited_product_details_wrapper &amp;gt; div &amp;gt; div &amp;gt; div:nth-child(2) &amp;gt; div.label &amp;gt; span::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pay_monthly_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.composite_price_monthly &amp;gt; span::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pay_today_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.composite_price p.price span.amount::attr(aria-label)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In the &lt;code&gt;start_requests&lt;/code&gt; method, the spider makes a single HTTP request to the mobile phone product page on the Mint Mobile website. We initialize this request using the &lt;code&gt;scrapy.Request&lt;/code&gt; class, passing a &lt;code&gt;meta&lt;/code&gt; dictionary with the options Playwright should use when scraping the page: &lt;code&gt;playwright&lt;/code&gt; set to &lt;code&gt;True&lt;/code&gt; to indicate that Playwright should handle the request, &lt;code&gt;playwright_include_page&lt;/code&gt; set to &lt;code&gt;True&lt;/code&gt; so the page object is kept for later use, and &lt;code&gt;playwright_page_methods&lt;/code&gt; set to a list of &lt;code&gt;PageMethod&lt;/code&gt; objects.&lt;/p&gt;

&lt;p&gt;In this case, there's only one &lt;code&gt;PageMethod&lt;/code&gt; object, which uses Playwright's &lt;code&gt;wait_for_selector&lt;/code&gt; method to wait for a specific CSS selector to appear on the page. This ensures the page has fully loaded before we start extracting its data.&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;parse&lt;/code&gt; method, the spider uses CSS selectors to extract data from the page. Four pieces of data are extracted: the &lt;code&gt;name&lt;/code&gt; of the product, its &lt;code&gt;memory&lt;/code&gt; capacity, the &lt;code&gt;pay_monthly_price&lt;/code&gt;, as well as the &lt;code&gt;pay_today_price&lt;/code&gt;.&lt;/p&gt;
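&lt;p&gt;One caveat: chained calls like &lt;code&gt;.get().strip()&lt;/code&gt; raise an &lt;code&gt;AttributeError&lt;/code&gt; whenever a selector matches nothing, since &lt;code&gt;.get()&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt;. A tiny helper, our own addition rather than part of the spider above, makes the extraction more forgiving:&lt;/p&gt;

```python
def clean_text(value, default=""):
    """Strip whitespace from a selector result, tolerating None.

    `value` is whatever `response.css(...).get()` returned; when the
    selector matched nothing, fall back to `default` instead of crashing.
    """
    if value is None:
        return default
    return value.strip()

# Usage inside parse(), for example:
#   "name": clean_text(response.css("div.m-productCard__heading h1::text").get()),
```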
&lt;h3&gt;
  
  
  &lt;strong&gt;Expected output:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, let's run our spider using the command &lt;code&gt;scrapy crawl mintmobile -o data.json&lt;/code&gt; to scrape the target data and store it in a &lt;code&gt;data.json&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
    {
        "name": "Google Pixel 7 Pro",
        "memory": "128GB",
        "pay_monthly_price": "50",
        "pay_today_price": "589"
    }
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
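&lt;p&gt;Note that the prices come back as strings. If you plan to do any arithmetic on them downstream, a small normalization step helps; the function below is an illustrative sketch, not part of the spider:&lt;/p&gt;

```python
def parse_price(text):
    """Convert a scraped price string such as "$1,299" or "589" to a float."""
    # Drop currency symbols and thousands separators before converting
    cleaned = text.replace("$", "").replace(",", "").strip()
    return float(cleaned)
```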

&lt;h2&gt;
  
  
  &lt;strong&gt;Deploying Scrapy spiders to the cloud&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Next, we'll learn how to deploy Scrapy spiders to the cloud using Apify. This allows us to configure them to run on a schedule and access many other features of the platform.&lt;/p&gt;

&lt;p&gt;To demonstrate this, we'll use the &lt;a href="https://docs.apify.com/sdk/python/?ref=blog.apify.com" rel="noopener noreferrer"&gt;Apify SDK for Python&lt;/a&gt; and select the Scrapy development template to help us kickstart the setup process. We'll then modify the generated boilerplate code to run our CrawlSpider-based Hacker News scraper. Let's get started.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Installing the Apify CLI&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To start working with the Apify CLI, we first need to install it. There are two ways to do this: via the Homebrew package manager on macOS or Linux, or via NPM, the Node.js package manager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Via Homebrew&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On macOS (or Linux), you can install the Apify CLI via the &lt;a href="https://brew.sh/?ref=blog.apify.com" rel="noopener noreferrer"&gt;Homebrew package manager&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;apify/tap/apify-cli

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Via NPM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Install or upgrade the Apify CLI by running:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="nb"&gt;install &lt;/span&gt;apify-cli

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Creating a new Actor&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once you have the Apify CLI installed on your computer, simply run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apify create scrapy-actor

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then, go ahead and select the &lt;strong&gt;Python Scrapy&lt;/strong&gt; template:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdx5ucznn5iq0zf2ujhg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdx5ucznn5iq0zf2ujhg5.png" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This command will create a new folder named &lt;code&gt;scrapy-actor&lt;/code&gt;, install all the necessary dependencies, and generate boilerplate code that we can use to kickstart our development with Scrapy and the &lt;a href="https://docs.apify.com/sdk/python/?ref=blog.apify.com" rel="noopener noreferrer"&gt;Apify SDK for Python&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, move into the newly created folder and open it in your preferred code editor; in this example, I'm using VS Code.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;scrapy-actor
code &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Configuring the Scrapy Actor template&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The template already creates a fully functional scraper. You can run it using the command &lt;code&gt;apify run&lt;/code&gt;. If you'd like to try it before we modify the code, the scraped results will be stored under &lt;code&gt;storage/datasets&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now that we're familiar with the template, we can modify it to accommodate our Hacker News scraper.&lt;/p&gt;

&lt;p&gt;To make our first adjustment, we need to replace the template code in &lt;code&gt;src/spiders/title_spider.py&lt;/code&gt; with our own code. After the replacement, your code should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnd82rdp8vqdqlghlonh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnd82rdp8vqdqlghlonh3.png" width="800" height="405"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add imports CrawlSpider, Rule and LinkExtractor 👇
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.spiders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rule&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.linkextractors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinkExtractor&lt;/span&gt;

&lt;span class="c1"&gt;# Change the spider from "scrapy.Spider" to "CrawlSpider"
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hackernews&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news.ycombinator.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/news&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;custom_settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DOWNLOAD_DELAY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# Add a 1-second delay between requests
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Define a rule that should be followed by the link extractor. 
&lt;/span&gt;    &lt;span class="c1"&gt;# In this case, Scrapy will follow all the links with the "morelink" class
&lt;/span&gt;    &lt;span class="c1"&gt;# And call the "parse_article" function on every crawled page
&lt;/span&gt;    &lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news\\.ycombinator\\.com/news$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;restrict_css&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.morelink&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# When using the CrawlSpider we cannot use a parse function called "parse".
&lt;/span&gt;    &lt;span class="c1"&gt;# Otherwise, it will override the default function.
&lt;/span&gt;    &lt;span class="c1"&gt;# So, just rename it to something else, for example, "parse_article"
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tr.athing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rank::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, before running the Actor, we need to make some adjustments to the &lt;code&gt;main.py&lt;/code&gt; file to align it with the modifications we made to the original spider template.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k38m4yoizr5pfkaa0bi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k38m4yoizr5pfkaa0bi.png" width="800" height="322"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrawlerProcess&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.utils.project&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_project_settings&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.pipelines&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ActorDatasetPushPipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.spiders.hackernews_spider&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HackernewsSpider&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;actor_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_input&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;max_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_depth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start_url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;start_url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_urls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/news&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}])]&lt;/span&gt;

        &lt;span class="n"&gt;settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_project_settings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ITEM_PIPELINES&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;ActorDatasetPushPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DEPTH_LIMIT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_depth&lt;/span&gt;

        &lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrawlerProcess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;install_root_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# If you want to run multiple spiders, call `process.crawl` for each of them here
&lt;/span&gt;        &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HackernewsSpider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_urls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start_urls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Running the Actor locally&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Great! Now we're ready to run our Scrapy Actor. To do so, let's type the command &lt;code&gt;apify run&lt;/code&gt; in our terminal. After a few seconds, the &lt;code&gt;storage/datasets&lt;/code&gt; folder will be populated with the scraped data from Hacker News.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzttvdyjutkpxgnvaudaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzttvdyjutkpxgnvaudaz.png" width="800" height="475"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Deploying the Actor to Apify&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before deploying the Actor to Apify, we need to make one final adjustment. Go to &lt;code&gt;.actor/input_schema.json&lt;/code&gt; and change the &lt;strong&gt;prefill&lt;/strong&gt; URL to &lt;a href="https://news.ycombinator.com/news" rel="noopener noreferrer"&gt;&lt;code&gt;https://news.ycombinator.com/news&lt;/code&gt;&lt;/a&gt;. This prefill value is what Apify Console shows as the default start URL when the scraper runs on the platform.&lt;/p&gt;
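&lt;p&gt;For reference, the relevant part of the schema might look roughly like this (an abridged, hypothetical sketch; your generated &lt;code&gt;.actor/input_schema.json&lt;/code&gt; will contain additional fields):&lt;/p&gt;

```json
{
    "title": "Scrapy Actor input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "editor": "requestListSources",
            "prefill": [{ "url": "https://news.ycombinator.com/news" }]
        }
    }
}
```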

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9oje4vcoxrg8eu7i75qi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9oje4vcoxrg8eu7i75qi.png" width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we know our Actor works as expected, it's time to deploy it to the Apify platform. You will need to &lt;a href="https://console.apify.com/sign-up?ref=blog.apify.com" rel="noopener noreferrer"&gt;sign up for a free Apify account&lt;/a&gt; to follow along.&lt;/p&gt;

&lt;p&gt;Once you have an Apify account, run the command &lt;code&gt;apify login&lt;/code&gt; in the terminal. You will be prompted to provide your Apify API token, which you can find in Apify Console under &lt;a href="https://console.apify.com/account?ref=blog.apify.com#/integrations" rel="noopener noreferrer"&gt;Settings → Integrations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The final step is to run the &lt;code&gt;apify push&lt;/code&gt; command. This will start an Actor build, and after a few seconds, you should be able to see your newly created Actor in Apify Console under &lt;a href="https://console.apify.com/actors?tab=my&amp;amp;ref=blog.apify.com" rel="noopener noreferrer"&gt;Actors → My actors&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpt24pdz8cfzzo3c15uy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpt24pdz8cfzzo3c15uy.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Perfect! Your scraper is ready to run on the Apify platform. To begin, click the &lt;strong&gt;Start&lt;/strong&gt; button. Once the run is finished, you can preview and download your data in multiple formats in the &lt;strong&gt;Storage&lt;/strong&gt; tab.&lt;/p&gt;
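&lt;p&gt;Runs don't have to be started from the Console UI: the platform also exposes a REST endpoint for starting Actor runs. As a minimal sketch (the Actor ID and token below are placeholders), the run URL can be built like this:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Placeholder values -- substitute your own "username~actor-name" ID and API token.
ACTOR_ID = "my-username~my-scrapy-actor"
API_TOKEN = "MY_APIFY_TOKEN"

# POSTing to this endpoint with a valid token starts a run,
# much like pressing the Start button in Apify Console.
run_url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs?" + urlencode({"token": API_TOKEN})
print(run_url)
```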


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/alternatives-scrapy-web-scraping/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2F2024%2F08%2Falternatives_to_scrapy.png" height="449" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/alternatives-scrapy-web-scraping/" rel="noopener noreferrer" class="c-link"&gt;
          Scrapy alternatives in 2025
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          A curated list of libraries for web scraping in Python.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw256h256%2F2025%2F07%2Ffavicon.png" width="48" height="48"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Next steps&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you want to take your web scraping projects to the next level with the Apify SDK for Python and the Apify platform, here are some useful resources that might help you:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;More Python Actor templates&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.apify.com/sdk/python/docs/guides/requests-and-httpx?ref=blog.apify.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Scraping with Requests and HTTPX&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.apify.com/sdk/python/docs/guides/beautiful-soup?ref=blog.apify.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Scraping with BeautifulSoup&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.apify.com/sdk/python/docs/guides/playwright?ref=blog.apify.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Scraping with Playwright&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.apify.com/sdk/python/docs/guides/selenium?ref=blog.apify.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Scraping with Selenium&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Web Scraping Python tutorials&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/web-scraping-python/" rel="noopener noreferrer"&gt;&lt;strong&gt;Web scraping with Python&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/web-scraping-with-beautiful-soup/" rel="noopener noreferrer"&gt;&lt;strong&gt;Web scraping with Beautiful Soup and Requests&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/web-scraping-with-selenium-and-python/" rel="noopener noreferrer"&gt;&lt;strong&gt;Web scraping with Selenium and Python&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Web Scraping community on Discord&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, don't forget to join the &lt;strong&gt;Apify &amp;amp; Crawlee&lt;/strong&gt; community on Discord to connect with other web scraping and automation enthusiasts. 🚀&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://discord.com/invite/jyEM2PRvMU?ref=blog.apify.com" rel="noopener noreferrer" class="c-link"&gt;
          Apify &amp;amp; Crawlee
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          This is the official developer community of Apify and Crawlee. | 11719 members
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdiscord.com%2Fassets%2Ffavicon.ico" width="256" height="256"&gt;
        discord.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>python</category>
      <category>webscraping</category>
      <category>scrapy</category>
    </item>
    <item>
      <title>Web scraping with Beautiful Soup and Requests</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Thu, 30 Mar 2023 13:51:03 +0000</pubDate>
      <link>https://dev.to/apify/web-scraping-with-beautiful-soup-and-requests-1j7o</link>
      <guid>https://dev.to/apify/web-scraping-with-beautiful-soup-and-requests-1j7o</guid>
      <description>&lt;h2&gt;
  
  
  Introduction and requirements
&lt;/h2&gt;

&lt;p&gt;The internet is an endless source of information, and for many data-driven tasks, accessing this information is critical. For this reason, &lt;a href="https://blog.apify.com/what-are-web-crawlers-and-how-do-they-work/#web-scraping" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;, the practice of extracting data from websites, has become an increasingly important tool for machine learning developers, data analysts, researchers, and businesses alike.&lt;/p&gt;

&lt;p&gt;One of the most popular web scraping tools is &lt;a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/?ref=blog.apify.com" rel="noopener noreferrer"&gt;Beautiful Soup&lt;/a&gt;, a Python library that allows you to parse HTML and XML documents. Beautiful Soup makes it easy to extract specific pieces of information from web pages, and it can handle many of the quirks and inconsistencies that come with web scraping.&lt;/p&gt;

&lt;p&gt;Another crucial tool for web scraping is &lt;a href="https://github.com/psf/requests?ref=blog.apify.com" rel="noopener noreferrer"&gt;Requests&lt;/a&gt;, a Python library for making HTTP requests. Requests lets you send HTTP requests with very little code and comes with a range of helpful features, including cookie handling and authentication.&lt;/p&gt;
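&lt;p&gt;For example, a &lt;code&gt;requests.Session&lt;/code&gt; object persists cookies and default headers across requests. A minimal sketch (the header value and credentials below are illustrative, not real):&lt;/p&gt;

```python
import requests

# A Session keeps cookies and default headers between requests,
# which becomes useful once a site expects a realistic User-Agent
# or HTTP Basic credentials.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"})
session.auth = ("user", "secret")  # hypothetical Basic auth credentials

# Any cookies a response sets are replayed automatically on later
# requests made through the same session, e.g.:
# response = session.get("https://example.com/")
```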

&lt;p&gt;In this article, we will explore the basics of web scraping with Beautiful Soup and Requests, covering everything from sending HTTP requests to parsing the resulting HTML and extracting useful data. We will also go over how to handle website pagination to extract data from multiple pages. Finally, we will explore a few tricks we can use to &lt;a href="https://blog.apify.com/what-is-ethical-web-scraping-and-how-do-you-do-it/" rel="noopener noreferrer"&gt;scrape the web ethically&lt;/a&gt; while avoiding getting our scrapers blocked by modern &lt;a href="https://blog.apify.com/bypass-antiscraping-protections/" rel="noopener noreferrer"&gt;anti-bot protections&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To demonstrate all of that, we will build a &lt;a href="https://news.ycombinator.com/?ref=blog.apify.com" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt; scraper using the Requests and Beautiful Soup Python libraries to extract the &lt;strong&gt;rank&lt;/strong&gt;, &lt;strong&gt;URL&lt;/strong&gt;, and &lt;strong&gt;title&lt;/strong&gt; from all articles posted on HN. So, without further ado, let's start coding!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsn7h66y4gwhauazobxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsn7h66y4gwhauazobxd.png" alt="https://blog.apify.com/content/images/2023/01/Hacker_News.png" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Initial setup
&lt;/h2&gt;

&lt;p&gt;First, let's create a new directory &lt;code&gt;hacker-news-scraper&lt;/code&gt; to house our scraper, then move into it and create a new file named &lt;code&gt;main.py&lt;/code&gt;. We can either do this manually or straight from the terminal with the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt; &lt;span class="n"&gt;hacker&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;scraper&lt;/span&gt;

&lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="n"&gt;hacker&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;scraper&lt;/span&gt;

&lt;span class="n"&gt;touch&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Still in the terminal, let's use pip to install Requests and Beautiful Soup. Finally, we can open our project in our code editor of choice. Since I'm using VS Code, I will run the command &lt;code&gt;code .&lt;/code&gt; to open the current directory in it.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="n"&gt;beautifulsoup4&lt;/span&gt;

&lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  How to make an HTTP GET request with Requests
&lt;/h2&gt;

&lt;p&gt;In the &lt;a href="http://main.py" rel="noopener noreferrer"&gt;&lt;code&gt;main.py&lt;/code&gt;&lt;/a&gt; file, we will use Requests to make a GET request to our target website and save the obtained HTML code of the page to a variable named &lt;code&gt;html&lt;/code&gt; and log it to the console.&lt;/p&gt;
&lt;h3&gt;
  
  
  Code
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Output
&lt;/h3&gt;

&lt;p&gt;And here is the result we expect to see after running our script:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw5gfyaanoduhzo338ts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw5gfyaanoduhzo338ts.png" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Great! Now that we can retrieve the page's HTML, it's time to use Beautiful Soup to parse it and extract the specific data we want.&lt;/p&gt;
&lt;h2&gt;
  
  
  Parsing the data with Beautiful Soup
&lt;/h2&gt;

&lt;p&gt;Next, let's use Beautiful Soup to parse the HTML data and scrape the contents from all the articles on the first page of &lt;a href="https://news.ycombinator.com/news?ref=blog.apify.com" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Before we select an element, let's use the &lt;a href="https://developers.apify.com/academy/web-scraping-for-beginners/data-collection/browser-devtools?ref=blog.apify.com" rel="noopener noreferrer"&gt;developer tools&lt;/a&gt; to inspect the page and find what selectors we need to use to target the data we want to extract.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4ck8s5vq7a7ei6ei29q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4ck8s5vq7a7ei6ei29q.png" alt="https://blog.apify.com/content/images/2023/01/Fullscreen_1_12_23__1_16_PM.png" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When analyzing the website's structure, we can find each article's &lt;strong&gt;rank&lt;/strong&gt; and &lt;strong&gt;title&lt;/strong&gt; by selecting the element containing the class &lt;code&gt;athing&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Traversing the DOM with the BeautifulSoup &lt;em&gt;find&lt;/em&gt; method
&lt;/h2&gt;

&lt;p&gt;Next, let's use Beautiful Soup's &lt;code&gt;find_all&lt;/code&gt; method to select all elements containing the &lt;code&gt;athing&lt;/code&gt; class and save them to a variable named &lt;code&gt;articles&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next, to verify we have successfully selected the correct elements, let's loop through each article and print its text contents to the console.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Loop through the selected elements
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Log each article's text content to the console
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Great! We've managed to access each element's &lt;strong&gt;rank&lt;/strong&gt; and &lt;strong&gt;title&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the next step, we will use BeautifulSoup's &lt;code&gt;find&lt;/code&gt; method to grab the specific values we want to extract and organize the obtained data in a Python dictionary.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;find&lt;/code&gt; method searches an element's descendants and returns the first one that matches the given filter (or &lt;code&gt;None&lt;/code&gt; if nothing matches).&lt;/p&gt;

&lt;p&gt;In the context of our scraper, we can use &lt;code&gt;find&lt;/code&gt; to select specific descendants of each &lt;code&gt;article&lt;/code&gt; element.&lt;/p&gt;
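&lt;p&gt;To see the difference on a toy example: &lt;code&gt;find&lt;/code&gt; returns only the first matching descendant, while &lt;code&gt;find_all&lt;/code&gt; returns all of them. A minimal sketch on a made-up row (the markup is written with square brackets and converted to real tags at runtime):&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# A tiny stand-in for one Hacker News row; square brackets are
# swapped for real angle brackets (chr(60)/chr(62)) at runtime.
sample = (
    "[tr class='athing']"
    "[span class='rank']1.[/span]"
    "[span class='titleline'][a href='https://example.com/']Example[/a][/span]"
    "[/tr]"
).replace("[", chr(60)).replace("]", chr(62))

soup = BeautifulSoup(sample, "html.parser")
row = soup.find(class_="athing")                 # first (and only) matching row
title_link = row.find(class_="titleline").find("a")

print(title_link.get_text())                     # Example
print(title_link.get("href"))                    # https://example.com/
print(row.find(class_="rank").get_text())        # 1.
```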

&lt;p&gt;Returning to the Hacker News website, we can find the selectors we need to extract our target data.&lt;/p&gt;

&lt;p&gt;Here's what our code looks like using the &lt;code&gt;find&lt;/code&gt; method to get each article's &lt;strong&gt;URL&lt;/strong&gt;, &lt;strong&gt;title&lt;/strong&gt;, and &lt;strong&gt;rank&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, to make the data more presentable, let's use the &lt;code&gt;json&lt;/code&gt; library to save our output to a JSON file. Here is what our code looks like:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Extract data from each article on the page
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save scraped data
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Saving output data to JSON file.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;save_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hn_data.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;save_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
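&lt;p&gt;As a side note, the same save step can be written with a context manager, which guarantees the file is closed even if &lt;code&gt;json.dump&lt;/code&gt; raises an exception midway. A minimal sketch, using a small hard-coded sample in place of the scraped data:&lt;/p&gt;

```python
import json

# Sample data in the same shape our scraper produces
output = [{"URL": "https://example.com", "title": "Example", "rank": "1"}]

# "with" closes the file automatically, even if an error occurs mid-write
with open("hn_data.json", "w", encoding="utf-8") as save_output:
    json.dump(output, save_output, indent=6, ensure_ascii=False)
```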


&lt;p&gt;Great! We've just scraped information from all the articles displayed on the first page of Hacker News using Requests and Beautiful Soup. However, it would be even better if we could get the data from all articles on Hacker News, right?&lt;/p&gt;

&lt;p&gt;Now that we know how to get the data from one page, we just have to apply this same logic to all the remaining pages of the website. So, in the next section, we will handle the website's pagination.&lt;/p&gt;
&lt;h2&gt;
  
  
  Handling Pagination
&lt;/h2&gt;

&lt;p&gt;The concept of handling pagination in web scraping is quite straightforward. In short, we need to make our scraper repeat its scraping logic for each page visited until no more pages are left. To do that, we have to find a way to identify when the scraper reaches the last page, so that it can stop scraping and save our extracted data.&lt;/p&gt;

&lt;p&gt;So, let's start by initializing three variables: &lt;code&gt;scraping_hn&lt;/code&gt;, &lt;code&gt;page&lt;/code&gt;, and &lt;code&gt;output&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;scraping_hn&lt;/code&gt; is a Boolean variable that keeps track of whether the script has reached the last page of the website.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;page&lt;/code&gt; is an integer variable that keeps track of the current page number being scraped.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;output&lt;/code&gt; is an empty list that will be populated with the scraped data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, let's create a &lt;code&gt;while&lt;/code&gt; loop that continues scraping until the scraper reaches the last page. Within the loop, we will send a GET request to the current page of Hacker News, extract the &lt;strong&gt;URL, title,&lt;/strong&gt; and &lt;strong&gt;rank&lt;/strong&gt; of each article, and store the data in a dictionary with the keys &lt;strong&gt;"URL"&lt;/strong&gt;, &lt;strong&gt;"title"&lt;/strong&gt;, and &lt;strong&gt;"rank"&lt;/strong&gt;. We will then append the dictionary to the &lt;code&gt;output&lt;/code&gt; list.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Starting Hacker News Scraper...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Continue scraping until the scraper reaches the last page
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;scraping_hn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/?p=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Scraping &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract data from each article on the page
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After extracting data from all articles on the page, we will write an &lt;em&gt;if statement&lt;/em&gt; to check whether there is a &lt;strong&gt;More&lt;/strong&gt; button with the class &lt;code&gt;morelink&lt;/code&gt; on the page. We will check for this particular element because the &lt;strong&gt;More&lt;/strong&gt; button is present on all pages, except the last one.&lt;/p&gt;

&lt;p&gt;So, if the &lt;code&gt;morelink&lt;/code&gt; class is present, the script increments the page variable and continues scraping the next page. If there is no &lt;code&gt;morelink&lt;/code&gt; class, the script sets &lt;code&gt;scraping_hn&lt;/code&gt; to &lt;code&gt;False&lt;/code&gt; and exits the loop.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check if the scraper reached the last page
&lt;/span&gt;    &lt;span class="n"&gt;next_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;morelink&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next_page&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Finished scraping! Scraped a total of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; items.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Putting it all together, here is the code we have so far:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Starting Hacker News Scraper...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Continue scraping until the scraper reaches the last page
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;scraping_hn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/?p=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Scraping &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract data from each article on the page
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if the scraper reached the last page
&lt;/span&gt;    &lt;span class="n"&gt;next_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;morelink&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next_page&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Finished scraping! Scraped a total of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; items.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save scraped data
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Saving output data to JSON file.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;save_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hn_data.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;save_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In conclusion, our script successfully accomplished its goal of extracting data from all articles on Hacker News by using &lt;code&gt;Requests&lt;/code&gt; and &lt;code&gt;BeautifulSoup&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;However, it is important to note that not all websites will be as simple to scrape as Hacker News. Most modern webpages have a variety of anti-bot protections in place to prevent malicious bots from overloading their servers with requests.&lt;/p&gt;

&lt;p&gt;In our situation, we are simply automating a data collection process without any malicious intent against the target website. So, in the next section, we will talk about what measures we can use to reduce the likelihood of our scrapers getting blocked.&lt;/p&gt;
&lt;h2&gt;
  
  
  Avoid being blocked with Requests
&lt;/h2&gt;

&lt;p&gt;Hacker News is a simple website without any aggressive anti-bot protections in place, so we were able to scrape it without running into any major blocking issues.&lt;/p&gt;

&lt;p&gt;Complex websites might employ different techniques to detect and block bots, such as analyzing the data encoded in HTTP requests received by the server, fingerprinting, CAPTCHAS, and more.&lt;/p&gt;

&lt;p&gt;Avoiding all types of blocking can be a very challenging task, and its difficulty varies according to your target website and the scale of your scraping activities.&lt;/p&gt;

&lt;p&gt;Nevertheless, there are some simple techniques, like passing the correct &lt;code&gt;User-Agent&lt;/code&gt; header, that can already help our scrapers pass basic website verifications.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is the User-Agent header?
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;User-Agent&lt;/code&gt; header informs the server about the operating system, vendor, and version of the requesting client. This is relevant because any inconsistencies in the information the website receives may alert it about suspicious bot-like activity, leading to our scrapers getting blocked.&lt;/p&gt;

&lt;p&gt;One of the ways we can avoid this is by passing custom headers to the HTTP request we made earlier using Requests, thus ensuring that the &lt;code&gt;User-Agent&lt;/code&gt; used matches the one from the machine sending the request.&lt;/p&gt;

&lt;p&gt;You can check your own &lt;code&gt;User-Agent&lt;/code&gt; by accessing the &lt;a href="http://whatsmyuseragent.org/" rel="noopener noreferrer"&gt;http://whatsmyuseragent.org/&lt;/a&gt; website. For example, this is my computer's &lt;code&gt;User-Agent&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxweg37lo87nj1pi685tb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxweg37lo87nj1pi685tb.png" alt="https://blog.apify.com/content/images/2023/01/What_s_my_User_Agent_.png" width="800" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this information, we can now pass the &lt;code&gt;User-Agent&lt;/code&gt; header to our Requests HTTP request.&lt;/p&gt;
&lt;h3&gt;
  
  
  How to use the User-Agent header in Requests
&lt;/h3&gt;

&lt;p&gt;In order to verify that Requests is indeed sending the specified headers, let's create a new file named &lt;code&gt;headers-test.py&lt;/code&gt; and send a request to the website &lt;a href="https://httpbin.org/" rel="noopener noreferrer"&gt;https://httpbin.org/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To send custom headers using Requests, we will pass a &lt;code&gt;headers&lt;/code&gt; parameter to the request method:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://httpbin.org/headers&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After running the &lt;code&gt;python3 headers-test.py&lt;/code&gt; command, we can expect to see our request headers printed to the console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1fusfpy25bxdqonm058.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1fusfpy25bxdqonm058.png" width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can verify by checking the &lt;code&gt;User-Agent&lt;/code&gt;, Requests used the custom headers we passed as a parameter to the request.&lt;/p&gt;

&lt;p&gt;In contrast, this is how the &lt;code&gt;User-Agent&lt;/code&gt; for the same request would look if we didn't pass any custom headers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuaph5l9soxpoj30j0fh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuaph5l9soxpoj30j0fh.png" width="800" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cool, now that we know how to properly pass custom headers to a Requests HTTP request, we can implement the same logic in our Hacker News scraper.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw1200%2F2024%2F03%2FAvoid-getting-blocked.png" height="449" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer" class="c-link"&gt;
          21 tips on how to crawl a website without getting blocked
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Guide on how to solve or avoid anti-scraping protections.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw256h256%2F2025%2F07%2Ffavicon.png" width="48" height="48"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;h3&gt;
  
  
  Required headers, cookies, and tokens
&lt;/h3&gt;

&lt;p&gt;Setting the proper &lt;code&gt;User-Agent&lt;/code&gt; header will help you avoid blocking, but it is not enough to overcome the more sophisticated anti-bot systems used by modern websites.&lt;/p&gt;

&lt;p&gt;There are many other types of information, such as additional headers, cookies, and access tokens, that we might be required to send with our request in order to get to the data we want. If you want to know more about the topic, check out the &lt;a href="https://developers.apify.com/academy/api-scraping/general-api-scraping/cookies-headers-tokens?ref=blog.apify.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Dealing with headers, cookies, and tokens&lt;/strong&gt;&lt;/a&gt; section of the Apify Web Scraping Academy.&lt;/p&gt;
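As a quick illustration (the token and cookie values below are placeholders, not anything a real site expects), a `requests.Session` lets us attach extra headers, cookies, and an access token once, after which they are sent with every request made through that session:

```python
import requests

session = requests.Session()

# Placeholder values for illustration only -- a real site would issue its own
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Authorization": "Bearer YOUR_ACCESS_TOKEN",
})
session.cookies.set("session_id", "YOUR_SESSION_COOKIE")

# Every request made through this session now carries the headers
# and cookies configured above, e.g.:
# response = session.get("https://example.com/protected-page")
print(session.headers["Authorization"])
```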

&lt;h3&gt;
  
  
  Restricting the number of requests sent to the server
&lt;/h3&gt;

&lt;p&gt;Another common strategy employed by anti-scraping protections is to monitor the frequency of requests sent to the server. If too many requests are sent in a short period of time, the server may flag the IP address of the scraper and block further requests from that address.&lt;/p&gt;

&lt;p&gt;An easy way to work around this limitation is to introduce a time delay between requests, giving the server enough time to process the previous request and respond before the next request is sent.&lt;/p&gt;

&lt;p&gt;To do that, we can call the &lt;code&gt;time.sleep()&lt;/code&gt; function before each HTTP request to slow down the frequency of requests to the server. This reduces the chances of being blocked by anti-scraping protections and lets our script scrape the website's data more reliably.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Wait before each request to avoid overloading the server
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/?p=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Final code
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Starting Hacker News Scraper...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Continue scraping until the scraper reaches the last page
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;scraping_hn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Wait before each request to avoid overloading the server
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://news.ycombinator.com/?p=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Scraping &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract data from each article on the page
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if the scraper reached the last page
&lt;/span&gt;    &lt;span class="n"&gt;next_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;morelink&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next_page&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Finished scraping! Scraped a total of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; items.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save scraped data
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Saving output data to JSON file.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;save_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hn_data.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;save_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
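As a side note, the file-saving step at the end can also be written with a context manager, which closes the file automatically even if `json.dump` raises an exception (the `output` list below just stands in for the scraped data):

```python
import json

# Stand-in for the data collected by the scraper above
output = [{"URL": "https://example.com", "title": "Example article", "rank": "1"}]

# `with` closes the file for us, even if an error occurs mid-write
with open("hn_data.json", "w") as save_output:
    json.dump(output, save_output, indent=6, ensure_ascii=False)
```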

&lt;h3&gt;
  
  
  GitHub repository
&lt;/h3&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/PerVillalva" rel="noopener noreferrer"&gt;
        PerVillalva
      &lt;/a&gt; / &lt;a href="https://github.com/PerVillalva/bs4-hn-scraper" rel="noopener noreferrer"&gt;
        bs4-hn-scraper
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      BeautifulSoup + Requests scraper to extract data from Hacker News
    &lt;/h3&gt;
  &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>python</category>
      <category>beautifulsoup</category>
      <category>requests</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Web scraping with Python</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Tue, 14 Feb 2023 13:21:35 +0000</pubDate>
      <link>https://dev.to/apify/web-scraping-with-python-ojp</link>
      <guid>https://dev.to/apify/web-scraping-with-python-ojp</guid>
      <description>&lt;p&gt;Explore some of the best Python libraries and frameworks available for web scraping and learn how to use them in your projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started with web scraping in Python
&lt;/h2&gt;

&lt;p&gt;Python is one of the most popular programming languages out there and is used across many different fields, such as AI, web development, automation, data science, and data extraction.&lt;/p&gt;

&lt;p&gt;For years, Python has been the go-to language for data extraction, boasting a large community of developers as well as a wide range of web scraping tools to help scrapers extract almost any data they wish from the web.&lt;/p&gt;

&lt;p&gt;This article will explore some of the best libraries and frameworks available for web scraping in Python and provide a quick sample of how to use them in different scraping scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  Requirements
&lt;/h3&gt;

&lt;p&gt;To fully understand the content and code samples showcased in this post, you should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Have Python installed on your computer&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Have a basic understanding of CSS selectors&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Be comfortable navigating the browser DevTools to find and select page elements&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  HTTP Clients
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;In the context of web scraping, HTTP clients are used for sending requests to the target website and retrieving information such as the website's HTML code or JSON payload.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Requests
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51d1874c0hnthjz7b2ao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51d1874c0hnthjz7b2ao.png" alt="Requests logo" width="195" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://requests.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Requests&lt;/a&gt; is the most popular HTTP library for Python. It is supported by solid documentation and has been adopted by a huge community.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚒️  Main Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep-Alive &amp;amp; Connection Pooling&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Browser-style SSL Verification&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HTTP(S) Proxy Support&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connection Timeouts&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chunked Requests&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
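To sketch a few of these features together (the proxy URL below is a placeholder you would replace with your own), we can reuse connections with a `Session`, set a `timeout`, and optionally route traffic through a proxy:

```python
import requests

# A Session keeps TCP connections alive and pools them across
# requests to the same host (keep-alive & connection pooling)
session = requests.Session()

# Placeholder proxy address -- substitute a real proxy, or omit `proxies`
proxies = {"https": "http://localhost:8080"}

def fetch(url, use_proxy=False):
    """Fetch a URL with a connection timeout, returning None on failure."""
    try:
        return session.get(
            url,
            timeout=5,  # seconds to wait before giving up
            proxies=proxies if use_proxy else None,
        )
    except requests.exceptions.RequestException as err:
        print(f"Request failed: {err}")
        return None
```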

&lt;h3&gt;
  
  
  ⚙️  Installation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;Send a request to the target website, retrieve its HTML code, and print the result to the console.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;requests&lt;/span&gt;

&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  HTTPX
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fy048ozyykcvxxhitrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fy048ozyykcvxxhitrg.png" alt="HTTPX" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.python-httpx.org/" rel="noopener noreferrer"&gt;HTTPX&lt;/a&gt; is a fully featured HTTP client library for Python 3, including an integrated command-line client while providing both sync and async APIs.&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚒️  Main Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A broadly requests-compatible API&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;An integrated command-line client&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Standard synchronous interface, but with async support if you need it&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fully type annotated&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  ⚙️  Installation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;httpx

&lt;span class="c"&gt;# For Python 3 macOS users&lt;/span&gt;
pip3 &lt;span class="nb"&gt;install &lt;/span&gt;httpx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;Similar to the &lt;code&gt;Requests&lt;/code&gt; example, we will send a request to the target website, retrieve the HTML of the page, and print it to the console along with the response status code.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  HTML and XML parser
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;In web scraping, HTML and XML parsers are used to interpret the response we get back from our target website, often in the form of HTML code. A library such as Beautiful Soup will help us parse this response and extract data from websites.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Beautiful Soup
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1cws5eyr648bguyq2iw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1cws5eyr648bguyq2iw.png" alt="Beautiful Soup logo" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.crummy.com/software/BeautifulSoup/" rel="noopener noreferrer"&gt;Beautiful Soup&lt;/a&gt; (also known as BS4) is a Python library for pulling data out of HTML and XML files with just a few lines of code. BS4 is relatively easy to use and presents itself as a lightweight option for tackling simple scraping tasks with speed.&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚒️ Main features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Provides simple, Pythonic idioms for navigating, searching, and modifying a parse tree.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sits on top of the parser of your choice, such as html.parser, lxml, or html5lib.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automatically converts incoming documents to Unicode and outgoing documents to UTF-8, handling nearly any HTML or XML document.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
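To illustrate the basic workflow on a tiny document, here is a sketch that parses an inline HTML snippet with the standard library's `html.parser` backend:

```python
from bs4 import BeautifulSoup

# A tiny inline document, just for demonstration
html = "<p class='headline'>Hello, <b>world</b></p>"

# html.parser ships with Python; lxml or html5lib can be swapped in
soup = BeautifulSoup(html, "html.parser")

print(soup.find(class_="headline").get_text())  # -> Hello, world
```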
&lt;h3&gt;
  
  
  ⚙️  Installation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;beautifulsoup4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;Let's now see how we can use &lt;strong&gt;Beautiful Soup + HTTPX&lt;/strong&gt; to extract the &lt;strong&gt;title content&lt;/strong&gt;, &lt;strong&gt;rank&lt;/strong&gt;, and &lt;strong&gt;URL&lt;/strong&gt; from all the articles on the first page of &lt;a href="https://news.ycombinator.com/news" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;from bs4 import BeautifulSoup
import httpx

response &lt;span class="o"&gt;=&lt;/span&gt; httpx.get&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"https://news.ycombinator.com/news"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
yc_web_page &lt;span class="o"&gt;=&lt;/span&gt; response.content

soup &lt;span class="o"&gt;=&lt;/span&gt; BeautifulSoup&lt;span class="o"&gt;(&lt;/span&gt;yc_web_page&lt;span class="o"&gt;)&lt;/span&gt;
articles &lt;span class="o"&gt;=&lt;/span&gt; soup.find_all&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"athing"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;article &lt;span class="k"&gt;in &lt;/span&gt;articles:
    data &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"URL"&lt;/span&gt;: article.find&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"titleline"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.find&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"a"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.get&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'href'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: article.find&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"titleline"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.getText&lt;span class="o"&gt;()&lt;/span&gt;,
        &lt;span class="s2"&gt;"rank"&lt;/span&gt;: article.find&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"rank"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.getText&lt;span class="o"&gt;()&lt;/span&gt;.replace&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"."&lt;/span&gt;, &lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    print&lt;span class="o"&gt;(&lt;/span&gt;data&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A few seconds after running the script, we will see a dictionary for each article, containing its URL, rank, and title, printed to the console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://vpnoverview.com/news/wifi-routers-used-to-produce-3d-images-of-humans/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'WiFi Routers Used to Produce 3D Images of Humans (vpnoverview.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://openjdk.org/jeps/8300786'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'JEP draft: No longer require super() and this() to appear first in a constructor (openjdk.org)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'2'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'item?id=34482433'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Ask HN: Those making $500+/month on side projects in 2023 -- Show and tell'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'3'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.solipsys.co.uk/new/ThePointOfTheBanachTarskiTheorem.html?wa22hn'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The Point of the Banach-Tarski Theorem (solipsys.co.uk)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'4'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://initialcommit.com/blog/git-sim'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Git-sim: Visually simulate Git operations in your own repos (initialcommit.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'5'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.cell.com/cell-reports-medicine/fulltext/S2666-3791(22)00474-8'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Brief structured respiration enhances mood and reduces physiological arousal (cell.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'6'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://en.wikipedia.org/wiki/I,_Libertine'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'I, Libertine (wikipedia.org)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'7'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'item?id=34465956'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Ask HN: Why did BASIC use line numbers instead of a full screen editor?'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'8'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://arxiv.org/abs/2203.03456'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Negative-weight single-source shortest paths in near-linear time (arxiv.org)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'9'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://onesignal.com/careers'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'OneSignal (YC S11) Is Hiring Engineers (onesignal.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'10'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://neelc.org/posts/chatgpt-gmail-spam/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s2"&gt;"Bypassing Gmail's spam filters with ChatGPT (neelc.org)"&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'11'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://cyber.dabamos.de/88x31/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The 88x31 GIF Collection (dabamos.de)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'12'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.middleeasteye.net/opinion/david-graeber-vs-yuval-harari-forgotten-cities-myths-how-civilisation-began'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The Dawn of Everything challenges a mainstream telling of prehistory (middleeasteye.net)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'13'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://blog.thinkst.com/2023/01/swipe-right-on-our-new-credit-card-tokens.html'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Detect breaches with Canary credit cards (thinkst.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'14'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.atlasobscura.com/articles/heritage-appalachian-apples'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Appalachian Apple hunter who rescued 1k '&lt;/span&gt;lost&lt;span class="s1"&gt;' varieties (2021) (atlasobscura.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'15'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.workingsoftware.dev/software-architecture-documentation-the-ultimate-guide/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The Guide to Software Architecture Documentation (workingsoftware.dev)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'16'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://arstechnica.com/tech-policy/2023/01/supreme-court-allows-reddit-mods-to-anonymously-defend-section-230/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Supreme Court allows Reddit mods to anonymously defend Section 230 (arstechnica.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'17'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://neurosciencenews.com/insula-empathy-pain-21818/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'How do we experience the pain of other people? (neurosciencenews.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'18'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://lwn.net/SubscriberLink/920158/313ec4305df220bb/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Nolibc: A minimal C-library replacement shipped with the kernel (lwn.net)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'19'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.economist.com/1843/2017/05/04/the-body-in-the-buddha'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The Body in the Buddha (2017) (economist.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'20'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://simonwillison.net/2023/Jan/13/semantic-search-answers/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'How to implement Q&amp;amp;A against your docs with GPT3 embeddings and Datasette (simonwillison.net)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'21'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://destevez.net/2023/01/decoding-lunar-flashlight/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Decoding Lunar Flashlight (destevez.net)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'22'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.hampsteadheath.net/about'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Hampstead Heath (hampsteadheath.net)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'23'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.otherlife.co/francisbacon/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The violent focus of Francis Bacon (otherlife.co)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'24'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://arstechnica.com/gaming/2019/10/explaining-how-fighting-games-use-delay-based-and-rollback-netcode/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'How fighting games use delay-based and rollback netcode (2019) (arstechnica.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'25'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://essays.georgestrakhov.com/ai-is-not-a-horse/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'AI Is Not a Horse (georgestrakhov.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'26'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://lawliberty.org/features/the-mystery-of-richard-posner/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The Mystery of Richard Posner (lawliberty.org)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'27'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://rodneybrooks.com/predictions-scorecard-2023-january-01/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Rodney Brooks Predictions Scorecard (rodneybrooks.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'28'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.notamonadtutorial.com/how-to-transform-code-into-arithmetic-circuits/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'How to transform code into arithmetic circuits (notamonadtutorial.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'29'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://github.com/jhhoward/WolfensteinCGA'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Wolfenstein 3D with a CGA Renderer (github.com/jhhoward)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'30'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
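Note that some of the URLs above (for example, the Ask HN entries) are relative paths rather than full links. A minimal standard-library sketch that resolves them against the site root and persists the results as JSON; the sample list and file name are illustrative:

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://news.ycombinator.com/"

# Hypothetical sample mirroring the scraper's output above
articles = [
    {"URL": "https://openjdk.org/jeps/8300786", "title": "JEP draft: ...", "rank": "2"},
    {"URL": "item?id=34482433", "title": "Ask HN: ...", "rank": "3"},
]

# Resolve relative links (Ask HN posts) against the site root;
# absolute URLs pass through urljoin unchanged
for article in articles:
    article["URL"] = urljoin(BASE_URL, article["URL"])

# Persist the cleaned results as JSON for later processing
with open("articles.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, indent=2)

print(articles[1]["URL"])  # → https://news.ycombinator.com/item?id=34482433
```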

&lt;h2&gt;
  
  
  Browser automation tools
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Browser automation libraries and frameworks&lt;/em&gt; have an off-label use for web scraping. Their ability to emulate a real browser is essential for accessing data on websites that require JavaScript to load their content.&lt;/p&gt;
&lt;h3&gt;
  
  
  Selenium
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqnkxjnfbw7yl9sss3st.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqnkxjnfbw7yl9sss3st.png" alt="Selenium logo" width="300" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Selenium is primarily a browser automation framework and ecosystem with an off-label use for web scraping. It uses the WebDriver protocol to control a headless browser and perform actions like clicking buttons, filling out forms, and scrolling.&lt;/p&gt;

&lt;p&gt;Because of its ability to render JavaScript, Selenium can be used to scrape dynamically loaded content.&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚒️ Main features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Browser Support (Firefox, Chrome, Safari, Opera...)&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Language Compatibility&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automate manual user interactions, such as UI testing, form submissions, and keyboard inputs.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic web elements handling&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  ⚙️  Installation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Selenium&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;selenium

&lt;span class="c"&gt;# We will also need to install webdriver-manager to run the code sample below&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;webdriver-manager
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;To demonstrate some of Selenium's capabilities, let's go to Amazon, scrape &lt;a href="https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0?_encoding=UTF8&amp;amp;qid=1642536225&amp;amp;sr=8-1" rel="noopener noreferrer"&gt;The Hitchhiker's Guide to the Galaxy&lt;/a&gt; product page, and save a screenshot of the accessed page.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium.webdriver.common.by&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;By&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;webdriver_manager.chrome&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChromeDriverManager&lt;/span&gt;

&lt;span class="c1"&gt;# Insert the website URL that we want to scrape
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ChromeDriverManager&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;install&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a dictionary with the scraped data
&lt;/span&gt;&lt;span class="n"&gt;book&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;book_title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;productTitle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSS_SELECTOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.a-link-normal.contributorNameID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;edition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;productSubtitle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSS_SELECTOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.a-size-base.a-color-price.a-color-price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Save a screenshot from the accessed page and print the dictionary contents to the console
&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;book.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After the script finishes its run, we will see an object containing the book's &lt;strong&gt;title, author, edition,&lt;/strong&gt; and &lt;strong&gt;price&lt;/strong&gt; logged to the console, and a screenshot of the page saved as &lt;code&gt;book.png&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"book_title"&lt;/span&gt;: &lt;span class="s2"&gt;"The Hitchhiker's Guide to the Galaxy: The Illustrated Edition"&lt;/span&gt;,
    &lt;span class="s2"&gt;"author"&lt;/span&gt;: &lt;span class="s2"&gt;"Douglas Adams"&lt;/span&gt;,
    &lt;span class="s2"&gt;"edition"&lt;/span&gt;: &lt;span class="s2"&gt;"Kindle Edition"&lt;/span&gt;,
    &lt;span class="s2"&gt;"price"&lt;/span&gt;: &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$7&lt;/span&gt;&lt;span class="s2"&gt;.99"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Saved image:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2023/01/book_png_-_python-post-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkprheysrihsd6zffkx65.png" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;
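Scraped fields come back as raw strings, so values like the price usually need light post-processing before analysis. A minimal standard-library sketch; the `parse_price` helper is hypothetical, and the `"$7.99"` value is taken from the output above:

```python
from decimal import Decimal

def parse_price(raw: str) -> Decimal:
    """Strip the currency symbol and thousands separators from a scraped price string."""
    return Decimal(raw.strip().lstrip("$").replace(",", ""))

# "$7.99" is the price string scraped in the example above
print(parse_price("$7.99"))  # → 7.99
```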
&lt;h3&gt;
  
  
  Playwright
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsbdavarevisuvghq81j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsbdavarevisuvghq81j.png" alt="Playwright logo" width="646" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By definition, &lt;a href="https://github.com/microsoft/playwright" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; is an open-source framework for web testing and automation developed and maintained by Microsoft.&lt;/p&gt;

&lt;p&gt;Despite having many features in common with Selenium, Playwright is considered a more modern and capable choice for automation, testing, and web scraping in Python.&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚒️ Main features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto-wait. Playwright, by default, waits for elements to be actionable before performing actions, eliminating the need for artificial timeouts.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-browser support, being able to drive Chromium, WebKit, Firefox, and Microsoft Edge.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-platform support. Available on Windows, Linux, and macOS, locally or on CI, headless, or headed.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  ⚙️  Installation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;pytest-playwright

&lt;span class="c"&gt;# For Python 3 macOS users&lt;/span&gt;
pip3 &lt;span class="nb"&gt;install &lt;/span&gt;pytest-playwright

&lt;span class="c"&gt;# Install the required browsers&lt;/span&gt;
playwright &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;To highlight Playwright's features as well as its similarities with Selenium, let's go back to Amazon's website and extract some data from &lt;a href="https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0?_encoding=UTF8&amp;amp;qid=1642536225&amp;amp;sr=8-1" rel="noopener noreferrer"&gt;The Hitchhiker's Guide to the Galaxy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Playwright version:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.sync_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sync_playwright&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;sync_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create a dictionary with the scraped data
&lt;/span&gt;    &lt;span class="n"&gt;book&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;book_title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#productTitle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.author .a-link-normal.contributorNameID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;edition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#productSubtitle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.a-size-base.a-color-price.a-color-price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;book.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After the scraper finishes its run, the Firefox browser controlled by Playwright will close, and the extracted data will be logged to the console.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scrapy: a full-fledged Python web crawling framework
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Scrapy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4ret06q96d9z7bfjmx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4ret06q96d9z7bfjmx4.png" alt="Scrapy logo" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scrapy is a fast high-level web crawling and web scraping framework written with &lt;a href="https://twistedmatrix.com/trac/" rel="noopener noreferrer"&gt;Twisted&lt;/a&gt;, a popular event-driven networking framework, which gives it asynchronous capabilities.&lt;/p&gt;

&lt;p&gt;Unlike the tools mentioned earlier, Scrapy is a full-fledged web crawling framework designed specifically for data extraction, with built-in support for handling requests, processing responses, and exporting data.&lt;/p&gt;

&lt;p&gt;Additionally, Scrapy provides handy out-of-the-box features, such as support for following links, handling multiple request types, and error handling, making it a powerful tool for web scraping projects of any size and complexity.&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚒️ Main features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feed exports in multiple formats, such as JSON, CSV, and XML.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;An interactive shell console for trying out CSS and XPath expressions to scrape data and debug your spiders.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in extensions and middlewares for handling cookies, HTTP authentication, caching, user-agent spoofing, and more.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  ⚙️  Installation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  📁 Project setup
&lt;/h3&gt;

&lt;p&gt;To demonstrate some of Scrapy's features, we will once again extract data from articles displayed on &lt;a href="https://news.ycombinator.com/" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We will start by scraping the top 30 articles and then use Scrapy's &lt;code&gt;CrawlSpider&lt;/code&gt; to follow the available page links and extract data from all the articles on the website.&lt;/p&gt;

&lt;p&gt;To begin, let's create a new directory, install Scrapy, initialize the project, and generate a new spider:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create new directory and move into it&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;scrapy-project
&lt;span class="nb"&gt;cd &lt;/span&gt;scrapy-project

&lt;span class="c"&gt;# Install Scrapy&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy

&lt;span class="c"&gt;# Initialize project&lt;/span&gt;
scrapy startproject scrapydemo

&lt;span class="c"&gt;# Generate spider&lt;/span&gt;
scrapy genspider demospider https://news.ycombinator.com/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After our spider is generated, let's specify the encoding for the output file that will contain the scraped data by adding &lt;code&gt;FEED_EXPORT_ENCODING = "utf-8"&lt;/code&gt; to our &lt;code&gt;settings.py&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2023/01/settings_py_-_python-scrapy.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8bam2lr3s7hraiaei97.png" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;Finally, go to the &lt;code&gt;demospider.py&lt;/code&gt; file and write some code:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DemospiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;demospider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tr.athing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rank::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then, let's use the following command to run the spider and store the scraped data in a &lt;code&gt;results.json&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl demospider &lt;span class="nt"&gt;-o&lt;/span&gt; results.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
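&lt;p&gt;A note on the &lt;code&gt;-o&lt;/code&gt; flag: Scrapy infers the feed format from the file extension, so the same spider can export JSON, CSV, or XML without any code changes. Also, &lt;code&gt;-o&lt;/code&gt; appends to an existing file, while the capital &lt;code&gt;-O&lt;/code&gt; (available since Scrapy 2.1) overwrites it:&lt;/p&gt;

```shell
# Export format is inferred from the file extension
scrapy crawl demospider -o results.json
scrapy crawl demospider -o results.csv
scrapy crawl demospider -o results.xml

# -O (capital) overwrites the output file instead of appending to it
scrapy crawl demospider -O results.json
```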

&lt;h3&gt;
  
  
  🕷️ Using Scrapy's CrawlSpider
&lt;/h3&gt;

&lt;p&gt;Now that we know how to extract data from the articles on the first page of Hacker News, let's use Scrapy's &lt;code&gt;CrawlSpider&lt;/code&gt; to follow the next page links and collect data from all the articles on the website.&lt;/p&gt;

&lt;p&gt;To do that, we will make some adjustments to our &lt;code&gt;demospider.py&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add imports CrawlSpider, Rule and LinkExtractor 👇
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.spiders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rule&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.linkextractors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinkExtractor&lt;/span&gt;

&lt;span class="c1"&gt;# Change the spider from "scrapy.Spider" to "CrawlSpider"
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DemospiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;demospider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news.ycombinator.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://news.ycombinator.com/news?p=1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Define a rule that should be followed by the link extractor.
&lt;/span&gt;    &lt;span class="c1"&gt;# In this case, Scrapy will follow all the links with the "morelink" class
&lt;/span&gt;    &lt;span class="c1"&gt;# And call the "parse_article" function on every crawled page
&lt;/span&gt;    &lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;restrict_css&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.morelink&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# When using the CrawlSpider we cannot use a parse function called "parse".
&lt;/span&gt;    &lt;span class="c1"&gt;# Otherwise, it will override the default function.
&lt;/span&gt;    &lt;span class="c1"&gt;# So, just rename it to something else, for example, "parse_article"
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tr.athing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rank::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, let's add a small delay between each of Scrapy's requests to avoid overloading the server. We can do that by adding &lt;code&gt;DOWNLOAD_DELAY = 0.5&lt;/code&gt; to our &lt;code&gt;settings.py&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2023/01/settings_py_-_python-scrapy-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0mj4uksjnlcvy7peq3y.png" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Great! Now we are ready to run our scraper and get the data from all the articles displayed on Hacker News. Just run the command &lt;code&gt;scrapy crawl demospider -o results.json&lt;/code&gt; and wait for the run to finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2023/01/results_json_-_python-scrapy-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzt5pmakkqfiu7wh2rix.png" width="800" height="665"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  🎭 Using Playwright with Scrapy
&lt;/h2&gt;

&lt;p&gt;Scrapy and Playwright make one of the most efficient combos for modern web scraping in Python.&lt;/p&gt;

&lt;p&gt;This combo allows us to benefit from Playwright's ability to render dynamically loaded content and retrieve the resulting HTML, which we can then parse with Scrapy to extract the data we need.&lt;/p&gt;

&lt;p&gt;To integrate Playwright with Scrapy, we will use the &lt;a href="https://github.com/scrapy-plugins/scrapy-playwright" rel="noopener noreferrer"&gt;scrapy-playwright&lt;/a&gt; library. Then, we will scrape &lt;a href="https://www.mintmobile.com/product/google-pixel-7-pro-bundle/" rel="noopener noreferrer"&gt;&lt;code&gt;https://www.mintmobile.com/product/google-pixel-7-pro-bundle/&lt;/code&gt;&lt;/a&gt; to demonstrate how to extract data from a website using Playwright and Scrapy.&lt;/p&gt;

&lt;p&gt;Mint Mobile requires JavaScript to load most of the content displayed on its product page, which makes it an ideal scenario for using Playwright in the context of web scraping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mint Mobile product page with JavaScript &lt;em&gt;disabled&lt;/em&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/12/Google_Pixel_7_Pro_Bundle___Mint_Mobile.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmfjzybwhx7m1stiuzo9.png" alt="Mint Mobile JavaScript disabled" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mint Mobile product page with JavaScript &lt;em&gt;enabled&lt;/em&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/12/Google_Pixel_7_Pro_Bundle___Mint_Mobile-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5ksc64u8se30m9leetw.png" alt="Mint Mobile JavaScript enabled" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚙️  Project setup
&lt;/h3&gt;

&lt;p&gt;Start by creating a directory to house our project and installing the necessary dependencies:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create new directory and move into it&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;scrapy-playwright
&lt;span class="nb"&gt;cd &lt;/span&gt;scrapy-playwright
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Scrapy and scrapy-playwright&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy scrapy-playwright

&lt;span class="c"&gt;# Install the required browsers if you are running Playwright for the first time&lt;/span&gt;
playwright &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# Or install a subset of the available browsers you plan on using&lt;/span&gt;
playwright &lt;span class="nb"&gt;install &lt;/span&gt;firefox chromium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next, start the Scrapy project and generate a spider:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject pwsdemo
scrapy genspider demospider https://www.mintmobile.com/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now, let's activate &lt;code&gt;scrapy-playwright&lt;/code&gt; by adding &lt;code&gt;DOWNLOAD_HANDLERS&lt;/code&gt; and &lt;code&gt;TWISTED_REACTOR&lt;/code&gt; to the scraper configuration in &lt;code&gt;settings.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scrapy-playwright configuration
&lt;/span&gt;
&lt;span class="n"&gt;DOWNLOAD_HANDLERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;TWISTED_REACTOR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;twisted.internet.asyncioreactor.AsyncioSelectorReactor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Great! We are now ready to write some code to scrape our target website.&lt;/p&gt;
&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;So, without further ado, let's use Playwright + Scrapy to extract data from Mint Mobile.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy_playwright.page&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PageMethod&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DemospiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;demospider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.mintmobile.com/product/google-pixel-7-pro-bundle/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="c1"&gt;# Use Playwright
&lt;/span&gt;            &lt;span class="n"&gt;playwright&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Keep the page object so we can work with it later on
&lt;/span&gt;            &lt;span class="n"&gt;playwright_include_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Use PageMethods to wait for the content we want to scrape to be properly loaded before extracting the data
&lt;/span&gt;            &lt;span class="n"&gt;playwright_page_methods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="nc"&gt;PageMethod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wait_for_selector&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;div.m-productCard--device&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.m-productCard__heading h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.composited_product_details_wrapper &amp;gt; div &amp;gt; div &amp;gt; div:nth-child(2) &amp;gt; div.label &amp;gt; span::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pay_monthly_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.composite_price_monthly &amp;gt; span::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pay_today_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.composite_price p.price span.amount::attr(aria-label)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, run the spider using the command &lt;code&gt;scrapy crawl demospider -o results.json&lt;/code&gt; to scrape the target data and store it in a &lt;code&gt;results.json&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"Google Pixel 7 Pro"&lt;/span&gt;,
        &lt;span class="s2"&gt;"memory"&lt;/span&gt;: &lt;span class="s2"&gt;"128GB"&lt;/span&gt;,
        &lt;span class="s2"&gt;"pay_monthly_price"&lt;/span&gt;: &lt;span class="s2"&gt;"50"&lt;/span&gt;,
        &lt;span class="s2"&gt;"pay_today_price"&lt;/span&gt;: &lt;span class="s2"&gt;"589"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Learning resources 📚
&lt;/h2&gt;

&lt;p&gt;If you want to dive deeper into some of the libraries and frameworks we presented during this post, here is a curated list of great videos and articles about the topic:&lt;/p&gt;
&lt;h3&gt;
  
  
  General web scraping
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.apify.com/academy/web-scraping-for-beginners" rel="noopener noreferrer"&gt;Web scraping for beginners&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.apify.com/academy/node-js/choosing-the-right-scraper" rel="noopener noreferrer"&gt;How to choose the right scraper for the job&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.apify.com/academy/api-scraping" rel="noopener noreferrer"&gt;API scraping&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.apify.com/academy/advanced-web-scraping/scraping-paginated-sites#how-to-overcome-the-limit" rel="noopener noreferrer"&gt;Scraping websites with limited pagination&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.apify.com/academy/anti-scraping" rel="noopener noreferrer"&gt;Anti-scraping protections&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Beautiful Soup Tutorials
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developers.apify.com/academy/python/scrape-data-python" rel="noopener noreferrer"&gt;How to scrape data in Python using Beautiful Soup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/gRLHr664tXA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;
&lt;h3&gt;
  
  
  Browser automation tools
&lt;/h3&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Xjv1sY630Uc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/H2-5ecFwHHQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;
&lt;h3&gt;
  
  
  Scrapy
&lt;/h3&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/s4jtkzHhLzY"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/0wO7K-SoUHM"&gt;
&lt;/iframe&gt;
&lt;/p&gt;
&lt;h3&gt;
  
  
  Discord
&lt;/h3&gt;

&lt;p&gt;Finally, don't forget to join the &lt;strong&gt;Apify &amp;amp; Crawlee&lt;/strong&gt; community on Discord to connect with other web scraping and automation enthusiasts. 🚀&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer" class="c-link"&gt;
          Apify &amp;amp; Crawlee
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          This is the official developer community of Apify and Crawlee. | 11719 members
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdiscord.com%2Fassets%2Ffavicon.ico" width="256" height="256"&gt;
        discord.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>python</category>
      <category>tutorial</category>
      <category>webscraping</category>
    </item>
  </channel>
</rss>
