<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Percival Villalva</title>
    <description>The latest articles on DEV Community by Percival Villalva (@percivalvillal3).</description>
    <link>https://dev.to/percivalvillal3</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1026685%2Fd3c5aa18-264e-4e7a-9a39-b6ee9c8fc0d4.jpg</url>
      <title>DEV Community: Percival Villalva</title>
      <link>https://dev.to/percivalvillal3</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/percivalvillal3"/>
    <language>en</language>
    <item>
      <title>Crawlee data storage types: saving files, screenshots, and JSON results</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Mon, 27 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/crawlee-data-storage-types-saving-files-screenshots-and-json-results-j9o</link>
      <guid>https://dev.to/apify/crawlee-data-storage-types-saving-files-screenshots-and-json-results-j9o</guid>
      <description>&lt;p&gt;&lt;strong&gt;We're&lt;/strong&gt; &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;, a full-stack web scraping and browser automation platform. We are the maintainers of the open-source library&lt;/strong&gt; &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;&lt;strong&gt;Crawlee&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Managing and storing the data you collect is a crucial part of any &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; and data extraction project. It's often a complex task, especially when handling large datasets and ensuring output accuracy. Fortunately, Crawlee simplifies this process with its versatile storage types.&lt;/p&gt;

&lt;p&gt;In this article, we will look at Crawlee's storage types and demonstrate how they can make our lives easier when extracting data from the web.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting up Crawlee&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Setting up a Crawlee project is straightforward, provided you &lt;a href="https://blog.apify.com/how-to-install-nodejs/" rel="noopener noreferrer"&gt;have Node&lt;/a&gt; and npm installed. To begin, create a new Crawlee project using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx crawlee create crawlee-data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the command, you will be given a few template options to choose from. We will go with the CheerioCrawler JavaScript template. Remember, Crawlee's storage types are consistent across all crawlers, so the concepts we discuss here apply to any Crawlee crawler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z38knby90ahxbr4bsu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z38knby90ahxbr4bsu3.png" alt="Crawlee template options" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Crawlee template options&lt;/p&gt;

&lt;p&gt;Once installed, you'll find your new project in the &lt;code&gt;crawlee-data&lt;/code&gt; directory, ready with template code that scrapes the &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;crawlee.dev&lt;/a&gt; website:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08o7ya1h2bylhtofqbjl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08o7ya1h2bylhtofqbjl.png" alt="CheerioCrawler template code" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To test it, simply run &lt;code&gt;npm start&lt;/code&gt; in your terminal. You'll notice a &lt;code&gt;storage&lt;/code&gt; folder appear with subfolders like &lt;code&gt;datasets&lt;/code&gt;, &lt;code&gt;key_value_stores&lt;/code&gt;, and &lt;code&gt;request_queues&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6t97892ffkx2ji0wxjw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6t97892ffkx2ji0wxjw.png" alt="Crawlee storage" width="368" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Crawlee's storage can be divided into two categories: &lt;strong&gt;Request Storage (Request Queue and Request List)&lt;/strong&gt; and &lt;strong&gt;Results Storage (Datasets and Key-Value Stores)&lt;/strong&gt;. Both are stored locally by default in the &lt;code&gt;./storage&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;Also, remember that, by default, Crawlee purges its storages before each crawler run. This prevents old data from interfering with new crawling sessions. If you need to clear the storages at some other point, Crawlee provides a handy &lt;code&gt;purgeDefaultStorages()&lt;/code&gt; helper function for this purpose.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Crawlee request queue&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-queue" rel="noopener noreferrer"&gt;request queue&lt;/a&gt; is a storage of URLs to be crawled. It's particularly useful for deep crawling, where you start with a few URLs and then recursively follow links to other pages.&lt;/p&gt;

&lt;p&gt;Each Crawlee project run is associated with a default request queue, which is typically used to store URLs for that specific crawler run.&lt;/p&gt;

&lt;p&gt;To illustrate this, let's open the &lt;code&gt;routes.js&lt;/code&gt; file in the template we just generated. There, you'll find the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createCheerioRouter } from 'crawlee';export const router = createCheerioRouter();router.addDefaultHandler(async ({ enqueueLinks, log }) =&amp;gt; { log.info(`enqueueing new URLs`); // Add links found on page to the queue await enqueueLinks({ globs: ['https://crawlee.dev/**'], label: 'detail', });});router.addHandler('detail', async ({ request, $, log, pushData }) =&amp;gt; { const title = $('title').text(); log.info(`${title}`, { url: request.loadedUrl }); await pushData({ url: request.loadedUrl, title, });});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's take a closer look at the &lt;code&gt;addDefaultHandler&lt;/code&gt; function, particularly focusing on the &lt;code&gt;enqueueLinks&lt;/code&gt; function it contains. The &lt;code&gt;enqueueLinks&lt;/code&gt; function in Crawlee is designed to automatically detect all links on a page and add them to the request queue. However, its utility extends further as it allows us to specify certain options for more precise control over which links are added.&lt;/p&gt;

&lt;p&gt;For instance, in our example, we use the &lt;a href="https://crawlee.dev/api/core/interface/EnqueueLinksOptions#globs" rel="noopener noreferrer"&gt;&lt;strong&gt;globs&lt;/strong&gt;&lt;/a&gt; option to ensure that only links starting with &lt;code&gt;https://crawlee.dev/&lt;/code&gt; are queued. Furthermore, we assign a &lt;code&gt;detail&lt;/code&gt; &lt;a href="https://crawlee.dev/api/core/interface/EnqueueLinksOptions#label" rel="noopener noreferrer"&gt;&lt;strong&gt;label&lt;/strong&gt;&lt;/a&gt; to these links. This label lets us refer to them in subsequent handler functions, where we can define specific data extraction operations for pages associated with it.&lt;/p&gt;

&lt;p&gt;💡 See all the available options for &lt;code&gt;enqueueLinks&lt;/code&gt; in the &lt;a href="https://crawlee.dev/api/core/interface/EnqueueLinksOptions#label" rel="noopener noreferrer"&gt;Crawlee documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Returning to data storage types, we can now find all the links our crawler has navigated through in the &lt;code&gt;request_queues&lt;/code&gt; storage, located in the crawler's &lt;code&gt;./storage/request_queues&lt;/code&gt; directory. Here, we can access detailed information about each request processed by the request queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwg9im75xbf1hllzmabt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwg9im75xbf1hllzmabt.png" alt="Request Queue" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Crawlee request list&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-list" rel="noopener noreferrer"&gt;request list&lt;/a&gt; differs from the request queue as it's not a form of storage in the conventional sense. Instead, it's a predefined collection of URLs for the crawler to visit.&lt;/p&gt;

&lt;p&gt;This approach is particularly suited for situations where you have a set of known URLs to crawl and don't plan to add new ones as the crawl progresses. Essentially, the request list is set in stone once created, with no option to modify it by adding or removing URLs.&lt;/p&gt;

&lt;p&gt;To demonstrate this concept, we'll modify our template to utilize a predefined set of URLs in the request list rather than the request queue. We'll begin with adjustments to the &lt;code&gt;main.js&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;main.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { CheerioCrawler, RequestList } from 'crawlee';import { router } from './routes.js';const sources = [{ url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/docs/3.0/quick-start' }, { url: 'https://crawlee.dev/api/core' },];const requestList = await RequestList.open('my-list', sources);const crawler = new CheerioCrawler({ requestList, requestHandler: router,});await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this new approach, we created a predefined list of URLs, named &lt;code&gt;sources&lt;/code&gt;, and passed it into a newly created &lt;code&gt;RequestList&lt;/code&gt;, which was then passed into our crawler object.&lt;/p&gt;

&lt;p&gt;As for the &lt;code&gt;routes.js&lt;/code&gt; file, we simplified it to include just a single request handler. This handler is now responsible for executing the data extraction logic on the URLs specified in the request list.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;routes.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createCheerioRouter } from 'crawlee';export const router = createCheerioRouter();router.addDefaultHandler(async ({ request, $, log, pushData }) =&amp;gt; { log.info(`Extracting data...`); const title = $('title').text(); log.info(`${title}`, { url: request.loadedUrl }); await pushData({ url: request.loadedUrl, title, });});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Following these modifications, when you run your code, you'll observe that only the URLs explicitly defined in our request list are being crawled.&lt;/p&gt;

&lt;p&gt;This brings us to an important distinction between the &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-list" rel="noopener noreferrer"&gt;two types of request storages&lt;/a&gt;. The request queue is dynamic, allowing for the addition and removal of URLs as needed. On the other hand, the request list is static once initialized and is not meant for dynamic changes.&lt;/p&gt;

&lt;p&gt;With request storage out of the way, let's now explore result storage in Crawlee, starting with datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Crawlee datasets&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://crawlee.dev/api/types/interface/Dataset" rel="noopener noreferrer"&gt;Datasets&lt;/a&gt; in Crawlee serve as repositories for structured data, where every entry possesses consistent attributes.&lt;/p&gt;

&lt;p&gt;Datasets are designed for append-only operations: we can only add new records to a dataset, not alter or delete existing ones. Each project run in Crawlee is linked to a default dataset, which is commonly used to store the results of that run's web crawling activities.&lt;/p&gt;

&lt;p&gt;You might have noticed that each time we ran the crawler, the folder &lt;code&gt;./storage/datasets&lt;/code&gt; was populated with a series of JSON files containing extracted data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5awwq5f9vrp9l4rlx0un.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5awwq5f9vrp9l4rlx0un.png" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Storing scraped data into a dataset is &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-list" rel="noopener noreferrer"&gt;remarkably simple&lt;/a&gt; using Crawlee's &lt;code&gt;Dataset.pushData()&lt;/code&gt; function. Each invocation of &lt;code&gt;Dataset.pushData()&lt;/code&gt; generates a new table row, with the property names of your data serving as the column headings. By default, these rows are stored as JSON files on your disk. However, Crawlee allows you to integrate other storage systems as well.&lt;/p&gt;

&lt;p&gt;For a practical example, let's take another look at the &lt;code&gt;addDefaultHandler&lt;/code&gt; function within &lt;code&gt;routes.js&lt;/code&gt;. Here, you can see how we used the &lt;code&gt;pushData()&lt;/code&gt; function to append the scraped results to the dataset.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;routes.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;router.addDefaultHandler(async ({ request, $, log, pushData }) =&amp;gt; { log.info(`Extracting data...`); const title = $('title').text(); log.info(`${title}`, { url: request.loadedUrl }); await pushData({ url: request.loadedUrl, title, });});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Key-value store&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;key-value sto&lt;/a&gt;&lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;re in Crawlee i&lt;/a&gt;s &lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;designed for st&lt;/a&gt;oring and retrieving data records or files. Each record is tagged with a unique key and linked to a specific MIME content type. This feature makes it perfect for storing various types of data, such as screenshots, PDFs, or even for maintaining the state of crawlers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Saving screenshots&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To showcase the flexibility of the &lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;key-value store&lt;/a&gt; in Crawlee, let's take a screenshot of each page we crawl and save it using Crawlee's key-value store.&lt;/p&gt;

&lt;p&gt;However, to do that, we need to switch our crawler from CheerioCrawler to PuppeteerCrawler. The good news is that adapting our code to different crawlers is quite straightforward. For this demonstration, we'll temporarily set aside the &lt;code&gt;routes.js&lt;/code&gt; file and concentrate our crawler logic in the &lt;code&gt;main.js&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;To get started with PuppeteerCrawler, the first step is to install the Puppeteer library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install puppeteer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, adapt the code in your &lt;code&gt;main.js&lt;/code&gt; file as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;main.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { PuppeteerCrawler } from 'crawlee';// Create a PuppeteerCrawlerconst crawler = new PuppeteerCrawler({ async requestHandler({ request, saveSnapshot }) { // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Capture the screenshot await saveSnapshot({ key, saveHtml: false }); },});await crawler.addRequests([{ url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/docs/3.0/quick-start' }, { url: 'https://crawlee.dev/api/core' },]);await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the code above, we should see three screenshots, one for each page crawled, appear in our crawler's &lt;code&gt;key_value_stores&lt;/code&gt; storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq2r985ohebqewar73p9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq2r985ohebqewar73p9.png" width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Saving pages as PDF files&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Suppose we want to convert the page content into a PDF file and save it in the key-value store. This is entirely feasible with Crawlee. Thanks to Crawlee's PuppeteerCrawler being built upon Puppeteer, we can fully utilize all the native features of Puppeteer. To achieve this, we simply need to tweak our code a bit. Here's how to do it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { PuppeteerCrawler } from 'crawlee';// Create a PuppeteerCrawlerconst crawler = new PuppeteerCrawler({ async requestHandler({ page, request, saveSnapshot }) { // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Save as PDF await page.pdf({ path: `./storage/key_value_stores/default/${key}.pdf`, format: 'A4', }); },});await crawler.addRequests([{ url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/docs/3.0/quick-start' }, { url: 'https://crawlee.dev/api/core' },]);await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to the earlier screenshot example, executing this code will create three PDF files, each capturing the content of one of the visited pages. These files are saved into Crawlee's key-value store.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Doing more with your Crawlee scraper&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;That's it for an introduction to Crawlee's data storage types. As a next step, I encourage you to take your scraper to the next level by &lt;a href="https://crawlee.dev/docs/introduction/deployment" rel="noopener noreferrer"&gt;deploying it on the Apify platform as an Actor.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With your scraper running on the Apify platform, you gain access to Apify's extensive list of features tailored for web scraping jobs, like cloud storage and various data export options. Not sure how to do it? Don't worry, everything you need is in the &lt;a href="https://crawlee.dev/docs/deployment/apify-platform" rel="noopener noreferrer"&gt;Crawlee documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crawlee.dev/docs/introduction/deployment" rel="noopener noreferrer"&gt;Deploy your Crawlee scrapers on the Apify platform&lt;/a&gt;&lt;/p&gt;

</description>
      <category>crawlee</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Selenium page object model: what is POM and how can you use it?</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Tue, 26 Sep 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/percivalvillal3/selenium-page-object-model-what-is-pom-and-how-can-you-use-it-5420</link>
      <guid>https://dev.to/percivalvillal3/selenium-page-object-model-what-is-pom-and-how-can-you-use-it-5420</guid>
      <description>&lt;p&gt;As Selenium projects grow in complexity, maintaining and scaling test scripts can become challenging. This is where the Page Object Model (POM) steps in as a way for Selenium users to write more scalable and readable code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hey, we're&lt;/strong&gt; &lt;a href="https://apify.it/platform-pricing"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;. The Apify platform gives you access to 1,500+ web scraping and automation tools. Or you can build your own.&lt;/strong&gt; &lt;a href="https://apify.it/platform-pricing"&gt;&lt;strong&gt;Check us out&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is the Page Object Model (POM)?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At its core, the &lt;a href="https://www.selenium.dev/documentation/test_practices/encouraged/page_object_models/"&gt;Page Object Model (POM)&lt;/a&gt; is a design pattern used in Selenium automation to represent a web application's web pages or components as objects in code. Each web page is associated with a Page Object, and this object encapsulates the page's structure, elements, interactions, and intricacies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--87pgSI6g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/selenium-page-object-model-1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--87pgSI6g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/selenium-page-object-model-1.jpg" alt="Robot parts and tools in workshop with clocktower. Visual metaphor for page object model automated testing of web applications using Selenium and Python." width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Selenium POM is as finely balanced and intricate as clockwork&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why is POM essential for Selenium automation?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Imagine a scenario where you have a sizable Selenium test suite. Web pages change, elements get updated, and your tests require frequent adjustments. Without POM, managing this can become a nightmare. Test scripts often get cluttered with web element locators and actions, making them difficult to read and maintain. POM addresses these challenges by introducing the concept of Page Objects.&lt;/p&gt;

&lt;p&gt;Think of a Page Object as a blueprint for a web page. It contains methods and properties that allow you to interact with the page's elements (e.g., buttons, text fields, links) and perform actions (e.g., clicking, typing) on them. By creating Page Objects, you achieve a clear separation of concerns: your test scripts focus on test logic, while the Page Objects handle the web page's details.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Advantages of using POM&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintainability:&lt;/strong&gt; In large-scale automation projects, web pages often change. Elements get updated, added, or removed. Without a structured approach like POM, maintaining your test scripts becomes a nightmare. POM allows you to isolate changes to Page Objects, making updates more manageable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Readability:&lt;/strong&gt; POM promotes readable and maintainable test scripts. With Page Objects, your tests become more expressive, as you interact with elements using descriptive method names. This improves the overall clarity of your test cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reusability:&lt;/strong&gt; Page Objects are reusable components. When multiple tests interact with the same page, you can use the same Page Object in each test. If the page's structure changes, you only need to update the Page Object, not every test case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; POM scales well with the size of your automation project. As you add more test cases and pages, the structured approach provided by POM keeps your codebase organized and maintainable.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting up your environment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we dive into implementing the Page Object Model (POM) in Selenium, it's crucial to ensure your development environment is properly configured. In this section, we'll cover the necessary prerequisites and guide you through creating a Python project for Selenium automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To get started with Selenium and the Page Object Model, you'll need the following:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt;: Make sure you have Python installed on your system. You can download the latest version from the &lt;a href="https://www.python.org/"&gt;official Python website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Selenium&lt;/strong&gt;: Install the Selenium WebDriver library using Python's package manager, pip, by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install selenium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Creating a Selenium Project&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once you have the prerequisites in place, you can create a new Python project for your Selenium automation work by following the steps below, or &lt;a href="https://github.com/PerVillalva/selenium-pom-python"&gt;clone the GitHub repository with the final code&lt;/a&gt; for this tutorial.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create a Project Directory&lt;/strong&gt;: Create a directory in your desired location to store your Selenium project, and then navigate into that directory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Initialize a Python Virtual Environment (Optional)&lt;/strong&gt;: It's a good practice to work within a virtual environment to isolate your project's dependencies. Inside the project directory we created in the previous step, create a virtual environment using the following command:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Install Selenium&lt;/strong&gt;: Inside your virtual environment, install Selenium by running the following command:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;WebDrivers&lt;/strong&gt;: Selenium requires WebDriver executables for different browsers (e.g., Chrome, Firefox). You'll need to download the WebDriver for your preferred browser and ensure it's accessible from your system's PATH. You can find WebDriver downloads and installation instructions on the &lt;a href="https://www.selenium.dev/documentation/en/webdriver/driver_requirements/"&gt;official Selenium website&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Create Python Files and organize your project&lt;/strong&gt;: To organize our Selenium project, we will create Python files for Page Objects, test scripts, and any additional utilities we might require. We can structure our project by creating directories to categorize these components. This will help us keep our code base clean, easy to understand, and maintainable. As an example, here is the directory structure of the project we will work on during this article:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;project_root/ page_objects/ login_page.py ... test_cases/ base_test.py test_login.py ... utils/ locators.py ... ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HmWtQRNT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HmWtQRNT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-1.png" alt="Selenium POM directory" width="608" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Great, your environment is now set up and ready for Selenium automation with the Page Object Model. In the upcoming sections, we'll take a deeper look into the practical implementation of POM, starting with creating Page Objects to represent web pages.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Creating Page Objects&lt;/strong&gt;
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;What is a Page Object?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A Page Object is a Python class that represents a specific web page or a component of a web page. It encapsulates the structure and behavior of that page, including the web elements (e.g., buttons, input fields) and the actions you can perform on them (e.g., clicking, typing). Page Objects promote code reusability and maintainability by providing a clean and organized way to interact with web elements.&lt;/p&gt;

&lt;p&gt;So let's create our first Page Object:&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Define the Page Object class&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Create a Python class for the web page you want to represent. Give it a meaningful name, typically ending with "Page," to indicate its purpose.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pages/login_page.pyclass LoginPage(object): def __init__ (self, driver): self.driver = driver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we've created a &lt;code&gt;LoginPage&lt;/code&gt; class.&lt;/p&gt;

&lt;p&gt;Our goal will be to implement tests for a &lt;a href="https://practicetestautomation.com/practice-test-login/"&gt;dummy login page&lt;/a&gt; (thanks to &lt;a href="https://www.linkedin.com/in/dmitryshyshkin/"&gt;Dmitry Shyshkin&lt;/a&gt; for the website). We will create tests for three distinct scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Login successful&lt;/strong&gt; : User entered valid credentials.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invalid username&lt;/strong&gt; : User entered an invalid username.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invalid password&lt;/strong&gt; : User entered an invalid password.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lmzFRwAR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lmzFRwAR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-2.png" alt="Selenium POM: test login page" width="800" height="695"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Define web elements and actions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now we need a way to access the web elements and actions from within the Page Object class. To keep things organized, we'll create a separate file under the &lt;code&gt;utils&lt;/code&gt; directory to house all the locators we need:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# utils/locator.pyfrom selenium.webdriver.common.by import Byclass LoginPageLocators(object): USERNAME = (By.ID, 'username') PASSWORD = (By.ID, 'password') SUBMIT = (By.ID, 'submit') ERROR_MESSAGE = (By.ID, 'error')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Here, we've defined the locators &lt;code&gt;USERNAME&lt;/code&gt;, &lt;code&gt;PASSWORD&lt;/code&gt;, &lt;code&gt;SUBMIT&lt;/code&gt;, and &lt;code&gt;ERROR_MESSAGE&lt;/code&gt; based on the element IDs found on the target website.&lt;/p&gt;

&lt;p&gt;Once this is done, we have to import &lt;code&gt;locators.py&lt;/code&gt; and its contents into the &lt;code&gt;login_page.py&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# login_page.pyfrom utils.locators import *class LoginPage(object): def __init__ (self, driver): # Initialize the LoginPage object with a WebDriver instance. self.driver = driver # Import the locators for this page. self.locator = LoginPageLocators
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Implement methods&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Still within the &lt;code&gt;login_page.py&lt;/code&gt; file, our task is to define methods that represent the interactions we want to perform on the web page.&lt;/p&gt;

&lt;p&gt;All three previously discussed test cases involve attempting to log into an account. The login process essentially consists of entering the username and password and then clicking the "Submit" button.&lt;/p&gt;

&lt;p&gt;With these requirements in mind, we can design methods that precisely execute these actions. For example, the &lt;code&gt;enter_username&lt;/code&gt; method locates the username input field and inputs the provided username using the &lt;code&gt;send_keys&lt;/code&gt; function. The other methods in this class follow the same idea:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# login_page.pyfrom utils.locators import *from selenium.webdriver.support.ui import WebDriverWaitfrom selenium.webdriver.support import expected_conditions as EC# Define a class named LoginPage.class LoginPage(object): def __init__ (self, driver): # Initialize the LoginPage instance with a WebDriver object and locators. self.driver = driver self.locator = LoginPageLocators # Define a function to wait for the presence of an element on the page. def wait_for_element(self, element): WebDriverWait(self.driver, 5).until( EC.presence_of_element_located(element) ) # Define a function to enter a username into the corresponding input field. def enter_username(self, username): # Wait for the presence of the username input element. self.wait_for_element(self.locator.USERNAME) # Find the username input element and send the username string to it. self.driver.find_element(*self.locator.USERNAME).send_keys(username) # Define a function to enter a password into the corresponding input field. def enter_password(self, password): # Wait for the presence of the password input element. self.wait_for_element(self.locator.PASSWORD) # Find the password input element and send the password string to it. self.driver.find_element(*self.locator.PASSWORD).send_keys(password) # Define a function to click the login button. def click_login_button(self): # Wait for the presence of the login button element. self.wait_for_element(self.locator.SUBMIT) # Find the login button element and click it. self.driver.find_element(*self.locator.SUBMIT).click() # Define a function to perform a complete login by entering username and password. def login(self, username, password): self.enter_username(username) self.enter_password(password) self.click_login_button() # Define a function to perform a login with valid user credentials. def login_with_valid_user(self): self.login("student", "Password123") # Return a new instance of LoginPage after the login action. 
return LoginPage(self.driver) # Define a function to perform a login with an invalid username and return the error message. def login_with_invalid_username(self): self.login("student23", "Password123") # Wait for the presence of the error message element. self.wait_for_element(self.locator.ERROR_MESSAGE) # Return the text content of the error message element. return self.driver.find_element(*self.locator.ERROR_MESSAGE).text # Define a function to perform a login with an invalid password and return the error message. def login_with_invalid_password(self): self.login("student", "Password12345") # Wait for the presence of the error message element. self.wait_for_element(self.locator.ERROR_MESSAGE) # Return the text content of the error message element. return self.driver.find_element(*self.locator.ERROR_MESSAGE).text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You might have noticed that the last three methods are a little different. They use the high-level &lt;code&gt;login&lt;/code&gt; method we defined to perform the login action with specific username and password combinations. We will soon employ these methods to evaluate our test cases.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Writing test cases with POM&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;With the Page Object in place, we can now incorporate it into our test scripts. But first, in the interest of keeping our code modular and organized, let's create a &lt;code&gt;base_test.py&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;The purpose of this file is to serve as a repository for all the shared logic used across our tests. By centralizing this logic, we establish a convenient reference point whenever we need to generate new test files.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## base_test.pyimport unittestfrom selenium import webdriver# Define a test class named BaseTest that inherits from unittest.TestCase.class BaseTest(unittest.TestCase): # This method is called before each test case. def setUp(self): # Create a Chrome WebDriver instance. self.driver = webdriver.Chrome() # Navigate to the specified URL. self.driver.get("&amp;lt;https://practicetestautomation.com/practice-test-login/&amp;gt;") # This method is called after each test case. def tearDown(self): # Close the WebDriver, terminating the browser session. self.driver.close()# Check if this script is the main module to be executed.if __name__ == " __main__": # Run the test cases defined in this module unittest.main(verbosity=1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Test case 1: Logging in with valid user credentials&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that our base test is set up, we can begin developing the logic for our login test.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## test_login.pyfrom tests.base_test import BaseTestfrom pages.login_page import LoginPage# Define a test class named TestLogin that inherits from BaseTest.class TestLogin(BaseTest): # Define the first test method, which tests login with valid user credentials. def test_login_with_valid_user(self): # Initialize a LoginPage object with the self.driver attribute login_page = LoginPage(self.driver) # Call the login_with_valid_user method on the login_page object login_page.login_with_valid_user() # Use self.assertIn to check if the string "logged-in-successfully" # is present in the current URL of the driver. If present, the test passes. self.assertIn("logged-in-successfully", self.driver.current_url)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The defined method &lt;code&gt;test_login_with_valid_user&lt;/code&gt; serves as a test for our initial scenario: logging in using valid user credentials. For the test to succeed, we should see the text "logged-in-successfully" in the URL of the webpage right after submitting our credentials. If that's the case, a positive test feedback message will be printed in our terminal.&lt;/p&gt;

&lt;p&gt;To run the method, type the following command in your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -m unittest tests.test_login.TestLogin.test_login_with_valid_user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JOBVihn6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JOBVihn6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-3.png" alt="Selenium POM: successful login screen" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1MaeEMSG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1MaeEMSG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-4.png" alt="Selenium POM: log of test result" width="800" height="155"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Test case 2: Logging in with an invalid username&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With the method for our first test case out of the way, let's move on to the second scenario: logging in with an invalid username.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## test_login.py# ...# Define the second test method, which tests login with an invalid username. def test_login_with_invalid_username(self): # Initialize a LoginPage object with the self.driver attribute. login_page = LoginPage(self.driver) # Call the login_with_invalid_username method on the login_page object. # Assign the result to the variable result (error message). result = login_page.login_with_invalid_username() # Use self.assertIn to check if the string "Your username is invalid!" is # present in the result. If present, the test passes. self.assertIn("Your username is invalid!", result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The method &lt;code&gt;test_login_with_invalid_username&lt;/code&gt; tests for the second scenario: trying to log in using an invalid username. For the test to succeed, we should see the error message "Your username is invalid!" displayed on the screen right after clicking the Submit button. If that's the case, the test passes.&lt;/p&gt;

&lt;p&gt;To run the method, type the following command in your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m unittest tests.test_login.TestLogin.test_login_with_invalid_username
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jno28tji--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jno28tji--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-5.png" alt="Selenium POM: invalid username login screen" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Test case 3: Logging in with an invalid password&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Similar to the previous method, this one checks for a particular error message that should be displayed when the user enters a valid username together with an invalid password. The logic is almost the same, except that this time we expect a different error message.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# login_test.py# ...# Define the third test method, which tests login with an invalid password. def test_login_with_invalid_password(self): # Initialize a LoginPage object with the self.driver attribute. login_page = LoginPage(self.driver) # Call the login_with_invalid_password method on the login_page object. # Assign the result (error message) to the variable result. result = login_page.login_with_invalid_password() # Use self.assertIn to check if the string "Your password is invalid!" is # present in the result. If present, the test passes. self.assertIn("Your password is invalid!", result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The method &lt;code&gt;test_login_with_invalid_password&lt;/code&gt; tests for the third scenario: trying to log in using an invalid password. For the test to be successful, we should see the error message "Your password is invalid!" displayed on the screen immediately after clicking the "Submit" button. If this message appears, it signifies a passing test.&lt;/p&gt;

&lt;p&gt;To run the method, type the following command in your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m unittest tests.test_login.TestLogin.test_login_with_invalid_password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GALC6xSc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GALC6xSc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-6.png" alt="Selenium POM: invalid password login screen" width="800" height="786"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Running all tests&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that we have all three methods ready, we may want to execute them all together to test all of our test cases simultaneously. Here is the complete code:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tests.base_test import BaseTestfrom pages.login_page import LoginPage# Define a test class named TestLogin that inherits from BaseTest.class TestLogin(BaseTest): # Define the first test method, which tests login with valid user credentials. def test_login_with_valid_user(self): # Initialize a LoginPage object with the self.driver attribute, # which is likely a WebDriver instance for interacting with web pages. login_page = LoginPage(self.driver) # Call the login_with_valid_user method on the login_page object, # which is expected to perform a login action with valid credentials. login_page.login_with_valid_user() # Use self.assertIn to check if the string "logged-in-successfully" # is present in the current URL of the driver. If present, the test passes. self.assertIn("logged-in-successfully", self.driver.current_url) # Define the second test method, which tests login with an invalid username. def test_login_with_invalid_username(self): # Initialize a LoginPage object with the self.driver attribute. login_page = LoginPage(self.driver) # Call the login_with_invalid_username method on the login_page object. # Assign the result (likely an error message) to the variable result. result = login_page.login_with_invalid_username() # Use self.assertIn to check if the string "Your username is invalid!" is # present in the result. If present, the test passes. self.assertIn("Your username is invalid!", result) # Define the third test method, which tests login with an invalid password. def test_login_with_invalid_password(self): # Initialize a LoginPage object with the self.driver attribute. login_page = LoginPage(self.driver) # Call the login_with_invalid_password method on the login_page object. # Assign the result (likely an error message) to the variable result. result = login_page.login_with_invalid_password() # Use self.assertIn to check if the string "Your password is invalid!" is # present in the result. 
If present, the test passes. self.assertIn("Your password is invalid!", result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;To run all methods in the &lt;code&gt;TestLogin&lt;/code&gt; class at once, type the following command in your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m unittest tests.test_login.TestLogin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After a few seconds, you should see a similar message displayed on your terminal:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IkgHw5nC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IkgHw5nC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-7.png" alt="Selenium POM: terminal with test message" width="800" height="176"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Handling page navigation and dynamic elements&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In web testing and automation, it's common to encounter scenarios where web pages have dynamic elements, or your test cases require navigation between different pages. The Page Object Model (POM) provides an organized way to handle these challenges.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Handling dynamic elements&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Dynamic elements are elements on a web page that may load or change after the initial page load. Examples include elements that appear after a delay, elements generated via JavaScript, or elements with dynamic IDs or attributes.&lt;/p&gt;

&lt;p&gt;To handle dynamic elements with POM:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Include Dynamic Elements in Page Objects&lt;/strong&gt; : In your Page Object class, include dynamic elements as attributes. You can locate these elements using Selenium locators just like any other element.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Explicit Waits&lt;/strong&gt; : To ensure that dynamic elements are fully loaded before interacting with them, use Selenium's explicit waits. Explicit waits allow you to wait for specific conditions to be met before proceeding with the test.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
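&lt;p&gt;Under the hood, an explicit wait is essentially a poll-until-condition-or-timeout loop. Here is a rough, standard-library-only sketch of the idea behind &lt;code&gt;WebDriverWait.until&lt;/code&gt; (the &lt;code&gt;wait_until&lt;/code&gt; and &lt;code&gt;element_present&lt;/code&gt; names are illustrative, not Selenium API):&lt;/p&gt;

```python
import time

def wait_until(condition, timeout=5.0, poll=0.1):
    # Poll `condition` until it returns a truthy value or `timeout`
    # seconds elapse; a rough sketch of what an explicit wait does.
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        time.sleep(poll)

# Simulate a dynamic element that only "appears" on the third poll.
state = {'tries': 0}

def element_present():
    state['tries'] += 1
    return 'element' if state['tries'] >= 3 else None

print(wait_until(element_present, timeout=2.0, poll=0.01))
# → element
```

&lt;p&gt;Selenium's real implementation adds richer error handling and lets the condition be any of the &lt;code&gt;expected_conditions&lt;/code&gt; helpers, but the polling principle is the same.&lt;/p&gt;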

&lt;p&gt;Here's an example of how we used an explicit wait within our login Page Object to enhance the reliability of the tests we've just created:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## login_page.pyfrom utils.locators import *from selenium.webdriver.support.ui import WebDriverWaitfrom selenium.webdriver.support import expected_conditions as ECclass LoginPage(object): def __init__ (self, driver): self.driver = driver self.locator = LoginPageLocators # Define a function to wait for the presence of an element on the page. def wait_for_element(self, element): WebDriverWait(self.driver, 5).until( EC.presence_of_element_located(element) ) def enter_username(self, username): # Wait for the presence of the username input element. self.wait_for_element(self.locator.USERNAME) self.driver.find_element(*self.locator.USERNAME).send_keys(username)# ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, the &lt;code&gt;wait_for_element&lt;/code&gt; method waits for the element to be present using an explicit wait before running the rest of the code.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Handling page navigation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With POM, you can encapsulate page navigation within Page Objects, making your test scripts more modular.&lt;/p&gt;

&lt;p&gt;To handle page navigation with POM:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt;  &lt;strong&gt;Include Navigation Methods in Page Objects&lt;/strong&gt; : Create methods within your Page Objects for navigating to other pages. For example, you can have a &lt;code&gt;go_to_dashboard&lt;/code&gt; method in a &lt;code&gt;HomePage&lt;/code&gt; Page Object.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# home_page.pyclass HomePage: # ... def go_to_dashboard(self): self.driver.find_element(*self.dashboard_link).click()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt;  &lt;strong&gt;Reuse Page Objects&lt;/strong&gt; : After navigating to a new page, you can create an instance of the corresponding Page Object to continue interacting with that page. This promotes code reusability and maintains a clear structure.&lt;/p&gt;

&lt;p&gt;Here's an example of navigating from the login page to the dashboard page using Page Objects:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Import necessary Page Objectsfrom login_page import LoginPagefrom home_page import HomePage# ...# Instantiate LoginPage Page Object and perform loginlogin_page = LoginPage(driver)login_page.enter_username('your_username')login_page.enter_password('your_password')login_page.click_login_button()# Instantiate HomePage Page Object after successful loginhome_page = HomePage(driver)# Navigate to the dashboard pagehome_page.go_to_dashboard()# Create a DashboardPage Page Object to interact with the dashboarddashboard_page = DashboardPage(driver)# Perform actions on the dashboard pagedashboard_page.view_orders()dashboard_page.logout()# ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;By encapsulating page navigation and dynamic element handling within Page Objects, you maintain a structured and organized approach to your Selenium automation, making your test scripts more robust and maintainable.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Running tests and reporting&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Running your Selenium tests and generating reports are essential steps in any automation project. So far, we've been running our tests with &lt;code&gt;unittest&lt;/code&gt;. While the default test runner provides basic feedback, it's helpful to generate more informative test reports. We can achieve this by integrating test reporting libraries or frameworks.&lt;/p&gt;

&lt;p&gt;For example, we can use &lt;code&gt;pytest&lt;/code&gt; and the &lt;code&gt;pytest-html&lt;/code&gt; plugin to create basic HTML test reports for better visibility into our automation results.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Generating Basic Test Reports&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt; &lt;code&gt;pytest-html&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pytest-html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Run Tests with&lt;/strong&gt; &lt;code&gt;pytest&lt;/code&gt; and Generate HTML Report:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pytest --html=report.html test_login.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This command will run your tests and generate an HTML report named &lt;code&gt;report.html&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;View the HTML report&lt;/strong&gt; :&lt;/p&gt;

&lt;p&gt;Open the generated HTML report in a web browser to see detailed test results, including passed and failed test cases, error messages, and timestamps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Mn_P-G4w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Mn_P-G4w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/09/image-8.png" alt="" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This basic reporting setup provides a visual representation of our test execution, making it easier to identify issues and share results with our team.&lt;/p&gt;
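&lt;p&gt;If you generate reports regularly, you can also bake the flag into your pytest configuration instead of typing it on every run. A possible &lt;code&gt;pytest.ini&lt;/code&gt; at the project root (an assumed layout; the &lt;code&gt;--self-contained-html&lt;/code&gt; option inlines CSS and assets so the report is a single shareable file):&lt;/p&gt;

```ini
# pytest.ini (project root) -- assumed file; applies the report flags
# to every pytest run.
[pytest]
addopts = --html=report.html --self-contained-html
```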

&lt;p&gt;Remember that there are more advanced reporting and test management tools available that you can integrate into your automation framework for more comprehensive reporting, such as Allure, TestNG, or ExtentReports. But that's a topic for another article 😉&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Read more about Selenium&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we've explored the Page Object Model (POM) and how it can make our Selenium automation projects more scalable, readable, and, overall, more professional. But there's much more to Selenium and web automation, so check out our other Selenium posts:&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/selenium-grid-what-it-is-and-how-to-set-it-up/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--zpnCr-EI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/08/what-is-selenium-grid.jpg" height="546" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/selenium-grid-what-it-is-and-how-to-set-it-up/" rel="noopener noreferrer" class="c-link"&gt;
          Selenium Grid: what it is and how to set it up
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Learn about the Selenium Grid architecture and its use in large test suites, cross-browser testing, and continuous integration.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/how-to-handle-iframes-in-selenium/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--PsJ47qF_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w1200/2023/08/handling-iframes-in-selenium-webdriver-jigsaw-illustration.jpg" height="449" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/how-to-handle-iframes-in-selenium/" rel="noopener noreferrer" class="c-link"&gt;
          Selenium WebDriver: how to handle iframes
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Learn how to tackle iframes in Selenium WebDriver. Practical tips for switching frames and interacting with elements.
        &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  &amp;lt;div class="color-secondary fs-s flex items-center"&amp;gt;
      &amp;lt;img
        alt="favicon"
        class="c-embed__favicon m-0 mr-2 radius-0"
        src="https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png"
        loading="lazy" /&amp;gt;
    blog.apify.com
  &amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;/div&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-with-selenium-and-python/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--PA1v0SAR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w1200/2023/03/web-scraping-with-selenium-and-python.jpg" height="533" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-with-selenium-and-python/" rel="noopener noreferrer" class="c-link"&gt;
          Web scraping with Selenium and Python
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          A guide to web scraping in Selenium with code examples.
        &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  &amp;lt;div class="color-secondary fs-s flex items-center"&amp;gt;
      &amp;lt;img
        alt="favicon"
        class="c-embed__favicon m-0 mr-2 radius-0"
        src="https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png"
        loading="lazy" /&amp;gt;
    blog.apify.com
  &amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;

</description>
      <category>selenium</category>
      <category>python</category>
      <category>automation</category>
    </item>
    <item>
      <title>6 things you should know before buying or building a web scraper</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Tue, 22 Aug 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/percivalvillal3/6-things-you-should-know-before-buying-or-building-a-web-scraper-epg</link>
      <guid>https://dev.to/percivalvillal3/6-things-you-should-know-before-buying-or-building-a-web-scraper-epg</guid>
      <description>&lt;p&gt;We've been &lt;a href="https://apify.com/web-scraping"&gt;scraping the web at Apify&lt;/a&gt; for almost eight years now. We've built our &lt;a href="https://apify.com/"&gt;cloud platform&lt;/a&gt;, a popular &lt;a href="https://crawlee.dev/"&gt;open-source web scraping library&lt;/a&gt;, and hundreds of web scrapers for companies large and small. Thousands of developers from all over the world use our tech to build reliable scrapers faster, and many even sell them on &lt;a href="https://apify.com/store"&gt;Apify Store&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But we also failed countless times. We lost very large customers to dumb mistakes, and we struggled to unlock value for many of our users because of misaligned expectations. Perhaps we were naive, or just busy building a startup, but we often failed to realize that our customers were not experts in web scraping and that the things we thought obvious were very new and unexpected to them.&lt;/p&gt;

&lt;p&gt;The following 6 things you should know before buying or building a web scraper are a concentrated summary of what we've learned over the years and what we should've been telling our customers from Apify's day one.&lt;/p&gt;

&lt;p&gt;If you don't have a lot of experience with web scraping, you might find some of these things unexpected, or even shocking. But trust me, it's better to be shocked now than two months later, when your expensive scraper suddenly stops working.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Every website is different&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To the human eye, one website looks much like another, but underneath the buttons, images, and tables, they're all very different. That makes it hard to estimate how long a web scraping project will take, or how expensive it will be, before taking a thorough look at the target websites.&lt;/p&gt;

&lt;p&gt;With regular web applications, the complexity of a project is determined by your requirements and the features you need. With web scrapers, it's driven mostly by the complexity of the target website, which you have no control over. To determine the features the scraper will need to have, and also to identify potential roadblocks, developers must first analyze the website.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Common factors of web scraper complexity&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some of the most important factors that influence the cost and time to completion of a web scraping project are the following:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Anti-scraping protections&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Even though web scraping is perfectly legal, websites often try to block traffic they identify as coming from a web scraper. It's therefore essential for the scraper to appear as human-like as possible. This can be achieved using headless browsers and clever obfuscation techniques, but many of them increase the price of a project by orders of magnitude. A good initial analysis will identify the protections and provide a cost-efficient plan for overcoming them. Great web scraper developers are already familiar with most of the protections out there and they can reliably &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/"&gt;avoid being blocked&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are ready-made APIs on the market that promise to overcome almost any protection and blocking. Frankly, they're often quite good at it, and they make a lot of sense when you need results immediately, or when you're looking at low volumes. But at high volumes and for recurring use cases, your total cost of ownership will skyrocket. They're a great tool, but should not be used blindly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Architecture of the website&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some websites can be scraped very quickly and cheaply with simple HTTP requests and HTML or JSON parsing. Other websites require a &lt;a href="https://blog.apify.com/headless-browsers-what-are-they-and-how-do-they-work/"&gt;headless browser&lt;/a&gt; to access their data. Headless browser scrapers need a lot of CPU power and memory to operate, which makes them 10-20 times more expensive to run than HTTP scrapers. A great web scraper developer will always try to find a way to use an HTTP scraper by reverse engineering the website's architecture, but unfortunately, it's not always possible.&lt;/p&gt;

&lt;p&gt;Website updates and redesigns are out of your control and if an update introduces a new anti-scraping protection or changes the architecture, the costs of scraping may change dramatically. Or not at all. Sadly, you never know. The good news is that large website upgrades are fairly rare.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How fast and often do you need the data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Speed and frequency of web scraping have a profound impact on complexity. The faster you scrape, the more difficult it is to appear like a human user. Not only do you need more IP addresses, and more device fingerprints, but with super &lt;a href="https://blog.apify.com/what-is-large-scale-scraping-and-how-does-it-work/"&gt;large-scale scraping&lt;/a&gt;, you also have the non-trivial engineering overhead of managing and synchronizing tens or hundreds of concurrently running web scrapers.&lt;/p&gt;

&lt;p&gt;Finally, there's the issue of overloading the target website. &lt;a href="http://Amazon.com"&gt;Amazon.com&lt;/a&gt; can handle significantly more scraping than your local dentist's website, but in many cases, it's not easy to figure out how much traffic a website can handle safely if you want to &lt;a href="https://blog.apify.com/what-is-ethical-web-scraping-and-how-do-you-do-it/"&gt;scrape ethically&lt;/a&gt;. Great web scraper developers know how to pace their scrapers. When they make a mistake, they can quickly identify it and immediately downscale the scraping operation.&lt;/p&gt;

&lt;p&gt;Apify provides web scraper developers with sophisticated tooling that helps them analyze websites quickly, overcome anti-scraping protections, and deploy changes nearly instantly at any scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Websites change without asking&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Web scrapers are different from traditional software applications because even the best web scraper can stop working at any moment, without warning. The question is not if, but when.&lt;/p&gt;

&lt;p&gt;Web scrapers break because they are programmed to understand the structure of the websites they visit, and if the structure changes, the scraper will no longer be able to find the data it's looking for. Sometimes a human can't even spot the difference, but any time the website's HTML structure, APIs, or other components change, it can cause a scraping disruption.&lt;/p&gt;

&lt;p&gt;Professional web scraper programmers can reduce the chances of a scraper breaking, but never to zero. When a brand launches a full website redesign, the web scraper will need to be programmed again, from scratch. Reliable web scraping therefore requires &lt;a href="https://blog.apify.com/why-you-need-to-monitor-long-running-large-scale-scraping-projects/"&gt;constant monitoring of the target websites&lt;/a&gt; and of the scraper's performance.&lt;/p&gt;

&lt;p&gt;Some types of websites, like e-commerce stores or news sites, can be scraped using AI-enabled web scrapers that handle website changes automatically. But just like any AI, their results are only 80-90% correct. You have to decide if that's enough for your project or not.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How can you handle website changes?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First, think about how critical the data is for you on a scale from 1 to 3, with 1 being nice-to-have and 3 mission-critical. Think both in terms of how fast you need the data as well as about its importance. Do you need the data for a one-off data analysis? A monthly review? A real-time notification system? If you can't get the data in the expected quality, can you postpone the analysis, or will your production systems and integrations fail?&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1: Nice-to-have data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;That's data you can either wait for or don't mind missing. For those projects, it's best to simply accept as a fact that the websites can change before your next scrape and that you might have to fix the scraper. Regular maintenance often doesn't make sense, because it adds extra recurring costs that could exceed the price of a new scraper. The best course of action is to check whether the scraper still works, and if it doesn't, order an update or fix it yourself. This might take a few hours, days, or weeks, depending on the complexity of the scraper and how much money you want to spend.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2: Business-critical data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is data that's important for your project, but not having it at the right time and in the right quality won't threaten the existence of the project itself. For example, data for a weekly competitive pricing analysis. You won't be able to price as efficiently without it, but your business will continue. Most web scraping projects fall into this category. Here the best practice is to set up a monitoring system around the scrapers and to have developers ready to start fixing issues right away.&lt;/p&gt;

&lt;p&gt;The monitoring system must serve two functions. First, it must notify you in real time when something is wrong. Some scrapers can run for hours or days, and you don't want to wait until the end of the scrape to learn that all your data is useless. Second, it must provide detailed information about what is wrong: which pages failed to be scraped, how many items are invalid, which data points are missing, and so on.&lt;/p&gt;
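
&lt;p&gt;As a rough illustration, a health check covering both functions can be as small as this. The thresholds and field names here are made up for the example, not part of any real monitoring API:&lt;/p&gt;

```python
# Minimal sketch of a post-run health check: returns an empty list for a
# healthy run, or human-readable problem descriptions for the alert.
def check_run(stats, min_items=1000, max_invalid_ratio=0.05):
    problems = []
    if stats["items"] < min_items:
        problems.append(
            f"only {stats['items']} items scraped (expected at least {min_items})"
        )
    invalid_ratio = stats["invalid"] / max(stats["items"], 1)
    if invalid_ratio > max_invalid_ratio:
        problems.append(f"{invalid_ratio:.0%} of items are invalid")
    return problems  # empty list means the run looks healthy

print(check_run({"items": 1200, "invalid": 12}))  # []
print(check_run({"items": 400, "invalid": 80}))   # two problems reported
```

&lt;p&gt;In practice you'd run a check like this during the scrape, not only at the end, and wire the non-empty result into whatever alerting channel your team watches.&lt;/p&gt;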

&lt;p&gt;A robust monitoring system gives developers an early warning and valuable information to fix the web scraper as soon as possible. Still, if you don't have developers at hand to start debugging right away, the monitoring itself can't save you. It doesn't matter if you source your developers in-house or from a vendor, but ideally, you should have a dedicated developer (or a team) ready to jump in within a matter of hours or a day. Any experienced developer will do, but if you want a quick and reliable fix, you need a developer who's familiar with your scraper and the website it's scraping. Otherwise, they'll spend most of their time learning how the scraper works, which dramatically increases the cost of the update.&lt;/p&gt;

&lt;p&gt;A good monitoring system and dedicated developers come at a price (internal or external), but from our experience, they are necessary to ensure the reliable operation of business-critical web scrapers. It's similar to an insurance policy: a small regular payment, instead of risking a high-cost incident down the line. Don't worry though. You definitely don't need one full-time dev per scraper. Just someone who knows the project and can jump in at short notice to do a few hours of work.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3: Mission-critical data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This category includes data that your business simply can't live without, but also high-frequency analytics data with strict timing requirements. For example, data that needs to be scraped every day and delivered the same day before 5 a.m. When scraping the data usually takes 2 hours, the scraper starts at midnight, and your monitoring system reports that the scraper is broken, who's going to fix it between 3 a.m. and 5 a.m.?&lt;/p&gt;

&lt;p&gt;The best practice for high-quality mission-critical data is regular monitoring and maintenance of the scrapers, proactive testing, and performance analytics. Many issues with web scrapers can be caught early by regular health checks. Those are small scrapes that test various features of the target website. Depending on the requirements of the project, they can run every day, hour, or minute, and will give developers the earliest possible warning about issues they need to investigate.&lt;/p&gt;

&lt;p&gt;Apify includes a robust monitoring system with all plans. You can monitor scraper run times, costs, numbers of results, and many other metrics, and get notified immediately when your thresholds are missed. For customers of our &lt;a href="https://apify.com/enterprise"&gt;professional services&lt;/a&gt;, Apify offers SLAs with monitoring, dedicated developer capacity, and guaranteed uptime.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Small changes in web scraper specifications can cause dramatic changes in cost&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This applies to all software development, but maybe even more so to web scrapers, because the architecture of the scraper is largely dictated by the architecture of the target website, and not by your software architects. Let's illustrate this with an example.&lt;/p&gt;

&lt;p&gt;You want to build a web scraper that collects product information such as name, price, description, and stock availability from an e-commerce store. Your developer then finds that there are 1,000,000 products in the store. To scrape the information you need, the scraper has to visit ~1,000,000 pages. Using a simple HTTP crawler this will cost you, let's say, $100. Assuming you want fresh data weekly, the project will cost you ~$400 a month.&lt;/p&gt;

&lt;p&gt;Then you think it would be great to also get the product reviews. But to get all of them, the scraper needs to visit a separate page for each product. This doubles the scraping cost because the scraper must visit 2 million pages instead of 1 million. That's an extra ~$400 for a total of ~$800 a month.&lt;/p&gt;

&lt;p&gt;Finally, you realize that you would also like to know the delivery cost estimate that's displayed right under the price. Unfortunately, this estimate is computed dynamically on the page using a third-party service, and your developer tells you that you will need a headless browser to do the computation. This increases the price of scraping product details ~20 times. From $100 to $2,000.&lt;/p&gt;

&lt;p&gt;In total, those two relatively small adjustments pushed the monthly price of scraping from ~$400 to ~$8,400.&lt;/p&gt;
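
&lt;p&gt;That back-of-envelope math can be sketched in a few lines of Python. The unit costs are the illustrative figures from the example above, not real pricing:&lt;/p&gt;

```python
# Back-of-envelope cost model for the example above. All constants are
# the article's illustrative assumptions, not actual Apify pricing.
HTTP_COST_PER_M_PAGES = 100   # dollars per 1 million pages with an HTTP scraper
HEADLESS_MULTIPLIER = 20      # headless browser is roughly 20x the HTTP cost
RUNS_PER_MONTH = 4            # fresh data weekly


def monthly_cost(product_pages_m, review_pages_m=0, headless_products=False):
    """Monthly scraping cost in dollars, page counts given in millions."""
    per_run = product_pages_m * HTTP_COST_PER_M_PAGES
    if headless_products:
        per_run *= HEADLESS_MULTIPLIER  # delivery estimates need a headless browser
    per_run += review_pages_m * HTTP_COST_PER_M_PAGES  # review pages stay on HTTP
    return per_run * RUNS_PER_MONTH


print(monthly_cost(1))                             # 400  -- products only
print(monthly_cost(1, review_pages_m=1))           # 800  -- plus reviews
print(monthly_cost(1, 1, headless_products=True))  # 8400 -- plus delivery estimates
```

&lt;p&gt;The jump from 400 to 8,400 comes almost entirely from the headless-browser multiplier, which is why a good developer will fight to keep a scraper on plain HTTP requests.&lt;/p&gt;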

&lt;p&gt;The best approach to avoid costly surprises like that is thoroughly clarifying your requirements, both internally and externally. Focus on the outcomes you seek, rather than the features. &lt;a href="https://apify.com/enterprise"&gt;Experienced Apify consultants&lt;/a&gt; can help you prepare a great specification that will deliver the outcomes at the most efficient price point.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. There &lt;em&gt;are&lt;/em&gt; legal limits to what you can scrape&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Even though web scraping is perfectly legal, there are rules and regulations every web scraper must follow. In short, you need to be careful when scraping:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;personal data (emails, names, photos of people, birthdates, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;copyrighted content (videos, images, news, blog posts, etc.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;data that's only available after signing up (accepting terms of use)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Web scraper consultants and developers can give you guidance on your project, but they can't replace professional advice from your local lawyer. Laws and regulations are very different across the world. Web scraping professionals have seen a fair share of projects, and they can reasonably guess whether your own project will be regulated or not, but they're not globally certified lawyers.&lt;/p&gt;

&lt;p&gt;If you want to learn more, I have written an &lt;a href="https://blog.apify.com/is-web-scraping-legal/"&gt;extensive guide that covers the legality of web scraping&lt;/a&gt;. It includes detailed explanations of the above categories of data, up-to-date case law, and actionable tips that will help you decide whether you could benefit from talking to a lawyer.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Start with a proof of concept for your web scraper&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Even though you might want to kick off your new initiative and reap the benefits of web data at full scale right away, I strongly recommend starting with a proof of concept or an MVP.&lt;/p&gt;

&lt;p&gt;As I explained earlier, all web scraping projects venture into uncharted territory. Websites are controlled by third parties, and they don't guarantee any sort of uptime or data quality. Sometimes you'll find that they're grossly over-reporting the number of available products. Other times the website changes so often that the web scraper maintenance costs become unbearable. Remember Twitter (now X) and their shenanigans with the public availability of tweets.&lt;/p&gt;

&lt;p&gt;The inherent unpredictability of web scraping can be mitigated by approaching it as an R&amp;amp;D project. Build the minimal first version, learn, and iterate. If you're looking to scrape 100 competitors, start by validating your ideas on the first 5. Choose the most impactful ones or the ones that your developers view as the most difficult to scrape. Make sure the ROI is there, and armed with those learnings, start building the next batch of websites. You will get better results, faster, and at a more competitive price this way, I promise.&lt;/p&gt;

&lt;p&gt;The Apify team recommends starting with a PoC on all projects. Even high-profile customers are often uncertain about the outcomes web scraping can bring to their organizations. Starting small helps them get buy-in from key stakeholders and onboard their teams properly. The customers also appreciate the flexibility a PoC allows, because they can start seeing results in a matter of days or weeks, and if something doesn't add up on their end, they can quickly pivot the project or request changes in the specification.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Prepare for turning data into insights&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Mining companies are experts in mining ore. Web scraping companies and developers are experts in mining data. And just like mining companies aren't the best vendors for building bridges or car engines using the mined ore, web scraping companies often have limited experience with banking, automotive, fashion, or any other complex business domain.&lt;/p&gt;

&lt;p&gt;Before you buy or build a web scraper, you must ask yourself whether your vendor or your team has the skills and the capacity to turn the raw data into actionable insights. Web scraping is only the first part of the process that unlocks new business value.&lt;/p&gt;

&lt;p&gt;From time to time, our customers simply weren't ready to process the vast amount of data web scraping offered. This led to sunk costs and the downscaling of their projects over time. They had the data, but they could not find the insights, which led to poor ROI.&lt;/p&gt;

&lt;p&gt;At Apify we actively ask our customers about their domain expertise expectations before starting any custom project. In situations where we miss the relevant skills in our team, we transparently leverage our partners. They specialize in specific domains like competitive intelligence, natural language processing, or web application development.&lt;/p&gt;

&lt;p&gt;We also require the customers of our professional services to dedicate internal resources to the project. Without that, it's unlikely that a web scraping project will succeed in the long term.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Anything else?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Yes, there are about a million things that can go wrong in a web scraping project. But that's true in any field of human activity. Whether you choose to develop with &lt;a href="https://crawlee.dev/"&gt;open-source tools&lt;/a&gt;, use &lt;a href="https://apify.com/store"&gt;ready-made web scrapers&lt;/a&gt; or buy a &lt;a href="https://apify.com/enterprise"&gt;fully-managed service&lt;/a&gt;, a little due diligence will go a long way. And if you understand and follow the 6 recommendations above, I'm confident that your web scrapers will be set up for long-term success.&lt;/p&gt;

</description>
      <category>webscraping</category>
    </item>
    <item>
      <title>Selenium Grid: what it is and how to set it up</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Wed, 02 Aug 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/apify/selenium-grid-what-it-is-and-how-to-set-it-up-29mg</link>
      <guid>https://dev.to/apify/selenium-grid-what-it-is-and-how-to-set-it-up-29mg</guid>
      <description>&lt;p&gt;Explore Selenium Grid use cases in large test suites, cross-browser testing, and continuous integration. Check the steps for setting up Selenium Grid and practical tips for efficient parallel test execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Selenium Grid?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Selenium Grid is a powerful tool that enhances the efficiency of Selenium test automation by allowing tests to be executed in parallel across multiple machines and web browsers. It acts as a test execution environment where tests can be distributed and run on various Selenium Grid Nodes simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What are the benefits of using Selenium Grid?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Its distributed testing capability makes Selenium Grid an invaluable resource for reducing test execution time and achieving faster feedback in the development cycle.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduced test execution time:&lt;/strong&gt; with parallel test execution, Selenium Grid significantly reduces the time required to execute test suites, as multiple tests run concurrently on different Nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improved test coverage:&lt;/strong&gt; Selenium Grid enables testing on various browser and operating system combinations, ensuring better test coverage and identifying cross-browser compatibility issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost-effective:&lt;/strong&gt; by leveraging existing infrastructure and reusing test scripts, Selenium Grid helps optimize resource utilization, making it cost-effective for large-scale test automation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficient test feedback:&lt;/strong&gt; faster test execution and parallelization provide quicker feedback to developers, enabling them to identify and fix issues promptly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; Selenium Grid's distributed architecture allows easy scaling by adding more Nodes, accommodating increased testing demands as projects grow.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;When to use Selenium Grid?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Selenium Grid becomes particularly advantageous in scenarios where its distributed testing capabilities can significantly enhance test automation efficiency and effectiveness.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Large test suites and parallel execution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When dealing with extensive test suites that take a long time to execute sequentially, Selenium Grid can parallelize test execution across multiple Nodes. This dramatically reduces the overall test execution time and provides faster feedback to the development team.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cross-browser and cross-platform testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To ensure your web application functions correctly across different browsers and operating systems, Selenium Grid allows you to execute tests on a variety of browser configurations concurrently. This helps identify compatibility issues early in the development process.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Scaling test infrastructure&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As your test automation requirements grow, Selenium Grid facilitates horizontal scaling by adding more nodes to the grid. This scalability ensures that your test infrastructure can accommodate increased testing demands without sacrificing execution speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Continuous integration (CI) pipelines&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In CI/CD pipelines, where frequent code changes trigger automated testing, Selenium Grid's parallel execution capability becomes indispensable. It allows you to execute multiple tests simultaneously on various Nodes, speeding up the testing process and ensuring rapid feedback.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Geographically distributed testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When your application caters to users from different regions, Selenium Grid can set up Nodes on geographically distributed machines. This approach allows you to verify the application's functionality and performance in different network conditions and locations.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example scenario: e-commerce website testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Consider an e-commerce website with a large number of test cases to validate its functionalities. Without Selenium Grid, running these tests sequentially could take hours. However, by leveraging Selenium Grid, we can divide the test suite across multiple Nodes, each capable of testing on different browser and OS combinations. As a result, we can significantly reduce test execution time, enabling faster feedback for developers and testers.&lt;/p&gt;
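
&lt;p&gt;The speedup comes purely from running cases concurrently. Here's a minimal Python sketch of the idea, with stub tests standing in for real browser sessions; on an actual grid, each worker would drive its own remote WebDriver session instead of sleeping:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor
import time


def run_test(case):
    # Stand-in for a real Selenium test case; sleeping simulates the time
    # a browser session would spend loading pages and asserting on them.
    time.sleep(0.01)
    return f"{case}: passed"


cases = [f"test_checkout_{i}" for i in range(8)]

# Four workers play the role of four Grid Nodes executing in parallel,
# so the wall-clock time is roughly a quarter of the sequential run.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_test, cases))

print(len(results))  # 8
```

&lt;p&gt;The Hub's job is essentially this dispatching, done across machines and browsers rather than threads.&lt;/p&gt;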

&lt;h2&gt;
  
  
  &lt;strong&gt;Selenium Grid architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Selenium Grid's architecture is designed to facilitate parallel test execution across multiple Nodes, enabling efficient distribution of test cases and providing faster results. The architecture consists of two main components: the Hub and the Nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.selenium.dev/documentation/legacy/selenium_3/grid_components/#hub" rel="noopener noreferrer"&gt;&lt;strong&gt;The Hub&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The Selenium Grid Hub serves as the central control point for test execution. It receives test requests from clients (test scripts) and manages the distribution of these tests to available Nodes. The Hub acts as a mediator between clients and Nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.selenium.dev/documentation/legacy/selenium_3/grid_components/#nodes" rel="noopener noreferrer"&gt;&lt;strong&gt;Nodes&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Nodes are the execution environments where tests run. Each Node registers itself with the Hub, indicating its availability for test execution. Nodes can be configured with various browser and OS combinations, offering a diverse testing environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Communication flow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A test script (client) requests a new session from the Hub by specifying the desired browser and platform (e.g., Chrome on Windows).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Hub examines its registry of available Nodes and forwards the test request to an appropriate Node capable of fulfilling the desired capabilities.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The selected Node launches the specified browser with the desired configuration and establishes a new WebDriver session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The test script communicates with the browser via the WebDriver session on the Node for test execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Test results and status are reported back to the Hub, which forwards them to the client.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
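
&lt;p&gt;From the client side, the flow above boils down to pointing a remote WebDriver at the Hub. Here's a minimal Python sketch; it assumes a Hub on the default local address, and &lt;code&gt;grid_url&lt;/code&gt; is just a helper written for this example:&lt;/p&gt;

```python
def grid_url(host="localhost", port=4444):
    """Address of the Selenium Grid Hub (4444 is the Selenium 4 default)."""
    return f"http://{host}:{port}"


def run_smoke_test(url="https://example.com"):
    # Imported here so the sketch stays importable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    # Step 1: request a Chrome session from the Hub, which forwards it
    # to a matching Node (steps 2-3).
    driver = webdriver.Remote(command_executor=grid_url(), options=Options())
    try:
        driver.get(url)          # step 4: drive the browser on the Node
        return driver.title      # step 5: result flows back through the Hub
    finally:
        driver.quit()


print(grid_url())  # http://localhost:4444
```

&lt;p&gt;Calling &lt;code&gt;run_smoke_test()&lt;/code&gt; with a Hub and at least one Chrome-capable Node running should return the page title of the target URL.&lt;/p&gt;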

&lt;h3&gt;
  
  
  &lt;strong&gt;Load balancing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Selenium Grid employs a load balancing mechanism to ensure efficient utilization of available Nodes. When multiple Nodes with similar desired capabilities are present, the Hub distributes test requests across these Nodes, optimizing resource usage and reducing test execution time.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling failures&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Selenium Grid provides robust mechanisms to handle Node failures gracefully. If a Node becomes unresponsive during test execution, the Hub reassigns the affected test cases to other available Nodes, ensuring that the overall test suite continues running smoothly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Web browser drivers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Each Node in Selenium Grid must have the corresponding web browser driver installed. For example, if a Node is configured to run tests on Chrome, it should have the ChromeDriver installed and properly configured.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to set up Selenium Grid&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Java 11 or higher installed (&lt;a href="https://www.java.com/en/download/" rel="noopener noreferrer"&gt;download link&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Browser(s) installed (e.g., Chromium, Firefox, Safari)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Browser drivers (e.g., ChromeDriver, GeckoDriver)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add the browser drivers' location to the system &lt;strong&gt;PATH&lt;/strong&gt; or place them in a directory accessible to the Python scripts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download the Selenium Server &lt;code&gt;jar&lt;/code&gt; file from the &lt;a href="https://github.com/SeleniumHQ/selenium/releases/latest" rel="noopener noreferrer"&gt;latest release&lt;/a&gt; (at the time of writing of this article, the latest release was &lt;a href="https://github.com/SeleniumHQ/selenium/releases/download/selenium-4.10.0/selenium-server-4.10.0.jar" rel="noopener noreferrer"&gt;Selenium Server version 4.10.0&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Start Selenium Grid Hub&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Open a terminal or command prompt and run the following command to start the Selenium Grid Hub, making sure to start the grid from the same folder where the &lt;code&gt;jar&lt;/code&gt; file is located:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java -jar selenium-server-&amp;lt;version&amp;gt;.jar hub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;👉 If you have downloaded a jar file from a different version of Selenium Server, replace the &lt;code&gt;&amp;lt;version&amp;gt;&lt;/code&gt; placeholder in the command with your version number.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once the command finishes running, we will receive a message indicating that the Hub has been successfully started:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkv5cw7s4urzuwm7y3zp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkv5cw7s4urzuwm7y3zp.png" alt="Selenium Grid Hub Local Host" width="800" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example of the message shown when the Hub has started&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To ensure that everything worked as intended, visit the local URL where the Selenium Grid Hub started. Since we have not registered any Nodes yet, you should see a screen similar to the one below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftdsmh6mf7rtzlu9ouh5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fftdsmh6mf7rtzlu9ouh5.png" alt="Selenium Grid Hub: visiting local URL" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Make sure that everything works correctly by visiting the local URL&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Add Nodes to the Hub&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To register a Node, open a new terminal and run the command below. At startup, the Node scans the system PATH to identify and make use of the available browser drivers for test execution. Note that the command assumes that both the Node and the Hub are running on the same machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java -jar selenium-server-&amp;lt;version&amp;gt;.jar node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can register multiple Nodes with different desired capabilities to test on various browser configurations. For example, we can register an additional Node on port &lt;code&gt;6666&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java -jar selenium-server-4.10.0.jar node --port 6666
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Verify Hub and Nodes&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To verify which Nodes are registered with our Hub, we can open a web browser and navigate to &lt;a href="https://www.selenium.dev/documentation/legacy/selenium_3/grid_components/#hub" rel="noopener noreferrer"&gt;&lt;code&gt;http://localhost:4444/ui&lt;/code&gt;&lt;/a&gt;. This will display the Selenium Grid console, showing the registered Hub and Nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhyjyxhqi1idr5vcwhb6p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhyjyxhqi1idr5vcwhb6p.png" alt="Selenium Grid Nodes" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Verify what Nodes are open on the Hub&lt;/em&gt;&lt;/p&gt;
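&lt;p&gt;You can also check the Grid's readiness programmatically. Selenium Grid 4 exposes a &lt;code&gt;/status&lt;/code&gt; endpoint (for example, &lt;code&gt;http://localhost:4444/status&lt;/code&gt;) that returns a JSON payload. The sketch below parses that payload; the sample response is a trimmed, illustrative shape rather than a full Grid response:&lt;/p&gt;

```python
import json

def grid_is_ready(status_payload):
    """Return True if the Grid reports itself ready and has at least one Node.

    Expects the JSON shape returned by Selenium Grid 4's /status endpoint.
    """
    value = status_payload.get("value", {})
    return bool(value.get("ready")) and bool(value.get("nodes"))

# Trimmed, illustrative payload of the kind /status returns:
sample = json.loads('{"value": {"ready": true, "message": "UP", "nodes": [{"id": "n1"}]}}')
print(grid_is_ready(sample))
```

&lt;p&gt;In a real setup, you would fetch the payload with &lt;code&gt;requests.get("http://localhost:4444/status").json()&lt;/code&gt; before passing it to the helper.&lt;/p&gt;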

&lt;h2&gt;
  
  
  &lt;strong&gt;Running tests using Selenium Grid&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now, let's create a simple test script in Python to demonstrate how to run tests using Selenium Grid. In this example, we will use the Selenium WebDriver with Python to open &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt; and extract the text content of its title and description in multiple browsers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium import webdriver
from selenium.webdriver.common.by import By

# Define the URL for the Selenium Grid hub
hub_url = '&amp;lt;http://192.168.1.221:4444&amp;gt;'

# Create browser options for Chrome
chrome_options = webdriver.ChromeOptions()

# Create browser options for Firefox
firefox_options = webdriver.FirefoxOptions()

# Connect to the Selenium Grid hub and create a remote WebDriver instance

# Chrome
driver_chrome = webdriver.Remote(
    command_executor=hub_url,
    options=chrome_options
)

driver_chrome.get("&amp;lt;https://apify.com/store&amp;gt;")

chrome_data = {
    "page_title": driver.find_element(By.CSS_SELECTOR, "header &amp;gt; div &amp;gt; h1").text,
    "page_description": driver.find_element(By.CSS_SELECTOR, "header &amp;gt; div &amp;gt; p").text
}

print(chrome_data)

# Firefox
driver_firefox = webdriver.Remote(
    command_executor=hub_url,
    options=firefox_options
)

driver_firefox.get("&amp;lt;https://apify.com/store&amp;gt;")

firefox_data = {
    "page_title": driver.find_element(By.CSS_SELECTOR, "header &amp;gt; div &amp;gt; h1").text,
    "page_description": driver.find_element(By.CSS_SELECTOR, "header &amp;gt; div &amp;gt; p").text
}

print(firefox_data)

driver.quit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code snippet above demonstrates a test scenario where the page title and description are extracted from the Apify Store website using both Chrome and Firefox browsers.&lt;/p&gt;

&lt;p&gt;With this code, we can run the same test on different browsers by creating a separate WebDriver instance for each browser and connecting it to our Selenium Grid Hub. As written, the sessions run one after the other; to run them concurrently, launch each session from its own thread or process.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling test failures&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In distributed testing scenarios, it is crucial to handle test failures efficiently. If a Node becomes unresponsive during test execution, the Hub stops routing new sessions to it and assigns them to other available Nodes, keeping the rest of the test suite running smoothly. Re-running the tests that were interrupted is the responsibility of your test runner or client code.&lt;/p&gt;
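&lt;p&gt;On the client side, a lightweight retry wrapper is a common complement for flaky sessions. The sketch below is illustrative: &lt;code&gt;flaky_test&lt;/code&gt; simulates a session failing twice before succeeding, and real code would catch &lt;code&gt;selenium.common.exceptions.WebDriverException&lt;/code&gt; rather than &lt;code&gt;RuntimeError&lt;/code&gt;:&lt;/p&gt;

```python
import time

def run_with_retry(test_fn, retries=2, delay=0.0, retriable=(RuntimeError,)):
    """Run test_fn, retrying up to `retries` extra times on retriable errors."""
    last_error = None
    for _attempt in range(retries + 1):
        try:
            return test_fn()
        except retriable as exc:
            last_error = exc
            time.sleep(delay)  # back off before requesting a new session from the Hub
    raise last_error

# Stand-in for a Grid test: fails twice, then succeeds.
calls = {"n": 0}
def flaky_test():
    calls["n"] += 1
    if calls["n"] != 3:
        raise RuntimeError("node became unresponsive")
    return "page title"

print(run_with_retry(flaky_test, retries=3))  # "page title" after two retries
```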

&lt;h3&gt;
  
  
  &lt;strong&gt;Parallel test execution tips&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Organize test cases in a way that allows for easy parallelization and avoids dependencies between tests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Divide your test suite into smaller chunks to distribute the load evenly across Nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consider setting up a dedicated test infrastructure for Selenium Grid to ensure stability and optimal performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
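&lt;p&gt;The pattern behind those tips can be sketched with Python's &lt;code&gt;concurrent.futures&lt;/code&gt;: one worker per browser configuration, so the Hub can farm each session out to a Node. The &lt;code&gt;run_test&lt;/code&gt; function here is a placeholder for creating a &lt;code&gt;webdriver.Remote&lt;/code&gt; session and running the actual assertions:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Browser configurations to cover; in a real suite these would be
# webdriver.ChromeOptions() / webdriver.FirefoxOptions() instances.
CONFIGS = ["chrome", "firefox", "edge"]

def run_test(browser_name):
    # Placeholder for: webdriver.Remote(command_executor=hub_url, options=...),
    # the test assertions, and driver.quit() in a finally block.
    return browser_name + ": ok"

# One worker per configuration; pool.map preserves input order.
with ThreadPoolExecutor(max_workers=len(CONFIGS)) as pool:
    results = list(pool.map(run_test, CONFIGS))

print(results)
```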

&lt;p&gt;If you're not sure that Selenium is the right testing framework for you, check out this detailed post on &lt;a href="https://blog.apify.com/cypress-vs-selenium/" rel="noopener noreferrer"&gt;Cypress vs. Selenium&lt;/a&gt;. Or find out whether Selenium is the best choice for &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; in &lt;a href="https://blog.apify.com/playwright-vs-selenium-webscraping/" rel="noopener noreferrer"&gt;Playwright vs. Selenium&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>selenium</category>
      <category>testing</category>
    </item>
    <item>
      <title>Python and machine learning</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Sun, 30 Jul 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/apify/python-and-machine-learning-174b</link>
      <guid>https://dev.to/apify/python-and-machine-learning-174b</guid>
      <description>&lt;p&gt;Learn how Python and machine learning intersect to solve complex problems that defeat traditional programming methods. Find out about Pandas, TensorFlow, Scikit-learn, and how they can transform data.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is machine learning?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Machine learning, a subset of artificial intelligence (AI), is a rapidly evolving field with numerous practical applications in various domains. Recently, the popularity and impact of AI, exemplified by advancements like &lt;a href="https://blog.apify.com/gpt-scraper-chatgpt-access-internet/" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt;, have boosted interest in the field and its potential to enhance our daily lives. But what exactly is machine learning and when would we want to use it? And how does Python fit in with machine learning?&lt;/p&gt;

&lt;p&gt;To answer these questions, let's consider an example to understand its significance. Imagine you're tasked with developing a program to analyze an image and determine whether it contains a cat, a dog, or another animal. To accomplish such a broad task, traditional programming techniques would quickly lead to overwhelming and time-consuming complexity. Devising multiple rules to detect curves, edges, and colors in the image would be prone to flaws. For example, black-and-white photos would require rule revisions, and unanticipated angles of cats or dogs would make any rules we create ineffective. In other words, attempting to solve this problem through traditional programming methods would prove excessively complicated or even impossible.&lt;/p&gt;

&lt;p&gt;And this is where &lt;a href="https://blog.apify.com/what-is-machine-learning-doing-for-us/" rel="noopener noreferrer"&gt;machine learning&lt;/a&gt; comes into play. It offers a technique for us to address such problems effectively. Instead of relying on explicit programming rules, we can construct a model or an engine and provide it with an abundance of data. For instance, to solve our dogs and cats problem, we could supply thousands or even tens of thousands of pictures of cats and dogs to a model that would then analyze this input data and learn its patterns autonomously.&lt;/p&gt;

&lt;p&gt;Now, suppose we present the model with a new, unseen picture of a cat and inquire whether the picture depicts a cat, a dog, or a horse. The model, based on its learned patterns, will provide us with a response, accompanied by a certain level of accuracy. The &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;more data we feed into the model&lt;/a&gt;, the better its accuracy becomes, especially if the data is relevant and high quality.&lt;/p&gt;

&lt;p&gt;Although this example is simplistic, machine learning has extensive applications, including self-driving cars, robotics, natural language processing, image recognition, and forecasting, such as predicting stock market trends or weather patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How Python and machine learning come together&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;That all sounds great, but what can we use to build those models? While there is no single best programming language for machine learning, Python has emerged as the de facto language for machine learning due to its simplicity, flexibility, and vibrant ecosystem of libraries and tools.&lt;/p&gt;

&lt;p&gt;In this article, we will explore the &lt;a href="https://blog.apify.com/what-are-the-best-python-web-scraping-libraries/" rel="noopener noreferrer"&gt;best Python libraries&lt;/a&gt; for developing machine-learning models, such as Pandas, TensorFlow, Scikit-learn, and more, to understand their role in the various stages of the machine-learning process.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5 steps in developing a machine learning model with Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Developing a machine learning model involves several essential steps that collectively form a pipeline from data preparation to model deployment. Understanding these steps is crucial for building effective and accurate machine-learning models. Let's take a quick look at each step and what popular Python libraries we could use to fulfill the requirements of each step:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Data preparation and exploration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Data preparation and exploration lay the foundation for any successful machine-learning project. This step involves tasks such as &lt;strong&gt;data cleaning&lt;/strong&gt;, &lt;strong&gt;handling missing values&lt;/strong&gt;, &lt;strong&gt;feature scaling&lt;/strong&gt;, and &lt;strong&gt;data visualization&lt;/strong&gt;. Properly preparing and exploring the data can help identify patterns, outliers, and relationships that will influence the model's performance.&lt;/p&gt;

&lt;p&gt;To accomplish this step, we can leverage libraries such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://pandas.pydata.org/docs/" rel="noopener noreferrer"&gt;&lt;strong&gt;Pandas&lt;/strong&gt;&lt;/a&gt;: In the context of machine learning, Pandas is a crucial tool for handling and analyzing structured data. By leveraging its powerful data structures, such as DataFrames, we can efficiently manipulate and transform datasets. To that end, Pandas provides an extensive range of functions for data cleaning, handling missing values, and performing descriptive statistics. These capabilities are crucial in the data preparation phase of machine learning, enabling us to preprocess the data, remove outliers, impute missing values, and extract meaningful insights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://matplotlib.org/stable/index.html" rel="noopener noreferrer"&gt;&lt;strong&gt;Matplotlib&lt;/strong&gt;&lt;/a&gt;: As a widely-used plotting library, Matplotlib offers a versatile set of visualization techniques, including line plots, scatter plots, and histograms. These visualizations are invaluable in the, help researchers identify patterns, trends, and anomalies in the dataset in the data exploration phase. By visualizing the data, machine learning practitioners can make informed decisions about feature engineering, data preprocessing, and model selection.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;❗ The code examples provided in this article are for demonstration and educational purposes only and should not be considered production-ready.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To get an idea of how we would go about this step, let's consider a situation where we use Pandas to explore and visualize data retrieved from a CSV file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Load the dataset
data = pd.read_csv('sample_data.csv')

# Explore the data
print(data.head()) # Display the first few rows
print(data.describe()) # Get statistical summary
print(data.info()) # Get information about the columns

# Handle missing values
data = data.fillna(0) # Replace missing values with 0

# Visualize the data
data['age'].plot.hist() # Plot a histogram of the age column
data.plot.scatter(x='income', y='purchase') # Create a scatter plot of income vs. purchase

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To obtain high-quality datasets for machine learning, there are several options available. One approach is to download existing datasets from machine learning communities like &lt;a href="https://www.kaggle.com/" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;, where you can find a wide range of &lt;a href="https://www.kaggle.com/datasets" rel="noopener noreferrer"&gt;datasets for free&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Alternatively, if you require a dataset tailored to your specific project, web scraping can be an effective solution. &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;Web scraping&lt;/a&gt; platforms like &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify&lt;/a&gt; offer access to numerous pre-built scrapers in &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;, allowing you to extract data from data-rich websites such as &lt;a href="https://apify.com/compass/crawler-google-places" rel="noopener noreferrer"&gt;Google Maps&lt;/a&gt;, &lt;a href="https://apify.com/bernardo/youtube-scraper" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt;, and &lt;a href="https://apify.com/lexis-solutions/meta-threads-replies-scraper" rel="noopener noreferrer"&gt;Meta's Threads&lt;/a&gt;. Additionally, for those interested in flexing their web scraping skills, &lt;a href="https://youtu.be/8QJetr-BYdQ" rel="noopener noreferrer"&gt;building and deploying custom scrapers&lt;/a&gt; is an option.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Feature engineering and selection&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Feature engineering involves transforming raw data into meaningful features that capture the underlying patterns and relationships. This step often requires domain expertise and creativity. Feature selection aims to identify the most relevant features for the model, reducing complexity and improving efficiency.&lt;/p&gt;

&lt;p&gt;To assist with feature engineering and selection, we can utilize libraries such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/" rel="noopener noreferrer"&gt;&lt;strong&gt;Scikit-learn&lt;/strong&gt;&lt;/a&gt;: Scikit-learn offers a wide range of feature extraction and transformation techniques. It helps us handle different data types, encode categorical variables for numerical representation, scale numerical features, generate new informative features, and perform feature selection to improve model performance. In short, Scikit-learn streamlines feature engineering, making data preprocessing and transformation easier, resulting in more effective machine learning models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://featuretools.alteryx.com/en/stable/" rel="noopener noreferrer"&gt;&lt;strong&gt;Featuretools&lt;/strong&gt;&lt;/a&gt;: Featuretools is a library designed for automated feature engineering in machine learning. It enables us to create new features by combining existing ones, making it easier to capture complex relationships and patterns in the data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To illustrate this step, let's consider a text classification task where we want to classify news articles into different categories. We can use &lt;a href="https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing" rel="noopener noreferrer"&gt;Scikit-learn to preprocess the text data&lt;/a&gt;, convert it into numerical features, and select the most important features using the TF-IDF (Term Frequency-Inverse Document Frequency) method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# `text_data` (a list of article strings) and `labels` (their categories)
# are assumed to be loaded beforehand.

# Preprocess the text data
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_data)

# Select the most important features
selector = SelectKBest(chi2, k=1000)
X_selected = selector.fit_transform(X, labels)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;3. Model building and training&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Model building involves selecting an appropriate algorithm or model architecture to solve the problem at hand. Python offers a wide range of algorithms and models, each suited for different types of problems. Once the model is chosen, it needs to be trained on labeled data to learn the patterns and make accurate predictions.&lt;/p&gt;

&lt;p&gt;To build and train machine learning models, we can rely on libraries such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scikit-learn&lt;/strong&gt;: Scikit-learn can not only help us with step 2 (feature engineering and selection), but it also offers a consistent API that facilitates the training process with functions for model fitting, hyperparameter tuning, and model serialization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.tensorflow.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;TensorFlow&lt;/strong&gt;&lt;/a&gt;: TensorFlow is a popular deep-learning framework that allows us to build and train neural networks for various tasks. It offers a wide range of pre-built neural network architectures and supports custom model creation. TensorFlow provides efficient computation on GPUs and TPUs, enabling faster training for large-scale models.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To illustrate this, let's take a look at how we would implement this step in a real project using &lt;strong&gt;Scikit-learn&lt;/strong&gt; and &lt;strong&gt;TensorFlow&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let's take a classification problem as an example. We can use logistic regression from &lt;a href="https://scikit-learn.org/stable/supervised_learning.html#supervised-learning" rel="noopener noreferrer"&gt;Scikit-learn to train a model&lt;/a&gt; on labeled data and make predictions on new, unseen data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's take it a step further and see how we can use &lt;strong&gt;TensorFlow&lt;/strong&gt; not only to build and train the model but also to make predictions and, finally, deploy it.&lt;/p&gt;

&lt;p&gt;For example, imagine we are building a handwritten digit recognition system. The neural network architecture defined in the code below could be trained on a dataset of handwritten digit images along with their corresponding labels. Once trained, the model can make predictions on new, unseen digit images, accurately classifying them into their respective digits (0 to 9).&lt;/p&gt;

&lt;p&gt;Then, the trained model can be saved and deployed in a production environment, where it can be integrated into a larger application or used as an &lt;a href="https://blog.apify.com/what-is-an-api/" rel="noopener noreferrer"&gt;API&lt;/a&gt; to provide digit recognition functionality to end users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import tensorflow as tf

# Creating a simple neural network
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compiling the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Training the model
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_val, y_val))

# Making predictions
predictions = model.predict(x_test)

# Deploying the model
model.save('model.h5')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;4. Model evaluation and validation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After training the model, it is essential to assess its performance and validate its ability to generalize well on unseen data. Evaluation metrics such as &lt;strong&gt;accuracy&lt;/strong&gt;, &lt;strong&gt;precision&lt;/strong&gt;, &lt;strong&gt;recall&lt;/strong&gt;, and &lt;strong&gt;F1 score&lt;/strong&gt; provide insights into the model's effectiveness. Validation techniques like &lt;strong&gt;cross-validation&lt;/strong&gt; help estimate how well the model will perform in the real world.&lt;/p&gt;

&lt;p&gt;Before we get to the libraries we use for model evaluation and validation, let's understand what exactly the metrics and techniques mentioned above measure and why they are important for building reliable machine-learning models.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Evaluation metrics&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;: Measures the &lt;em&gt;proportion of correctly classified instances&lt;/em&gt; out of the total instances. It is calculated as the number of correct predictions divided by the total number of predictions. Accuracy provides a general measure of how well the model performs overall. For example, in email spam detection, accuracy measures the percentage of emails correctly classified as spam or non-spam.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Precision&lt;/strong&gt;: The &lt;em&gt;proportion of correctly predicted positive instances&lt;/em&gt; out of all instances predicted as positive. It represents the model's ability to avoid false positive errors, indicating how precise the positive predictions are. Precision is important in scenarios where false positives are costly. For instance, in medical diagnosis, precision is crucial to accurately identify patients with a specific disease to avoid unnecessary treatments or interventions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recall&lt;/strong&gt;: Also known as sensitivity or true positive rate, &lt;em&gt;measures the proportion of correctly predicted positive instances out of all actual positive instances&lt;/em&gt;. It captures the model's ability to find all positive instances, avoiding false negatives. Recall is particularly important when the cost of false negatives is high. For example, in fraud detection, recall is essential to identify as many fraudulent transactions as possible, even if it means a higher number of false positives.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;F1 score&lt;/strong&gt;: The F1 score is the harmonic mean of precision and recall. It provides a &lt;em&gt;balanced measure of the model's performance, considering both precision and recall simultaneously&lt;/em&gt;. The F1 score is useful when there is an uneven class distribution or when both precision and recall are equally important. For example, in information retrieval systems, the F1 score is commonly used to evaluate search algorithms, where both precision and recall are crucial in providing accurate and comprehensive search results.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
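&lt;p&gt;To make these definitions concrete, here is a small, self-contained sketch that computes all four metrics by hand for an invented binary classification result (the &lt;code&gt;y_true&lt;/code&gt; and &lt;code&gt;y_pred&lt;/code&gt; lists are purely illustrative):&lt;/p&gt;

```python
y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions (illustrative)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # all 0.75 for this toy example
```

&lt;p&gt;Scikit-learn's &lt;code&gt;accuracy_score&lt;/code&gt;, &lt;code&gt;precision_score&lt;/code&gt;, &lt;code&gt;recall_score&lt;/code&gt;, and &lt;code&gt;f1_score&lt;/code&gt; compute the same quantities for you.&lt;/p&gt;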

&lt;h4&gt;
  
  
  &lt;strong&gt;Validation techniques (cross-validation)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Cross-validation helps assess a model's generalization performance and mitigate the risk of overfitting. It plays a crucial role in machine learning for the following reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance estimation:&lt;/strong&gt; Cross-validation provides a more reliable estimate of how well a model will perform on unseen data by evaluating it on multiple validation sets. This helps determine if the model has learned meaningful patterns or is simply memorizing the training data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hyperparameter tuning&lt;/strong&gt;: Cross-validation aids in selecting the best set of hyperparameters for a model. By comparing performance across different parameter configurations, it helps identify the optimal combination that maximizes performance on unseen data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model selection&lt;/strong&gt;: Cross-validation allows for a fair comparison between different models or algorithms. By evaluating their performance on multiple validation sets, it assists in choosing the most suitable model for the given problem, considering accuracy, precision, recall, or specific requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data leakage prevention&lt;/strong&gt;: Cross-validation mitigates data leakage by creating separate validation sets that are not used during model training. This ensures a fair evaluation and avoids unintentional over-optimization based on the test set.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In real-life applications, cross-validation is particularly valuable in tasks such as credit risk assessment, where accurate predictions on unseen data are essential for decision-making.&lt;/p&gt;

&lt;p&gt;In summary, cross-validation is essential for the development of robust models that generalize well to new instances and provides confidence in their performance outside the training data.&lt;/p&gt;
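&lt;p&gt;The core mechanics of k-fold cross-validation are simple enough to sketch in pure Python: the data is split into &lt;em&gt;k&lt;/em&gt; folds, and each fold serves as the validation set exactly once. This is a simplified sketch; libraries like Scikit-learn also handle shuffling and stratification:&lt;/p&gt;

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    base, extra = divmod(n_samples, k)
    start = 0
    for i in range(k):
        # The first `extra` folds get one additional sample when k does not
        # divide n_samples evenly.
        size = base + (1 if i in range(extra) else 0)
        val = list(range(start, start + size))
        held_out = set(val)
        train = [j for j in range(n_samples) if j not in held_out]
        yield train, val
        start += size

# With 10 samples and 5 folds, every sample appears in validation exactly once.
folds = list(kfold_indices(10, 5))
print([val for _train, val in folds])  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```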

&lt;p&gt;To evaluate and validate machine learning models, we can utilize libraries such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scikit-learn&lt;/strong&gt;: Scikit-learn offers a wide range of evaluation metrics for classification, regression, and clustering tasks. It provides functions for calculating accuracy, precision, recall, F1 score, and more. &lt;a href="https://scikit-learn.org/stable/modules/cross_validation.html" rel="noopener noreferrer"&gt;Scikit-learn also includes techniques for cross-validation&lt;/a&gt;, which allows for robust performance estimation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.scikit-yb.org/en/latest/" rel="noopener noreferrer"&gt;&lt;strong&gt;Yellowbrick&lt;/strong&gt;&lt;/a&gt;: Yellowbrick is a visualization library that integrates with Scikit-learn and provides visual tools for model evaluation and diagnostics. It offers visualizations for classification reports, learning curves, confusion matrices, and feature importances, aiding in the analysis of model performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, let's take a look at how we can use some of &lt;a href="https://scikit-learn.org/stable/model_selection.html#model-selection" rel="noopener noreferrer"&gt;Scikit-learn's evaluation metrics and validation techniques&lt;/a&gt;. Remember our previous example of a classification model? We can use Scikit-learn to evaluate the model's performance by calculating accuracy, precision, recall, and F1 score, and while we are at it, we can also use cross-validation to estimate the model's performance on unseen data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score

# `y_true`/`y_pred` and the fitted `model` with data `X`, `y` are assumed
# to come from the previous classification example.

# Evaluate the model
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Perform cross-validation
scores = cross_val_score(model, X, y, cv=5)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;5. Model deployment and monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once a satisfactory model is obtained, the exciting part begins: deploying it to production environments for real-world usage. This step involves integrating the model into an application or system and ensuring its performance is continuously monitored and optimized over time.&lt;/p&gt;

&lt;p&gt;To deploy and monitor machine learning models, we can rely on libraries such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://flask.palletsprojects.com/en/2.3.x/" rel="noopener noreferrer"&gt;&lt;strong&gt;Flask&lt;/strong&gt;&lt;/a&gt;: Flask is a lightweight web framework that allows us to build APIs for serving machine learning models. It provides a simple and scalable way to expose our models as web services, enabling seamless integration into applications or systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.tensorflow.org/tensorboard/get_started" rel="noopener noreferrer"&gt;&lt;strong&gt;TensorBoard&lt;/strong&gt;&lt;/a&gt;: TensorBoard is a powerful visualization tool that comes bundled with TensorFlow. It helps monitor and analyze the performance of deep learning models by providing interactive visualizations of metrics, model architectures, and training progress.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://grafana.com/docs/grafana/latest/getting-started/get-started-grafana-prometheus/" rel="noopener noreferrer"&gt;&lt;strong&gt;Prometheus and&lt;/strong&gt;  &lt;strong&gt;Grafana&lt;/strong&gt;&lt;/a&gt;: Prometheus is a monitoring and alerting toolkit, while Grafana is a visualization tool. Together, they offer a robust solution for monitoring the performance and health of machine learning models in real time, providing valuable insights and enabling proactive optimization.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The choice of deployment and monitoring tools for machine learning models depends on the project and the libraries you are comfortable with. For example, if you are building TensorFlow models, using TensorBoard to monitor them would be a natural option.&lt;/p&gt;

&lt;p&gt;But we are also not restricted to choosing a single library. To deploy and monitor machine learning models, we can use a combination of libraries. For instance, we can use Flask to create an API to serve the model predictions, while using Prometheus to access its monitoring and alerting capabilities, and Grafana for visualization of performance metrics. Together, they provide a robust solution for deploying and monitoring machine learning models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from flask import Flask, request, jsonify
import prometheus_client
from prometheus_flask_exporter import PrometheusMetrics
import json

app = Flask( __name__ )
metrics = PrometheusMetrics(app)

@app.route('/predict', methods=['POST'])
def predict():
    data = json.loads(request.data)
    # Process the data and make predictions
    # (assumes a trained `model` object was loaded at startup)
    predictions = model.predict(data)
    return jsonify(predictions)

if __name__ == '__main__':
    app.run()

# Monitor the model using Prometheus and Grafana...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;What's next in machine learning and Python?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this article, we have explored the world of machine learning with Python and discussed some of the best libraries available for developing machine learning models. Python's simplicity, flexibility, and extensive library ecosystem make it an ideal choice for both beginners and experienced developers venturing into the field of machine learning.&lt;/p&gt;

&lt;p&gt;As you embark on your machine-learning journey with Python, we encourage you to explore these libraries further. Dive into their documentation, experiment with different algorithms and techniques, and leverage the vast online resources and communities available to you.&lt;/p&gt;

&lt;p&gt;Remember, machine learning is a rapidly evolving field, and staying up to date with the latest advancements and techniques is crucial. If you're interested in continuing, why not try training your own language model to create a personalized ChatGPT using &lt;a href="https://blog.apify.com/how-to-use-langchain/" rel="noopener noreferrer"&gt;LangChain, OpenAI, Pinecone, and Apify&lt;/a&gt;?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>ScrapingBee review: top web scraping API?</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Sun, 23 Jul 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/percivalvillal3/scrapingbee-review-top-web-scraping-api-50a</link>
      <guid>https://dev.to/percivalvillal3/scrapingbee-review-top-web-scraping-api-50a</guid>
      <description>&lt;h3&gt;
  
  
  There are lots of web scraping services out there, but which is the right choice for you? We look at ScrapingBee to see what it offers the dev looking to get data.
&lt;/h3&gt;

&lt;p&gt;Whether you're building an application, conducting market research, or analyzing trends, accessing timely and accurate data is essential. However, identifying the most efficient and reliable methods for obtaining this data can be a daunting task. Should you build your own web scrapers? Use an existing web scraping API? Or go for something in between?&lt;/p&gt;

&lt;p&gt;If you've spent some time googling around for an answer to those questions, then you've probably come across &lt;a href="https://www.scrapingbee.com/"&gt;ScrapingBee&lt;/a&gt;. But now a different question emerges: how do I know if this service is right for my use case? Well, that's precisely what we will try to answer in this article. We will review ScrapingBee's service, analyze the different kinds of tools it provides, and weigh the pros and cons of using the service.&lt;/p&gt;

&lt;p&gt;So, let's get started and see if ScrapingBee is worth using for your web scraping project.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;ScrapingBee: what are the pros and cons?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Benefits: user-friendly web scraping API&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ScrapingBee provides a user-friendly web scraping API that offers various features required for large-scale &lt;a href="https://apify.com/web-scraping"&gt;web scraping&lt;/a&gt; and to prevent getting blocked, including proxies and JavaScript rendering. It is recommended for developers seeking a simple solution for extracting data, which can be seamlessly integrated with their existing code for data processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Limitations: limited control and no integrated cloud solution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ScrapingBee's straightforward approach may be limiting for developers with advanced web scraping knowledge, as they are required to follow the rules set by ScrapingBee's API and have restricted control over the entire data extraction process.&lt;/p&gt;

&lt;p&gt;Additionally, ScrapingBee lacks an integrated solution for managing data extraction flows in the cloud. This can be inconvenient since you would need to find a separate cloud provider or set up your own infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;ScrapingBee Proxy and API credit consumption&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When it comes to large-scale data extraction, proxies are essential for circumventing anti-bot systems used by modern websites. However, utilizing proxies can significantly increase the cost of your web scraping activities. ScrapingBee's API provides several proxy options: Rotating Proxy (default), Premium Proxy, Stealth Proxy, or the ability to use your own proxy. Here is an overview of how the usage of these proxies impacts your API Credit consumption within their system:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Feature used&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;API credit cost/request&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rotating Proxy without JavaScript rendering&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rotating Proxy with JavaScript rendering (default)&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Premium Proxy without JavaScript rendering&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Premium Proxy with JavaScript rendering&lt;/td&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stealth Proxy with JavaScript rendering (only option available)&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;ScrapingBee pricing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The pricing of a service often plays a crucial role in our decision-making process. Fortunately, ScrapingBee provides a freemium model that allows users to try their service for free with 1,000 API credits. Their paid plans range from $49/month to $599+/month for the business plan. The key distinction between these plans is the allocation of API credits, with the base plan offering 150,000 credits and the business plans providing 8,000,000+ credits, depending on your needs. Additionally, the more expensive plans offer higher limits for concurrent requests and improved support.&lt;/p&gt;
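&lt;p&gt;To make these numbers concrete, here is a rough, hypothetical cost calculator combining the credit table with the base plan's pricing ($49 for 150,000 credits, per the figures above; the plan name used below is illustrative, and prices may have changed since publication):&lt;/p&gt;

```python
# Back-of-the-envelope cost estimate based on the figures quoted in this
# article. The "base" plan label and all numbers are illustrative, not
# an authoritative ScrapingBee price list.
PLANS = {
    "base": {"price_usd": 49, "credits": 150_000},
}

CREDITS_PER_REQUEST = {
    "rotating": 1,        # Rotating Proxy, no JS rendering
    "rotating_js": 5,     # Rotating Proxy + JS rendering (default)
    "premium": 10,        # Premium Proxy, no JS rendering
    "premium_js": 25,     # Premium Proxy + JS rendering
    "stealth_js": 75,     # Stealth Proxy (JS rendering only)
}

def cost_per_1000_requests(plan: str, feature: str) -> float:
    """Approximate USD cost of 1,000 requests for a plan/feature combo."""
    p = PLANS[plan]
    price_per_credit = p["price_usd"] / p["credits"]
    return round(1000 * CREDITS_PER_REQUEST[feature] * price_per_credit, 2)

print(cost_per_1000_requests("base", "rotating_js"))  # default configuration
print(cost_per_1000_requests("base", "stealth_js"))   # most expensive option
```

&lt;p&gt;On those figures, 1,000 default requests (Rotating Proxy with JavaScript rendering) cost roughly $1.63, while 1,000 Stealth Proxy requests cost about $24.50, which makes the credit multipliers in the table easy to feel in dollar terms.&lt;/p&gt;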

&lt;h2&gt;
  
  
  &lt;strong&gt;ScrapingBee scraping test&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;ScrapingBee offers a versatile data extraction API as one of its primary services, allowing users to extract data from a wide range of web pages. To evaluate its capabilities, I decided to scrape &lt;a href="http://Amazon.com"&gt;Amazon.com&lt;/a&gt;, a well-known website notorious for implementing sophisticated anti-bot systems.&lt;/p&gt;

&lt;p&gt;Navigating through ScrapingBee's API was straightforward, and the ScrapingBee documentation provided clear and updated information. With just a few lines of code, as shown in the example below, I successfully extracted the titles, prices, and links of the iPhones listed on the first page of &lt;a href="http://Amazon.com"&gt;Amazon.com&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from scrapingbee import ScrapingBeeClient # Importing SPB's client
client = ScrapingBeeClient(api_key='YOUR_API_KEY')

response = client.get("https://www.amazon.com/s?k=iphone&amp;amp;crid=1BIGRK4NGFLDS&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;ref=nb_sb_noss_2", params={
'extract_rules':{
                 "product-titles": {
                     "selector": "div.a-section.a-spacing-none.puis-padding-right-small.s-title-instructions-style &amp;gt; h2 &amp;gt; a &amp;gt; span",
                     "type": "list",
                 },
                  "product-prices": {
                      "selector": "div.a-section.a-spacing-none.a-spacing-top-micro.puis-price-instructions-style &amp;gt; div &amp;gt; a &amp;gt; span &amp;gt; span.a-offscreen",
                      "type": "list",
                  },
                  "product-links": {
                     "selector": "div.a-section.a-spacing-none.puis-padding-right-small.s-title-instructions-style &amp;gt; h2 &amp;gt; a",
                     "type": "list",
                     "output": "@href"
                 },

                }
})

if response.ok:
    print(response.json())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you want to test the provided code yourself, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create a ScrapingBee account.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replace the placeholder text in the code with your own ScrapingBee API key.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you have completed these steps and run the code, you can expect to see results similar to the example below printed to your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
   "product-titles":[
      "Apple iPhone 11, 64GB, Black - Unlocked (Renewed)",
      "Apple iPhone SE (2nd Generation), 64GB, Red - Unlocked (Renewed)",
      "Apple iPhone 12, 64GB, White - Fully Unlocked (Renewed)",
      "Apple iPhone 8, 64GB, Gold - Unlocked (Renewed)",
      "Apple iPhone 12 Mini, 64GB, Black - Unlocked (Renewed)",
      "Apple iPhone X, US Version, 64GB, Silver - Unlocked (Renewed)",
      "Apple iPhone XR, 64GB, Black - Unlocked (Renewed)",
      "Apple iPhone XS, US Version, 64GB, Space Gray - Unlocked (Renewed)",
      "Apple iPhone 8 Plus, US Version, 64GB, Gold - Unlocked (Renewed)",
      "Apple iPhone 14 Pro Max, 128GB, Space Black - Unlocked (Renewed)",
      "Apple iPhone 13, 256GB, Midnight - Unlocked (Renewed)",
      "Apple iPhone 11 Pro, 64GB, Midnight Green - Unlocked (Renewed)",
      "iPhone 13 Mini, 128GB, Pink - Unlocked (Renewed)",
      "Apple iPhone 12 Pro, 256GB, Gold - Fully Unlocked (Renewed)",
      "Apple iPhone SE 3rd Gen, 64GB, Midnight - Unlocked (Renewed)",
      "Apple iPhone 14, 512GB, Purple - Unlocked (Renewed Premium)"
   ],
   "product-prices":[
      "$305.55",
      "$147.00",
      "$394.95",
      "$137.99",
      "$308.99",
      "$223.00",
      "$214.75",
      "$232.00",
      "$189.99",
      "$1,019.99",
      "$629.99",
      "$388.00",
      "$494.99",
      "$584.99",
      "$257.99",
      "$875.00"
   ],
   "product-links":[
      "/Apple-iPhone-11-64GB-Black/dp/B07ZPKN6YR/ref=sr_1_1?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-1",
      "/Apple-iPhone-SE-64GB-Red/dp/B088N8TF64/ref=sr_1_2?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-2",
      "/Apple-iPhone-12-64GB-White/dp/B08PPBQM23/ref=sr_1_3?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-3",
      "/Apple-iPhone-Fully-Unlocked-64GB/dp/B0775717ZP/ref=sr_1_4?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-4",
      "/Apple-iPhone-12-Mini-Black/dp/B08PPDJWC8/ref=sr_1_5?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-5",
      "/Apple-iPhone-Fully-Unlocked-64GB/dp/B07C357FSJ/ref=sr_1_6?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-6",
      "/Apple-iPhone-XR-Fully-Unlocked/dp/B07P6Y7954/ref=sr_1_7?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-7",
      "/Apple-iPhone-64GB-Space-Gray/dp/B07SC58QBW/ref=sr_1_8?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-8",
      "/Apple-iPhone-Plus-Fully-Unlocked/dp/B07757LZ1J/ref=sr_1_9?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-9",
      "/Apple-iPhone-14-Pro-Max/dp/B0BN94DL3R/ref=sr_1_10?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-10",
      "/Apple-iPhone-13-256GB-Midnight/dp/B09LNCVCKW/ref=sr_1_11?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-11",
      "/Apple-iPhone-64GB-Midnight-Green/dp/B07ZQRMWVB/ref=sr_1_12?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-12",
      "/Apple-iPhone-13-Mini-128GB/dp/B09LKF2RPP/ref=sr_1_13?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-13",
      "/Apple-iPhone-Pro-256GB-Gold/dp/B08PN7R2MZ/ref=sr_1_14?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-14",
      "/Apple-iPhone-SE-3rd-Midnight/dp/B0BDY71GRG/ref=sr_1_15?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-15",
      "/Apple-iPhone-14-512GB-Purple/dp/B0BYKX35NT/ref=sr_1_16?keywords=iphone&amp;amp;qid=1688323279&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;sr=8-16"
   ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this specific request, using ScrapingBee's API with the default configurations (Rotating Proxy and JavaScript rendering), I was charged 5 API credits. Despite making multiple requests to &lt;a href="http://Amazon.com"&gt;Amazon.com&lt;/a&gt;, I did not encounter any blocking issues when using the API's default settings, which is a good sign for the service's reliability.&lt;/p&gt;

&lt;p&gt;However, as our operation scales up, it is reasonable to assume that we would require more reliable and costly proxies to sustain this level of performance. So, let's see how we can enable different proxy options using ScrapingBee's API.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Using proxies in ScrapingBee&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Enabling proxies in ScrapingBee is straightforward. To use a specific proxy type, you just need to include the corresponding parameter and set it to "True". For instance, to utilize the Premium Proxy, you would add &lt;code&gt;"premium_proxy=True"&lt;/code&gt; to your response parameters, as shown below:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from scrapingbee import ScrapingBeeClient # Importing SPB's client
client = ScrapingBeeClient(api_key='YOUR_API_KEY')

response = client.get("https://www.amazon.com/s?k=iphone&amp;amp;crid=1BIGRK4NGFLDS&amp;amp;sprefix=ipho%2Caps%2C278&amp;amp;ref=nb_sb_noss_2", params={
# Choose the proxy type you want by adding the premium_proxy, stealth_proxy or own_proxy parameters
'premium_proxy': 'True',
'extract_rules':{
                 "product-titles": {
                     "selector": "div.a-section.a-spacing-none.puis-padding-right-small.s-title-instructions-style &amp;gt; h2 &amp;gt; a &amp;gt; span",
                     "type": "list",
                 },
                  "product-prices": {
                      "selector": "div.a-section.a-spacing-none.a-spacing-top-micro.puis-price-instructions-style &amp;gt; div &amp;gt; a &amp;gt; span &amp;gt; span.a-offscreen",
                      "type": "list",
                  },
                  "product-links": {
                     "selector": "div.a-section.a-spacing-none.puis-padding-right-small.s-title-instructions-style &amp;gt; h2 &amp;gt; a",
                     "type": "list",
                     "output": "@href"
                 },

                }
})

if response.ok:
    print(response.json())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It's worth mentioning that enabling this option can enhance the reliability of our data extraction process by reducing the risk of our bot being blocked. However, it's important to note that this improvement comes at a higher cost per request.&lt;/p&gt;

&lt;p&gt;For instance, in my case, using the Premium Proxy and JavaScript rendering for this request consumed 25 credits, which is a fivefold increase compared to the 5 credits spent when using the default Proxy rotation configuration.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Limitations of the ScrapingBee web scraping API&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Although I was pleasantly surprised by the ease of extracting the desired data and the low incidence of blocked requests, I found it frustrating that the API had limitations when it came to more complex operations. For instance, if I were building my own scraper, I could easily handle Amazon's pagination and extract data from all the search results while maintaining complete control over the scraper's behavior. However, achieving a similar outcome using ScrapingBee's API was not immediately apparent, and their documentation lacked information on this matter.&lt;/p&gt;
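&lt;p&gt;One workaround is to drive pagination yourself: generate one search URL per results page and issue a separate API request for each. The sketch below assumes Amazon exposes the results page number through a &lt;code&gt;page&lt;/code&gt; query parameter (an assumption about Amazon's URL structure, not something ScrapingBee's API documents or guarantees):&lt;/p&gt;

```python
# Hypothetical pagination helper: ScrapingBee's API scrapes one URL per
# request, so we build the paginated search URLs ourselves and would then
# call client.get() on each one with the same extract_rules as before.
from urllib.parse import urlencode

def amazon_search_urls(keyword: str, pages: int) -> list[str]:
    """Build one Amazon search URL per results page (URL shape assumed)."""
    base = "https://www.amazon.com/s"
    return [f"{base}?{urlencode({'k': keyword, 'page': n})}"
            for n in range(1, pages + 1)]

urls = amazon_search_urls("iphone", 3)
for url in urls:
    print(url)
    # response = client.get(url, params={'extract_rules': ...})
    # collect response.json() into your dataset here
```

&lt;p&gt;Keep in mind that each page still consumes API credits according to the table above, so paginating multiplies your credit consumption accordingly.&lt;/p&gt;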

&lt;p&gt;Furthermore, the simplicity of ScrapingBee's pricing system has both positive and negative aspects. It is reassuring to know the exact number of credits each request will cost based on the chosen parameters. However, I would have appreciated a more detailed breakdown of my usage and charges within ScrapingBee's dashboard for better transparency.&lt;/p&gt;

&lt;p&gt;Lastly, I missed having convenient access to an integrated cloud infrastructure like &lt;a href="https://apify.com/"&gt;Apify&lt;/a&gt; or Zyte. While I understand this is not ScrapingBee's primary focus, having an all-in-one solution for my web scraping needs would save considerable time and effort compared with searching for and paying for separate services to host my data extraction workflows.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion and final considerations&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In conclusion, the ScrapingBee Data Extraction API offers a reliable solution for developers seeking a straightforward method to extract data from websites without the complexities of building a scraper from scratch. However, if you require a more comprehensive solution with a wider range of pre-built features and greater control over your applications and data extraction process, relying solely on ScrapingBee may not fully meet your needs.&lt;/p&gt;

&lt;p&gt;Finally, I want to emphasize that this post serves as an introductory analysis and guide to ScrapingBee's service, assisting developers in determining if it is the right choice for them. It is important to note that not all features provided by their API have been explored in this article.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is the first in a series of articles we commissioned from an external developer (although Percival is a former Apifier). We want to create unbiased reviews of other web scraping platforms and companies as part of our continued evaluation of the web scraping industry.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you find yourself intrigued by ScrapingBee, I encourage you to further &lt;a href="https://www.scrapingbee.com/documentation/"&gt;explore the ScrapingBee documentation&lt;/a&gt; for a more in-depth understanding of the platform's capabilities.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/best-web-scraping-api/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--1CtZqB0G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/05/neon-datascape-illustrating-web-scraping-api.jpg" height="449" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/best-web-scraping-api/" rel="noopener noreferrer" class="c-link"&gt;
          Best web scraping APIs in 2023
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          We explore 10 top-notch web scraping API options.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>webscraping</category>
    </item>
    <item>
      <title>How to automate forms with JavaScript and Playwright</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Sun, 16 Jul 2023 22:00:00 +0000</pubDate>
      <link>https://dev.to/apify/how-to-automate-forms-with-javascript-and-playwright-4077</link>
      <guid>https://dev.to/apify/how-to-automate-forms-with-javascript-and-playwright-4077</guid>
      <description>&lt;p&gt;Whether we need to collect data from multiple sources, perform form testing, or automate mundane form submissions, learning how to submit forms with Playwright and JavaScript can help us automate these tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0glji67qv9s6sj9zv3rl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0glji67qv9s6sj9zv3rl.png" alt="Automating forms is easy with Playwright: follow our guide to learn why and how" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Automating forms is easy with Playwright: follow our guide to learn why and how&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why automate forms?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we get into the technical details, let's consider a few scenarios where form automation can be beneficial:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data collection&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Imagine we need to gather data from various online marketplaces, including product details, prices, and more. The traditional manual approach is time-consuming and error-prone, requiring navigation to each product page and form filling. However, automation tools like Playwright enable us to automatically navigate these marketplaces, extract the necessary information, and populate a database, saving time and reducing errors.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;a href="https://blog.apify.com/how-to-scrape-the-web-with-playwright-ece1ced75f73/" rel="noopener noreferrer"&gt;&lt;strong&gt;How to scrape the web with Playwright in 2023&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Form testing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When &lt;a href="https://citrusbug.com/outsourcing-services/web-application-development-company" rel="noopener noreferrer"&gt;developing web applications&lt;/a&gt;, testing the functionality and behavior of forms is crucial. Manually testing each scenario can be repetitive and inefficient. By automating testing with form submissions, input validation testing, and error handling, we ensure the smooth operation of our applications.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔍&lt;/strong&gt; &lt;a href="https://blog.apify.com/11-best-automated-browser-testing-tools-for-developers/" rel="noopener noreferrer"&gt;&lt;strong&gt;11 best automated browser testing tools for developers&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/11-best-automated-browser-testing-tools-for-developers/" rel="noopener noreferrer"&gt;Read about automated browser testing and the best tools for testing your web apps.&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Repetitive tasks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Many online services require filling in forms repeatedly, such as submitting support requests, job applications, or entering sweepstakes. Again, automating these repetitive tasks will save time and effort. This can be done for individuals, such as if you want to repeatedly enter your details, or at scale, for companies or &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; projects.&lt;/p&gt;

&lt;p&gt;By the end of this article, you will have a solid understanding of how to automate forms using Playwright. You can then apply the techniques we will learn here to your own projects. So, let's get coding!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What you'll need to start automating forms&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before getting started, there are a few prerequisites you should have in place:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Basic understanding of HTML forms and browser DevTools:&lt;/strong&gt; It's helpful to have a fundamental understanding of HTML forms and their structure as well as being able to use your browser DevTools to inspect elements on a webpage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JavaScript knowledge:&lt;/strong&gt; Since we'll be using JavaScript and Node.js as the programming language for our project, it's essential to have at least a basic understanding of concepts such as variables, functions, and asynchronous programming in JavaScript.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Node.js installation:&lt;/strong&gt; Ensure that Node.js is installed on your local machine. You can download and install the latest version of Node.js from the official &lt;a href="https://nodejs.org/" rel="noopener noreferrer"&gt;Node.js website&lt;/a&gt; and use &lt;a href="https://blog.apify.com/how-to-install-nodejs/" rel="noopener noreferrer"&gt;our guide on how to install it correctly&lt;/a&gt;. Node.js will allow us to run JavaScript code outside of the browser environment.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need to have any experience with Playwright, but you might like to &lt;a href="https://blog.apify.com/what-is-playwright/" rel="noopener noreferrer"&gt;find out more about it&lt;/a&gt; before you start.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting up your form automation project&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Create a new project directory&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Choose a suitable location on your computer and create a new directory for your project. You can name it anything you like, for example &lt;code&gt;form-automation-project&lt;/code&gt;. So, let's open the terminal, navigate to the desired location, and use the following command to create the directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir form-automation-project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Initialize a new Node.js project&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Change into the newly created project directory and initialize a new Node.js project by running the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd form-automation-project
npm init -y
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command generates a &lt;code&gt;package.json&lt;/code&gt; file that will keep track of the project's dependencies and configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Install Playwright&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With the project initialized, we can now install the Playwright library. In your terminal or command prompt, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install playwright
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Update package.json to use module syntax&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To enable the use of the JavaScript module syntax when building our project, add &lt;code&gt;"type": "module"&lt;/code&gt; to your &lt;code&gt;package.json&lt;/code&gt; file. This syntax allows us to take advantage of the ES module system, which provides a more standardized and modern approach to organizing and importing JavaScript code.&lt;/p&gt;
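&lt;p&gt;With that change applied, a minimal &lt;code&gt;package.json&lt;/code&gt; would look something like this (the version numbers shown are illustrative):&lt;/p&gt;

```json
{
  "name": "form-automation-project",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "playwright": "^1.35.0"
  }
}
```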

&lt;h2&gt;
  
  
  &lt;strong&gt;Launching the browser and navigating to the form&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that our project is set up, we can begin automating the form by launching a browser instance and navigating to the target webpage containing the form.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a new JavaScript file:&lt;/strong&gt; In the project directory, let's create a new JavaScript file to hold the logic for our bot. You can name it &lt;code&gt;bot.js&lt;/code&gt; or choose any other suitable name.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Import Playwright modules&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Open the newly created &lt;code&gt;bot.js&lt;/code&gt; file in your preferred code editor. At the top of the file, let's import the necessary Playwright modules using the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { chromium } from 'playwright';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This imports the &lt;code&gt;chromium&lt;/code&gt; module from Playwright, which allows us to automate Chromium-based browsers like Google Chrome.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Launch a browser instance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Below the import statement, add the following code to launch a new browser instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async function submitForm() =&amp;gt; {
  const browser = await chromium.launch({ headless: false });
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code uses an asynchronous function to launch a browser instance with Playwright's &lt;code&gt;launch()&lt;/code&gt; method. The &lt;code&gt;browser&lt;/code&gt; variable will hold the browser instance for further interactions. Note that we are also passing the parameter &lt;code&gt;headless: false&lt;/code&gt; so we can see our bot in action.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Navigate to the target web page&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To navigate to the web page containing the form, add the following code after launching the browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async function submitForm() =&amp;gt; {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto('https://www.example.com/form'); // Replace with the actual URL of the form
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this code, we create a new context and a new page within that context. Then, we use the &lt;code&gt;goto()&lt;/code&gt; method to navigate to the URL of the webpage containing the form. Make sure to replace &lt;code&gt;'https://www.example.com/form'&lt;/code&gt; with the actual URL of the form you want to automate.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📄 If you would like to test your code on an actual form, you can follow along with this &lt;a href="https://docs.google.com/forms/d/e/1FAIpQLScMXkPc5uVQaPTFrHNY1Sb4C6n0WCxd2R6gk5ZVhh4KOvvt-Q/viewform" rel="noopener noreferrer"&gt;example Google Form&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Locating and filling form fields&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now that we have successfully navigated to the webpage containing the form, we can proceed to locate the form fields and fill them in with our desired values.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Locate form fields&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To interact with form fields, we need to locate them on the web page. Inspect the HTML structure of the form to identify the attributes or selectors we can use to locate each field. Common attributes include &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;id&lt;/code&gt;, and &lt;code&gt;class&lt;/code&gt;. For example, let's assume we have an input field with the name attribute "firstName". We can locate it using Playwright's &lt;code&gt;page.locator()&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const firstNameField = page.locator('input[name="firstName"]');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, in our Google form, we first have to inspect the input field we want to fill in order to find the selectors we can then use to target it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uz2sowrwp2its8naixl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uz2sowrwp2its8naixl.png" alt="Form Field Selector" width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And now we can target it by using the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const firstNameField = page.locator('input[aria-labelledby="i1"]')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Repeat this approach to target all the form fields you need to fill in. For example, these are the selectors for each of the fields present in our dummy Google form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const firstNameField = page.locator('input[aria-labelledby="i1"]');
const emailField = page.locator('input[aria-labelledby="i5"]');
const addressField = page.locator('textarea[aria-labelledby="i9"]');
const phoneField = page.locator('input[aria-labelledby="i13"]');
const commentsField = page.locator('textarea[aria-labelledby="i17"]');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Fill form fields&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once we have located a form field, we can fill it with the desired value using Playwright's &lt;code&gt;fill()&lt;/code&gt; method. Add the following code after locating the field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await firstNameField.fill('John');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;'John'&lt;/code&gt; with the value you want to fill in the field.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handle different field types&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Form fields can vary in type, such as text inputs, checkboxes, radio buttons, dropdown menus, and file upload fields. Use Playwright's appropriate methods to interact with each field type. For example, to check a checkbox, use the &lt;code&gt;check()&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const checkboxField = page.locator('input[name="acceptTerms"]');
await checkboxField.check();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember to adjust the code based on the specific field types and actions you want to perform.&lt;/p&gt;
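&lt;p&gt;As a sketch of a few other common field types (the selectors below are hypothetical, not from our dummy form), helpers might look like this:&lt;/p&gt;

```javascript
// Hypothetical selectors for illustration; adjust them to your form's markup.
async function fillOtherFieldTypes(page) {
  // <select> dropdowns are filled with selectOption() rather than fill()
  await page.locator('select[name="country"]').selectOption('US');

  // Radio buttons use check(), just like checkboxes
  await page.locator('input[type="radio"][value="premium"]').check();
}
```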

&lt;p&gt;For instance, let's fill in the details for a fictitious John in our dummy form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Fill the form fields
await firstNameField.fill('John');
await emailField.fill('john@gmail.com');
await addressField.fill("John's Street");
await phoneField.fill('11111111');
await commentsField.fill('This form was submitted automatically.');
await checkboxField.check();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Submit the form&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once we have filled in all the necessary form fields, it's time to submit the form. Locate the submit button and click it using Playwright's &lt;code&gt;click()&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const submitButton = page.locator('button[type="submit"]');
await submitButton.click();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Close the browser&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After we successfully submit our form, it is important to explicitly tell Playwright to close the browser instance; otherwise, it would remain open even after the form is submitted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Close the browser
await browser.close();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Final code for automating the form&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;And here is what the final code for automating our dummy form looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { chromium } from 'playwright';

async function submitForm() {
    const browser = await chromium.launch({ headless: false });
    const context = await browser.newContext();
    const page = await context.newPage();

    await page.goto('https://forms.gle/7rhchFiZF2faMQxz7'); // Replace with the actual URL of the form

    await page.waitForSelector('div.lRwqcd &amp;gt; div', {
        state: 'visible',
    }); // Wait for form element to be visible on the page before proceeding

    // Select the form fields we want to target
    const firstNameField = page.locator('input[aria-labelledby="i1"]');
    const emailField = page.locator('input[aria-labelledby="i5"]');
    const addressField = page.locator('textarea[aria-labelledby="i9"]');
    const phoneField = page.locator('input[aria-labelledby="i13"]');
    const commentsField = page.locator('textarea[aria-labelledby="i17"]');
    const checkboxField = page.locator('div#i26');
    const submitButton = page.locator('div.lRwqcd &amp;gt; div');

    // Fill the form fields
    await firstNameField.fill('John');
    await emailField.fill('john@gmail.com');
    await addressField.fill("John's Street");
    await phoneField.fill('11111111');
    await commentsField.fill('This form was submitted automatically.');
    await checkboxField.check();

    // Submit the form
    await submitButton.click();

    // Close the browser
    await browser.close();
}

submitForm();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that in the final version of our bot, I added an extra line of code to explicitly wait for a specific element on the page to load before proceeding with the rest of the code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await page.waitForSelector('div.lRwqcd &amp;gt; div', {
        state: 'visible',
    });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step is not always necessary, but it is possible for the bot to attempt to "act" before the page is fully loaded. This can lead to an error due to the bot's inability to interact with the target element.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Automating multiple form submissions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In some cases, you may need to automate multiple form submissions, such as when you have a batch of data to process or want to simulate user interactions. Here's how you can automate multiple form submissions using Playwright:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Encapsulate form submission logic&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To automate the submission of multiple forms, it is beneficial to encapsulate the form submission logic into a reusable function. We have already done this in the previous section, so we can continue using the form function we have created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async function submitForm() {
  // Locate and fill form fields
  // Submit the form
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Use a loop&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Determine the number of times you want to submit the form and use a loop to automate the process. For example, if we want to submit the form five times, we can use a &lt;code&gt;for&lt;/code&gt; loop as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (let i = 0; i &amp;lt; 5; i++) {
  await submitForm();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adjust the loop conditions and the number of iterations based on your requirements.&lt;/p&gt;
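&lt;p&gt;If each submission should carry different values, one approach (a sketch; &lt;code&gt;submitForm&lt;/code&gt; is assumed here to accept the data to fill in, and the field names are hypothetical) is to loop over an array of records:&lt;/p&gt;

```javascript
// Sketch: one submission per record. Assumes a submitForm(record) variant
// that fills the form with the given values; field names are hypothetical.
const records = [
  { firstName: 'John', email: 'john@gmail.com' },
  { firstName: 'Jane', email: 'jane@gmail.com' },
];

async function submitAll(records, submitForm) {
  for (const record of records) {
    await submitForm(record); // submissions run one after another
  }
}
```

&lt;p&gt;Calling &lt;code&gt;await submitAll(records, submitForm)&lt;/code&gt; then submits the form once for every record.&lt;/p&gt;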

&lt;h3&gt;
  
  
  &lt;strong&gt;Optional: add delays between submissions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In some cases, it may be necessary to introduce delays between form submissions to mimic user behavior or account for server response times. We can use Playwright's &lt;code&gt;waitForTimeout&lt;/code&gt; method to add delays between submissions within the loop.&lt;/p&gt;

&lt;p&gt;For example, we could add a 2-second delay right before we tell Playwright to close:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Wait before closing the browser
await page.waitForTimeout(2000); // Wait for 2 seconds before the next submission

// Close the browser
await browser.close();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adjust the delay duration as needed.&lt;/p&gt;

&lt;p&gt;By encapsulating the form submission logic in a function and using a loop, we can automate as many form submissions as we want!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Error handling and debugging&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When automating forms, it's essential to handle potential errors and have mechanisms in place for debugging. Here are some techniques we can use to do that:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Try-catch blocks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We can wrap our code inside try-catch blocks to catch and handle any errors that may occur during form automation. This allows us to gracefully handle exceptions and prevent our automation script from crashing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try {
  // Form automation code
} catch (error) {
  console.error('An error occurred:', error);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By logging the error to the console or reporting it in some other way, you can quickly identify and diagnose issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Logging and debugging statements&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;console.log()&lt;/code&gt; statements strategically throughout the code to output useful information. These statements can help us track the execution flow, inspect variable values, and identify potential issues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;console.log('Filling in the first name field...');
// Your code for filling in the first name field
console.log('First name field filled successfully.');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By logging key steps or variables, you can gain insights into what's happening during the automation process.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Taking screenshots&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Playwright allows us to take screenshots of the browser at any point during the automation process. Capture screenshots to visualize the state of the page and potentially identify issues or unexpected behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await page.screenshot({ path: 'screenshot.png' });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the screenshot to a file for later examination.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Inspecting network traffic&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Playwright provides tools for inspecting network traffic, which can be useful for debugging. We can intercept network requests, analyze responses, and verify data being sent and received.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;page.on('response', (response) =&amp;gt; {
  console.log('Received response:', response.url());
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the &lt;code&gt;page.on()&lt;/code&gt; method to listen for specific events related to network traffic.&lt;/p&gt;
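&lt;p&gt;As a minimal sketch, a helper that logs every request and response a page makes could look like this (&lt;code&gt;page&lt;/code&gt; is assumed to be a Playwright &lt;code&gt;Page&lt;/code&gt;, or anything exposing the same event API):&lt;/p&gt;

```javascript
// Attach request/response logging to a Playwright page.
// The handlers fire as the page makes network calls.
function attachNetworkLogging(page) {
  page.on('request', (request) => {
    console.log('>>', request.method(), request.url());
  });
  page.on('response', (response) => {
    console.log('<<', response.status(), response.url());
  });
}
```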

&lt;h3&gt;
  
  
  &lt;strong&gt;Code including error handling and debugging&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, let's update our previous code to include the error handling and debugging techniques we discussed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { chromium } from 'playwright';

async function submitForm() {
    const browser = await chromium.launch({ headless: false });
    const context = await browser.newContext();
    const page = await context.newPage();

    await page.goto('https://forms.gle/7rhchFiZF2faMQxz7'); // Replace with the actual URL of the form

    await page.waitForSelector('div.lRwqcd &amp;gt; div', {
        state: 'visible',
    });

    try {
        // Select the form fields we want to target
        const firstNameField = page.locator('input[aria-labelledby="i1"]');
        const emailField = page.locator('input[aria-labelledby="i5"]');
        const addressField = page.locator('textarea[aria-labelledby="i9"]');
        const phoneField = page.locator('input[aria-labelledby="i13"]');
        const commentsField = page.locator('textarea[aria-labelledby="i17"]');
        const checkboxField = page.locator('div#i26');
        const submitButton = page.locator('div.lRwqcd &amp;gt; div');

        // Fill the form fields
        console.log('Filling in the first name field...');
        await firstNameField.fill('John');
        console.log('Filling in the email field...');
        await emailField.fill('john@gmail.com');
        console.log('Filling in the address field...');
        await addressField.fill("John's Street");
        console.log('Filling in the phone number field...');
        await phoneField.fill('11111111');
        console.log('Filling in the comments field...');
        await commentsField.fill('This form was submitted automatically.');
        console.log('Checking the box ...');
        await checkboxField.check();

        // Take screenshot of the completed form
        await page.screenshot({ path: 'screenshot.png' });

        // Submit the form
        await submitButton.click();

        // Wait before closing the browser
        await page.waitForTimeout(2000); // Wait for 2 seconds before the next submission

        // Close the browser
        await browser.close();
    } catch (error) {
        console.error('Oops, something went wrong:', error);
    }
}

for (let i = 0; i &amp;lt; 5; i++) {
    await submitForm();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Advanced techniques&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this section, we'll explore some advanced techniques for form automation with Playwright. These techniques can help us handle more complex scenarios and overcome common challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling dynamic forms&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some forms may have fields that appear or disappear dynamically based on user interactions or other factors. We can handle such forms by employing techniques like waiting for specific elements to appear or disappear using Playwright's &lt;a href="https://playwright.dev/docs/api/class-elementhandle#element-handle-wait-for-selector" rel="noopener noreferrer"&gt;&lt;code&gt;waitForSelector()&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://playwright.dev/docs/api/class-frame#frame-wait-for-function" rel="noopener noreferrer"&gt;&lt;code&gt;waitForFunction()&lt;/code&gt;&lt;/a&gt; methods.&lt;/p&gt;
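&lt;p&gt;For instance, a sketch of waiting for a field that is only revealed after some other interaction (the &lt;code&gt;#extra-field&lt;/code&gt; selector and the condition are hypothetical examples):&lt;/p&gt;

```javascript
// Wait for a dynamically revealed field before interacting with it.
// The selector and condition below are hypothetical examples.
async function fillDynamicField(page, value) {
  // Wait until the element is attached and visible on the page
  await page.waitForSelector('#extra-field', { state: 'visible' });

  // Alternatively, wait for an arbitrary condition evaluated in the page
  await page.waitForFunction(() => document.querySelector('#extra-field') !== null);

  await page.locator('#extra-field').fill(value);
}
```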

&lt;h3&gt;
  
  
  &lt;strong&gt;Working with CAPTCHAs and anti-bot measures&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Websites often employ &lt;a href="https://blog.apify.com/why-captchas-are-bad/" rel="noopener noreferrer"&gt;CAPTCHAs&lt;/a&gt; or other anti-bot measures to prevent automated interactions. Automating forms that include CAPTCHAs can be challenging. Consider using third-party services or libraries specifically designed to bypass CAPTCHAs or explore browser automation techniques like mouse movements or human-like delays to &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;mimic user behavior&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling file uploads&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If the form includes file upload fields, we can automate file uploads using Playwright's &lt;a href="https://playwright.dev/docs/api/class-elementhandle#element-handle-set-input-files" rel="noopener noreferrer"&gt;&lt;code&gt;setInputFiles()&lt;/code&gt;&lt;/a&gt; method. Specify the path to the file we want to upload, and Playwright will handle the file selection process for us.&lt;/p&gt;
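&lt;p&gt;A sketch of what that looks like (the selector and file paths are hypothetical):&lt;/p&gt;

```javascript
// Upload files through an <input type="file"> element.
// The selector and paths are hypothetical; point them at your own form and files.
async function uploadFiles(page) {
  const fileInput = page.locator('input[type="file"]');

  await fileInput.setInputFiles('documents/resume.pdf');        // single file
  await fileInput.setInputFiles(['photo1.png', 'photo2.png']);  // multiple files
  await fileInput.setInputFiles([]);                            // clear the selection
}
```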

&lt;h3&gt;
  
  
  &lt;strong&gt;Navigating between pages&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Sometimes, form automation may require navigating between multiple pages or steps. Use Playwright's page navigation methods like &lt;a href="https://playwright.dev/docs/api/class-frame#frame-goto" rel="noopener noreferrer"&gt;&lt;code&gt;goto()&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://playwright.dev/docs/input#mouse-click" rel="noopener noreferrer"&gt;&lt;code&gt;click()&lt;/code&gt;&lt;/a&gt; to move between pages and perform interactions on each page as needed.&lt;/p&gt;
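&lt;p&gt;For example, a two-step form might be handled like this (the URL and selectors are hypothetical):&lt;/p&gt;

```javascript
// Sketch of a multi-step form: fill step one, click "Next", then fill step two.
// The URL and selectors are hypothetical examples.
async function fillTwoStepForm(page) {
  await page.goto('https://www.example.com/form');

  await page.locator('input[name="firstName"]').fill('John');
  await page.locator('button#next').click();

  // Wait for the second step to render before interacting with it
  await page.waitForSelector('input[name="phone"]');
  await page.locator('input[name="phone"]').fill('11111111');
}
```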

&lt;h3&gt;
  
  
  &lt;strong&gt;Parallelization and performance optimization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To improve performance and reduce execution time, we can employ parallelization techniques. For example, we can use multiple browser contexts or instances of Playwright to run form automation in parallel, which is especially useful when dealing with a large number of forms or with submissions that take a long time to process.&lt;/p&gt;
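&lt;p&gt;A sketch of this idea, assuming a &lt;code&gt;fillAndSubmit(page, record)&lt;/code&gt; helper along the lines of what we built earlier:&lt;/p&gt;

```javascript
// Run one submission per record concurrently, each in its own browser context.
// fillAndSubmit is an assumed helper that fills and submits the form on a page.
async function submitInParallel(browser, records, fillAndSubmit) {
  await Promise.all(records.map(async (record) => {
    const context = await browser.newContext(); // isolated cookies/session per record
    const page = await context.newPage();
    try {
      await fillAndSubmit(page, record);
    } finally {
      await context.close(); // always clean up, even if a submission fails
    }
  }));
}
```

&lt;p&gt;Each context behaves like a separate incognito profile, so parallel submissions don't share cookies or state.&lt;/p&gt;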

&lt;p&gt;Remember, advanced techniques depend on the specific requirements and challenges of the forms you're automating. It's essential to understand the unique aspects of each form and apply the appropriate techniques accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;That's a wrap! Now you know how to automate forms with Playwright 🦾&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we explored how to automate forms using Playwright and JavaScript. We covered the essential steps to build a form-filling bot and provided explanations and code examples along the way. By now, you should have a solid understanding of automating forms using Playwright!&lt;/p&gt;

&lt;p&gt;We also discussed the importance of form automation in various real-world scenarios, including data collection, form testing, and automating repetitive tasks. By automating form submissions, you can save time, reduce errors, and increase efficiency.&lt;/p&gt;

&lt;p&gt;Now that you have learned how to automate forms using Playwright and JavaScript, feel free to apply these techniques to your own projects and explore further possibilities for automation.&lt;/p&gt;

&lt;p&gt;And remember to always respect the terms of service and usage policies of the websites you are automating. Use form automation responsibly and ethically, ensuring that your actions comply with legal and ethical standards. In other words, take Uncle Ben's advice to heart: " &lt;strong&gt;&lt;em&gt;With great power comes great responsibility&lt;/em&gt;&lt;/strong&gt;" 🕷&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔍&lt;/strong&gt; &lt;a href="https://blog.apify.com/playwright-vs-puppeteer-which-is-better/" rel="noopener noreferrer"&gt;&lt;strong&gt;Playwright vs. Puppeteer: which is better?&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/playwright-vs-puppeteer-which-is-better/" rel="noopener noreferrer"&gt;Two powerful Node.js libraries: described and&lt;/a&gt;&lt;a href="https://blog.apify.com/playwright-vs-puppeteer-which-is-better/" rel="noopener noreferrer"&gt;&lt;strong&gt;Playwright vs. Puppeteer: which is better?&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>automation</category>
      <category>playwright</category>
      <category>node</category>
    </item>
    <item>
      <title>What are the best Python web scraping libraries?</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Mon, 22 May 2023 12:15:58 +0000</pubDate>
      <link>https://dev.to/apify/what-are-the-best-python-web-scraping-libraries-2g4l</link>
      <guid>https://dev.to/apify/what-are-the-best-python-web-scraping-libraries-2g4l</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-web-scraping-and-web-scraping-tools/"&gt;Web scraping&lt;/a&gt; is essentially a way to automate the process of extracting data from the web, and as a Python developer, you have access to some of the best libraries and frameworks available to help you get the job done.&lt;/p&gt;

&lt;p&gt;We're going to take a look at some of the most popular Python libraries and frameworks for web scraping and compare their pros and cons, so you know exactly what tool to use to tackle any web scraping project you might come across.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/r_n5_8NtHVc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;HTTP Libraries - Requests and HTTPX&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;First up, let's talk about HTTP libraries. These are the foundation of web scraping since every scraping job starts by making a request to a website and retrieving its contents, usually as HTML.&lt;/p&gt;

&lt;p&gt;Two popular HTTP libraries in Python are Requests and HTTPX.&lt;/p&gt;

&lt;p&gt;Requests is easy to use and great for simple scraping tasks, while HTTPX offers some advanced features like async and HTTP/2 support.&lt;/p&gt;

&lt;p&gt;Their core functionality and syntax are very similar, so I would recommend HTTPX even for smaller projects since you can easily scale up in the future without compromising performance.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;HTTPX&lt;/th&gt;
&lt;th&gt;Requests&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Asynchronous&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTTP/2 support&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timeout support&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Proxy support&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TLS verification&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom exceptions&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Parsing HTML with Beautiful Soup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once you have the HTML content, you need a way to parse it and extract the data you're interested in.&lt;/p&gt;

&lt;p&gt;Beautiful Soup is the most popular HTML parser in Python, allowing you to easily navigate and search through the HTML tree structure. Its straightforward syntax and easy setup also make Beautiful Soup a great option for small to medium web scraping projects as well as web scraping beginners.&lt;/p&gt;

&lt;p&gt;The two major drawbacks of Beautiful Soup are its inability to scrape JavaScript-heavy websites and its limited scalability, which results in low performance in large-scale projects. For large projects, you would be better off using Scrapy, but more about that later.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-with-beautiful-soup/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--onzJfXsH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/03/6502423.jpg" height="800" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-with-beautiful-soup/" rel="noopener noreferrer" class="c-link"&gt;
          Web scraping with Beautiful Soup and Requests
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Detailed tutorial with code examples. And some handy tricks.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Next, let's take a look at how Beautiful Soup works in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;httpx&lt;/span&gt;

&lt;span class="c1"&gt;# Send an HTTP GET request to the specified URL using the httpx library
&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;https://news.ycombinator.com/news&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save the content of the response
&lt;/span&gt;
&lt;span class="n"&gt;yc_web_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="c1"&gt;# Use the BeautifulSoup library to parse the HTML content of the webpage
&lt;/span&gt;
&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yc_web_page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Find all elements with the class "athing" (which represent articles on Hacker News) using the parsed HTML
&lt;/span&gt;
&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"athing"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Loop through each article and extract relevant data, such as the URL, title, and rank
&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="s"&gt;"URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"titleline"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"a"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'href'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# Find the URL of the article by finding the first "a" tag within the element with class "titleline"
&lt;/span&gt;
&lt;span class="s"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"titleline"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;# Find the title of the article by getting the text content of the element with class "titleline"
&lt;/span&gt;
&lt;span class="s"&gt;"rank"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"rank"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Find the rank of the article by getting the text content of the element with class "rank" and removing the period character
&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Print the extracted data for the current article
&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Explaining the code:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1 - We start by sending an HTTP GET request to the specified URL using the HTTPX library. Then, we save the retrieved content to a variable.&lt;/p&gt;

&lt;p&gt;2 - Now, we use the Beautiful Soup library to parse the HTML content of the webpage.&lt;/p&gt;

&lt;p&gt;3 - This enables us to manipulate the parsed content using Beautiful Soup methods, such as &lt;code&gt;find_all&lt;/code&gt; to find the content we need. In this particular case, we are finding all elements with the class &lt;code&gt;athing&lt;/code&gt;, which represents articles on Hacker News.&lt;/p&gt;

&lt;p&gt;4 - Next, we loop through all the articles on the page and use Beautiful Soup's &lt;code&gt;find&lt;/code&gt; methods with class selectors to pinpoint the data we want to extract from each article. Finally, we print the scraped data to the console.&lt;/p&gt;
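&lt;p&gt;The four steps above can be condensed into a single runnable sketch. To keep it self-contained, it parses an inline HTML snippet that mimics Hacker News markup (an assumption for this sketch) instead of fetching the live page; to run it against the real site, swap the snippet for the text of an &lt;code&gt;httpx&lt;/code&gt; response:&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# Inline stand-in for httpx.get("https://news.ycombinator.com").text
html = """
<table>
  <tr class="athing">
    <td><span class="rank">1.</span></td>
    <td class="titleline"><a href="https://example.com/post">Example article</a></td>
  </tr>
</table>
"""

# Parse the HTML content with Beautiful Soup
soup = BeautifulSoup(html, "html.parser")

# Every article row on Hacker News has the class "athing"
articles = soup.find_all(class_="athing")

results = []
for article in articles:
    results.append({
        "URL": article.find(class_="titleline").find("a").get("href"),
        "title": article.find(class_="titleline").get_text(),
        "rank": article.find(class_="rank").get_text().replace(".", ""),
    })

print(results)
```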
&lt;h2&gt;
  
  
  &lt;strong&gt;Browser automation libraries - Selenium and Playwright&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What if the website you're scraping relies on JavaScript to load its content? In that case, an HTML parser won't be enough: you'll need to launch a browser instance to execute the page's JavaScript using a browser automation tool like &lt;a href="https://blog.apify.com/playwright-vs-selenium-webscraping/"&gt;Selenium or Playwright&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These are primarily testing and automation tools that allow you to control a web browser programmatically, including clicking buttons, filling out forms, and more. However, they are also often used in web scraping as a means to access dynamically generated data on a webpage.&lt;/p&gt;

&lt;p&gt;While Selenium and Playwright are very similar in their core functionality, Playwright is more modern and complete than Selenium.&lt;/p&gt;

&lt;p&gt;For example, Playwright offers some unique built-in features, such as automatically waiting for elements to be visible before performing actions, and an asynchronous version of its API built on &lt;code&gt;asyncio&lt;/code&gt;.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/what-is-playwright/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--dCMcBjS3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2022/10/Playwright-automation.jpg" height="600" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/what-is-playwright/" rel="noopener noreferrer" class="c-link"&gt;
          What is Playwright automation?
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Learn why Playwright is ideal for web scraping and automation.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;To illustrate how we can use Playwright for web scraping, let's quickly walk through a code snippet where we use Playwright to extract data from an Amazon product page and save a screenshot of it along the way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;asyncio&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a dictionary with the scraped data
&lt;/span&gt;
&lt;span class="n"&gt;selectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'#productTitle'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'span.author a'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'#productSubtitle'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'.a-size-base.a-color-price.a-color-price'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;book_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sel&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;selectors&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;book&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"book_title"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"author"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"edition"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;elem&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;book_data&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"book.png"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Explaining the code:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Import the necessary modules: &lt;code&gt;asyncio&lt;/code&gt; and &lt;code&gt;async_playwright&lt;/code&gt; from Playwright's async API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After importing the necessary modules, we define an async function called &lt;code&gt;main&lt;/code&gt; that launches a Firefox browser instance with &lt;code&gt;headless&lt;/code&gt; mode set to &lt;code&gt;False&lt;/code&gt; so we can actually see the browser working. It then creates a new page in the browser using the &lt;code&gt;new_page&lt;/code&gt; method and finally navigates to the Amazon website using the &lt;code&gt;goto&lt;/code&gt; method.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Next, we define a list of CSS selectors for the data we want to scrape. Then we use &lt;code&gt;asyncio.gather&lt;/code&gt; to execute the &lt;code&gt;page.query_selector&lt;/code&gt; method concurrently for all the selectors in the list and store the results in a &lt;code&gt;book_data&lt;/code&gt; variable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Now we can iterate over &lt;code&gt;book_data&lt;/code&gt; to populate the &lt;code&gt;book&lt;/code&gt; dictionary with the scraped data. Note that we filter out elements that are &lt;code&gt;None&lt;/code&gt; before unpacking, so if a selector stops matching, the unpacking fails loudly instead of silently storing bad data. This matters in practice because websites make small changes that can break your scraper, and you could expand on this example with more thorough checks to ensure the extracted data is not missing any values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, we print the &lt;code&gt;book&lt;/code&gt; dictionary contents to the console and take a screenshot of the scraped page, saving it as a file called &lt;code&gt;book.png&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As a last step, we make sure to close the browser instance.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
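&lt;p&gt;One caveat worth sketching: the tuple unpacking in the example assumes all four selectors matched, so if any element is filtered out as &lt;code&gt;None&lt;/code&gt;, the unpacking raises a &lt;code&gt;ValueError&lt;/code&gt;. A plain-Python alternative (a hypothetical helper, not part of the original script) pairs field names with elements and simply skips missing fields:&lt;/p&gt;

```python
def build_record(field_names, elements):
    """Pair field names with scraped elements, skipping fields whose
    selector matched nothing (element is None) instead of crashing."""
    return {name: elem for name, elem in zip(field_names, elements) if elem is not None}

# Hypothetical values standing in for the awaited inner_text() results;
# None simulates a selector that found no element on the page.
fields = ["book_title", "author", "edition", "price"]
texts = ["The Hitchhiker's Guide to the Galaxy", "Douglas Adams", None, "$5.99"]

book = build_record(fields, texts)
print(book)  # "edition" is simply absent rather than raising an error
```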


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/how-to-scrape-the-web-with-playwright-ece1ced75f73/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--w_BPB1GY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/03/How_to_scrape_web_with_Playwright.png" height="450" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/how-to-scrape-the-web-with-playwright-ece1ced75f73/" rel="noopener noreferrer" class="c-link"&gt;
          How to scrape the web with Playwright in 2023
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Complete Playwright web scraping and crawling tutorial.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;p&gt;But wait! If browser automation tools can scrape virtually any webpage and, on top of that, make it easier for you to automate tasks and test and visualize your code working, why don't we just always use Playwright or Selenium for web scraping?&lt;/p&gt;

&lt;p&gt;Well, despite being powerful scraping tools, these libraries and frameworks have a noticeable drawback: &lt;strong&gt;launching a browser instance is a very resource-heavy operation compared to simply retrieving a page's HTML&lt;/strong&gt;. This can easily become a major performance bottleneck for large scraping jobs, which will not only take longer to complete but also become considerably more expensive. For that reason, we usually want to limit these tools to the tasks that truly need them and, when possible, &lt;strong&gt;use them together with&lt;/strong&gt; &lt;a href="https://blog.apify.com/beautiful-soup-vs-scrapy-web-scraping/"&gt;&lt;strong&gt;Beautiful Soup or Scrapy&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
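&lt;p&gt;One common hybrid pattern (a sketch, not the article's code) is to use the browser only to render the JavaScript and hand the resulting HTML to Beautiful Soup for parsing. The parsing half below is pure Python; the fetching half assumes Playwright is installed and is only needed for JavaScript-heavy pages:&lt;/p&gt;

```python
from bs4 import BeautifulSoup

def extract_titles(html):
    # Pure parsing step: works on any HTML string, no browser needed.
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text() for a in soup.select("h2 a")]

def fetch_rendered_html(url):
    # Browser step: assumes `pip install playwright` and `playwright install`.
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()  # fully rendered DOM, including JS-generated markup
        browser.close()
    return html

# Demo with a static snippet; for a real page you would call
# extract_titles(fetch_rendered_html(url)) instead.
sample = "<h2><a href='/a'>First</a></h2><h2><a href='/b'>Second</a></h2>"
print(extract_titles(sample))
```

&lt;p&gt;This way, the expensive browser work is isolated in one function and the parsing logic stays cheap and easy to test.&lt;/p&gt;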


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-with-scrapy/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--hcYQnnyB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/04/SCRAPING....png" height="450" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-with-scrapy/" rel="noopener noreferrer" class="c-link"&gt;
          Web Scraping with Scrapy
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          A hands-on guide for web scraping with Scrapy.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Scrapy&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Next up, we have the most popular and arguably the most powerful web scraping framework for Python.&lt;/p&gt;

&lt;p&gt;If you find yourself needing to scrape large amounts of data regularly, then Scrapy could be a great option.&lt;/p&gt;

&lt;p&gt;The Scrapy framework offers a full-fledged suite of tools to aid you even in the most complex scraping jobs.&lt;/p&gt;

&lt;p&gt;On top of its superior performance when compared to Beautiful Soup, Scrapy can also be easily integrated into other data-processing Python tools and even other libraries, such as Playwright.&lt;/p&gt;

&lt;p&gt;Not only that, but it comes with a handy collection of built-in features catered specifically to web scraping, such as:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Powerful and flexible spidering framework&lt;/td&gt;
&lt;td&gt;Scrapy provides a built-in spidering framework that allows you to easily define and customize web crawlers to extract the data you need.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast and efficient&lt;/td&gt;
&lt;td&gt;Scrapy is designed to be fast and efficient, allowing you to extract data from large websites quickly and with minimal resource usage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support for handling common web data formats&lt;/td&gt;
&lt;td&gt;Export data in multiple formats such as HTML, XML, and JSON.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extensible architecture&lt;/td&gt;
&lt;td&gt;Easily add custom functionality through middleware, pipelines, and extensions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distributed scraping&lt;/td&gt;
&lt;td&gt;Scrapy supports distributed scraping, allowing you to scale up your web scraping operation across multiple machines.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error handling&lt;/td&gt;
&lt;td&gt;Scrapy has robust error-handling capabilities, allowing you to handle common errors and exceptions that may occur during web scraping.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support for authentication and cookies&lt;/td&gt;
&lt;td&gt;Supports handling authentication and cookies to scrape websites that require login credentials.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration with other Python tools&lt;/td&gt;
&lt;td&gt;Scrapy can be easily integrated with other Python tools, such as data processing and storage libraries, making it a powerful tool for end-to-end data processing pipelines.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's an example of how to use a Scrapy Spider to scrape data from a website:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsSpiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'hackernews_spider'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'news.ycombinator.com'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'&amp;lt;http://news.ycombinator.com/&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'tr.athing'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="s"&gt;"URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".titleline a::attr(href)"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="s"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".titleline a::text"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="s"&gt;"rank"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".rank::text"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We can use the following command to run this script and save the resulting data to a JSON file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt; &lt;span class="n"&gt;crawl&lt;/span&gt; &lt;span class="n"&gt;hackernews&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="n"&gt;hackernews&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Explaining the code:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The code example uses Scrapy to scrape data from the Hacker News website (&lt;a href="http://news.ycombinator.com"&gt;news.ycombinator.com&lt;/a&gt;). Let's break down the code step by step:&lt;/p&gt;

&lt;p&gt;After importing the necessary modules, we define the Spider class we want to use:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsSpiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next, we set the Spider properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;name&lt;/code&gt;: The name of the spider (used to identify it).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;allowed_domains&lt;/code&gt;: A list of domains that the spider is allowed to crawl&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;start_urls&lt;/code&gt;: A list of URLs to start crawling from.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'hackernews_spider'&lt;/span&gt;
&lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'news.ycombinator.com'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'&amp;lt;http://news.ycombinator.com/&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then, we define the &lt;code&gt;parse&lt;/code&gt; method: This method is the entry point for the spider and is called with the response of the URLs specified in &lt;code&gt;start_urls&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In the &lt;code&gt;parse&lt;/code&gt; method, we extract data from the HTML response. The &lt;code&gt;response&lt;/code&gt; object represents the HTML page received from the website, and the spider uses CSS selectors to extract the relevant data from the HTML structure.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'tr.athing'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now we use a for loop to iterate over each article found on the page.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, for each article, the spider extracts the URL, title, and rank information using CSS selectors and yields a Python dictionary containing this data.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s"&gt;"URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".titleline a::attr(href)"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="s"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".titleline a::text"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="s"&gt;"rank"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".rank::text"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
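&lt;p&gt;A small caveat about the &lt;code&gt;rank&lt;/code&gt; field: &lt;code&gt;.get()&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt; when the selector matches nothing, which would make the chained &lt;code&gt;.replace&lt;/code&gt; raise an &lt;code&gt;AttributeError&lt;/code&gt;, and the cleaned value is still a string. A hypothetical helper (not part of the original spider) makes that step more robust:&lt;/p&gt;

```python
def clean_rank(raw):
    """Turn the raw '.rank::text' value (e.g. "1.") into an int.
    Returns None instead of crashing when the selector matched nothing."""
    if raw is None:
        return None
    return int(raw.rstrip("."))

print(clean_rank("1."))   # 1
print(clean_rank(None))   # None
```

&lt;p&gt;Inside the spider, you would call it as &lt;code&gt;clean_rank(article.css(".rank::text").get())&lt;/code&gt;.&lt;/p&gt;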



&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/alternatives-scrapy-web-scraping/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--V1ydlS9C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/2023/02/Scrapy-alternatives-for-web-scraping-2.png" height="450" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/alternatives-scrapy-web-scraping/" rel="noopener noreferrer" class="c-link"&gt;
          Scrapy alternatives: other web scraping libraries to try
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          5 Scrapy alternatives for web scraping you need to try.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_zdUqT4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://blog.apify.com/content/images/size/w256h256/2021/03/favicon-128x128.png" width="128" height="128"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Which Python scraping library is right for you?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So, which library should you use for your web scraping project? The answer depends on the specific needs and requirements of your project. Each web scraping library and framework presented here has a unique purpose in an expert scraper's toolkit. Learning to use each one will give you the flexibility to select the best tool for each job, so don't be afraid to try each of them before deciding!&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/C8DmvJQS3jk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Whether you are scraping with BeautifulSoup, Scrapy, Selenium, or Playwright, the Apify Python SDK helps you run your project in the cloud at any scale.&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>selenium</category>
      <category>playwright</category>
    </item>
    <item>
      <title>How to parse JSON with Python</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Thu, 18 May 2023 14:04:51 +0000</pubDate>
      <link>https://dev.to/apify/how-to-parse-json-with-python-412a</link>
      <guid>https://dev.to/apify/how-to-parse-json-with-python-412a</guid>
      <description>&lt;p&gt;Understand JSON structure and syntax, and learn how to parse JSON strings and files using Python's built-in json module and convert JSON files using Pandas.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is JSON?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON?ref=blog.apify.com" rel="noopener noreferrer"&gt;JSON (JavaScript Object Notation)&lt;/a&gt; is a lightweight data-interchange format that is easy for humans to read and write while also being easy for machines to parse and generate. It is widely used for transmitting data between a client and a server, as an alternative to XML.&lt;/p&gt;

&lt;p&gt;JSON data is represented as a collection of key-value pairs, where the keys are strings and the values can be any valid JSON data type, such as a &lt;code&gt;string&lt;/code&gt;, &lt;code&gt;number&lt;/code&gt;, &lt;code&gt;boolean&lt;/code&gt;, &lt;code&gt;null&lt;/code&gt;, &lt;code&gt;array&lt;/code&gt;, or &lt;code&gt;object&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"John Doe"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"city"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"New York"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;age&lt;/code&gt;, and &lt;code&gt;city&lt;/code&gt; are the keys, and "John Doe", 30, and "New York" are the corresponding values.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;How to parse JSON strings in Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To parse a JSON string in Python, we can use the built-in &lt;code&gt;json&lt;/code&gt; module. This module provides two methods for working with JSON data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;json.loads()&lt;/code&gt; parses a JSON string and returns a Python object.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;json.dumps()&lt;/code&gt; takes a Python object and returns a JSON string.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is an example of how to use &lt;code&gt;json.loads()&lt;/code&gt; to parse a JSON string:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# JSON string
&lt;/span&gt;&lt;span class="n"&gt;json_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 30, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# parse JSON string
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# print Python object
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we import the &lt;code&gt;json&lt;/code&gt; module, define a JSON string, and use &lt;code&gt;json.loads()&lt;/code&gt; to parse it into a Python object. We then print the resulting Python object.&lt;/p&gt;

&lt;p&gt;Note that &lt;code&gt;json.loads()&lt;/code&gt; will raise a &lt;code&gt;json.decoder.JSONDecodeError&lt;/code&gt; exception if the input string is not valid JSON.&lt;/p&gt;

&lt;p&gt;After running the script above, we can expect the following output printed to the console:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;'name':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'John'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'age':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'city':&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;'New&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;York'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
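&lt;p&gt;To see that error handling in practice, here is a minimal sketch (the invalid string below is just an illustration) that catches the exception instead of letting it crash the script:&lt;/p&gt;

```python
import json

# Single quotes are not valid JSON, so this string will fail to parse
invalid_json = "{'name': 'John'}"

try:
    data = json.loads(invalid_json)
except json.JSONDecodeError as e:
    # The exception message includes the position of the problem
    print(f"Invalid JSON: {e}")
```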

&lt;h2&gt;
  
  
  &lt;strong&gt;How to read and parse JSON files in Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To parse a JSON file in Python, we can use the same &lt;code&gt;json&lt;/code&gt; module we used in the previous section. The only difference is that instead of passing a JSON string to &lt;code&gt;json.loads()&lt;/code&gt;, we open the file and pass the file object to &lt;code&gt;json.load()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For example, assume we have a file named &lt;code&gt;data.json&lt;/code&gt; that we would like to parse and read. Here's how we would do it:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# open JSON file
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# parse JSON data
&lt;/span&gt;    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# print Python object
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we use the &lt;code&gt;open()&lt;/code&gt; function to open a file called &lt;code&gt;data.json&lt;/code&gt; in read mode. We then pass the file object to &lt;code&gt;json.load()&lt;/code&gt;, which parses the JSON data and returns a Python object. Finally, we print the resulting Python object.&lt;/p&gt;

&lt;p&gt;Note that if the file does not contain valid JSON, &lt;code&gt;json.load()&lt;/code&gt; will raise a &lt;code&gt;json.decoder.JSONDecodeError&lt;/code&gt; exception.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;How to pretty print JSON data in Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When working with JSON data in Python, it can often be helpful to &lt;em&gt;pretty print&lt;/em&gt; the data, which means to format it in a more human-readable way. The &lt;code&gt;json&lt;/code&gt; module provides a method called &lt;code&gt;json.dumps()&lt;/code&gt; that can be used to pretty print JSON data.&lt;/p&gt;

&lt;p&gt;Here is an example of how to pretty print JSON data in Python:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# define JSON data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hobbies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reading&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traveling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cooking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# pretty print JSON data
&lt;/span&gt;&lt;span class="n"&gt;pretty_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# print pretty JSON
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pretty_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hobbies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reading&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traveling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cooking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we define a Python dictionary representing JSON data, and then use &lt;code&gt;json.dumps()&lt;/code&gt; with the &lt;code&gt;indent&lt;/code&gt; argument set to 4 to pretty print the data. We then print the resulting pretty-printed JSON string.&lt;/p&gt;

&lt;p&gt;Note that &lt;code&gt;indent&lt;/code&gt; is an optional argument to &lt;code&gt;json.dumps()&lt;/code&gt; that specifies the number of spaces to use for indentation. If &lt;code&gt;indent&lt;/code&gt; is not specified, the JSON data will be printed without any indentation.&lt;/p&gt;
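&lt;p&gt;If we want to save the pretty-printed output to a file instead of printing it, the related &lt;code&gt;json.dump()&lt;/code&gt; method serializes a Python object directly to a file object and accepts the same &lt;code&gt;indent&lt;/code&gt; argument. A minimal sketch (the filename is illustrative):&lt;/p&gt;

```python
import json

data = {"name": "John", "age": 30, "city": "New York"}

# json.dump() writes straight to a file object; indent works
# the same way as in json.dumps()
with open("pretty_data.json", "w") as f:
    json.dump(data, f, indent=4)

# Reading it back with json.load() recovers the original object
with open("pretty_data.json") as f:
    print(json.load(f))
```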
&lt;h2&gt;
  
  
  &lt;strong&gt;How to parse JSON with Python Pandas&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In addition to the built-in &lt;code&gt;json&lt;/code&gt; package, we can also use &lt;code&gt;pandas&lt;/code&gt; to parse and work with JSON data in Python. &lt;code&gt;pandas&lt;/code&gt; provides a method called &lt;code&gt;pandas.read_json()&lt;/code&gt; that can read JSON data into a DataFrame.&lt;/p&gt;

&lt;p&gt;Compared to using the built-in &lt;code&gt;json&lt;/code&gt; package, working with &lt;code&gt;pandas&lt;/code&gt; can be easier and more convenient when we want to analyze and manipulate the data further, as it allows us to use the powerful and flexible &lt;code&gt;DataFrame&lt;/code&gt; object.&lt;/p&gt;

&lt;p&gt;Here is an example of how to parse JSON data with &lt;code&gt;pandas&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# define JSON data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jane&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# convert JSON to DataFrame using pandas
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# print DataFrame
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
   name  age      city
0  John   30  New York
1  Jane   25    London
2   Bob   35     Paris

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we define a Python dictionary representing JSON data, and use &lt;code&gt;json.dumps()&lt;/code&gt; to convert it to a JSON string. We then use &lt;code&gt;pandas.read_json()&lt;/code&gt; to read the JSON string into a DataFrame. Finally, we print the resulting DataFrame.&lt;/p&gt;

&lt;p&gt;One benefit of using &lt;code&gt;pandas&lt;/code&gt; to parse JSON data is that we can easily manipulate the resulting DataFrame, for example by selecting columns, filtering rows, or grouping data.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# define JSON data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jane&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# convert JSON to DataFrame using pandas
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# select columns
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

&lt;span class="c1"&gt;# filter rows
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# print resulting DataFrame
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  name age
2 Bob 35

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we select only the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;age&lt;/code&gt; columns from the DataFrame, and filter out any rows where the age is less than or equal to 30.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;pandas&lt;/code&gt; to parse and work with JSON data in Python can be a convenient and powerful alternative to using the built-in &lt;code&gt;json&lt;/code&gt; package. It allows us to easily manipulate and analyze the data using the &lt;code&gt;DataFrame&lt;/code&gt; object, which offers a rich set of functionality for working with tabular data.&lt;/p&gt;
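&lt;p&gt;One caveat: &lt;code&gt;pandas.read_json()&lt;/code&gt; expects fairly flat, table-like JSON. For nested JSON, &lt;code&gt;pandas&lt;/code&gt; also offers &lt;code&gt;pandas.json_normalize()&lt;/code&gt;, which flattens nested objects into columns. A small sketch with made-up sample data:&lt;/p&gt;

```python
import pandas as pd

# A list of records with a nested "address" object
records = [
    {"name": "John", "address": {"city": "New York", "zip": "10001"}},
    {"name": "Jane", "address": {"city": "London", "zip": "SW1A"}},
]

# json_normalize flattens nested keys into dotted column names,
# e.g. "address.city"
df = pd.json_normalize(records)
print(df)
```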
&lt;h2&gt;
  
  
  &lt;strong&gt;How to convert JSON to CSV in Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Sometimes we might want to convert JSON data into a CSV format. Luckily, the &lt;code&gt;pandas&lt;/code&gt; library can also help us with that.&lt;/p&gt;

&lt;p&gt;We can use &lt;code&gt;pandas.read_json()&lt;/code&gt; to read JSON data into a DataFrame, followed by the &lt;code&gt;DataFrame.to_csv()&lt;/code&gt; method to write the DataFrame to a CSV file.&lt;/p&gt;

&lt;p&gt;Here is an example of how to convert JSON data to CSV in Python using &lt;code&gt;pandas&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# define JSON data
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Jane&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;New York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Paris&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# convert JSON to DataFrame
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# write DataFrame to CSV file
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# read CSV file
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# print DataFrame
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   name age city
0 John 30 New York
1 Jane 25 London
2 Bob 35 Paris

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we define a Python dictionary representing JSON data, and use &lt;code&gt;json.dumps()&lt;/code&gt; to convert it to a JSON string. We then use &lt;code&gt;pandas.read_json()&lt;/code&gt; to read the JSON string into a DataFrame, and &lt;code&gt;DataFrame.to_csv()&lt;/code&gt; to write the DataFrame to a CSV file. Finally, we use &lt;code&gt;pandas.read_csv()&lt;/code&gt; to read the CSV file back into a DataFrame and print the result.&lt;/p&gt;

&lt;p&gt;Note that when calling &lt;code&gt;to_csv()&lt;/code&gt;, we pass &lt;code&gt;index=False&lt;/code&gt; to exclude the row index from the output CSV file.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-python/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw1200%2F2024%2F02%2FPython-web-scraping_-a-comprehensive-guide.png" height="450" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/web-scraping-python/" rel="noopener noreferrer" class="c-link"&gt;
          Python web scraping tutorial
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          How to scrape &amp;amp; parse data with Python (with code examples)
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw256h256%2F2025%2F07%2Ffavicon.png" width="48" height="48"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>python</category>
      <category>pandas</category>
      <category>json</category>
    </item>
    <item>
      <title>Web Scraping with Scrapy</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Mon, 17 Apr 2023 18:57:21 +0000</pubDate>
      <link>https://dev.to/apify/web-scraping-with-scrapy-40em</link>
      <guid>https://dev.to/apify/web-scraping-with-scrapy-40em</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;👋 Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is Scrapy?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/scrapy/scrapy?ref=blog.apify.com" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; is an open-source web scraping framework written in Python that provides an easy-to-use API for web scraping, as well as built-in functionality for handling large-scale web scraping projects, support for different types of data extraction, and the ability to work with different web protocols.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why use Scrapy?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Scrapy is the preferred tool for large-scale scraping projects due to its &lt;a href="https://blog.apify.com/beautiful-soup-vs-scrapy-web-scraping" rel="noopener noreferrer"&gt;advantages over other popular Python web scraping libraries&lt;/a&gt; such as BeautifulSoup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/?ref=blog.apify.com" rel="noopener noreferrer"&gt;BeautifulSoup&lt;/a&gt; is primarily a parser library, whereas Scrapy is a complete web scraping framework with handy built-in functionalities such as dedicated spider types for different scraping tasks and the ability to extend Scrapys functionality by using middleware and exporting data to different formats.&lt;/p&gt;

&lt;p&gt;Some real-world examples where Scrapy can be useful include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;E-commerce websites:&lt;/strong&gt; Scrapy can be used to extract product information such as prices, descriptions, and reviews from e-commerce websites such as Amazon, Walmart, and Target.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Social media:&lt;/strong&gt; Scrapy can be used to extract data such as public user information and posts from popular social media websites like Twitter, Facebook, and Instagram.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Job boards:&lt;/strong&gt; Scrapy can be used to monitor job board websites such as Indeed, Glassdoor, and LinkedIn for relevant job postings.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's important to note that Scrapy has some limitations. For example, it cannot scrape JavaScript-heavy websites. However, we can easily overcome this limitation by using Scrapy alongside other tools like &lt;a href="https://blog.apify.com/playwright-vs-selenium-webscraping/" rel="noopener noreferrer"&gt;Selenium or Playwright&lt;/a&gt; to tackle those sites.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/beautiful-soup-vs-scrapy-web-scraping/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw1200%2F2024%2F02%2FScrapy-vs.-Beautiful-Soup--which-one-to-choose-for-web-scraping.png" height="449" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/beautiful-soup-vs-scrapy-web-scraping/" rel="noopener noreferrer" class="c-link"&gt;
          Scrapy vs. Beautiful Soup for web scraping
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Learn the differences between these Python scraping libraries.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw256h256%2F2025%2F07%2Ffavicon.png" width="48" height="48"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Alright, now that we have a good idea of what Scrapy is and why it's useful, let's dive deeper into Scrapy's main features.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🎁 Exploring Scrapy Features&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Types of Spiders 🕷&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One of the key features of Scrapy is the ability to create different &lt;a href="https://docs.scrapy.org/en/latest/topics/spiders.html?ref=blog.apify.com" rel="noopener noreferrer"&gt;types of spiders&lt;/a&gt;. Spiders are essentially the backbone of Scrapy and are responsible for parsing websites and extracting data. There are three main types of spiders in Scrapy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spider:&lt;/strong&gt; The base class for all spiders. This is the simplest type of spider and is used for extracting data from a single page or a small set of pages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CrawlSpider:&lt;/strong&gt; A more advanced type of spider that is used for extracting data from multiple pages or entire websites. CrawlSpider automatically follows links and extracts data from each page it visits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SitemapSpider:&lt;/strong&gt; A specialized type of spider that is used for extracting data from websites that have a sitemap.xml file. SitemapSpider automatically visits each URL in the sitemap and extracts data from it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is an example of how to create a &lt;strong&gt;basic Spider&lt;/strong&gt; in Scrapy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myspider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;http://example.com&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# extract data from response
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This spider, named &lt;code&gt;myspider&lt;/code&gt;, will start by requesting the URL &lt;a href="http://example.com" rel="noopener noreferrer"&gt;&lt;code&gt;http://example.com&lt;/code&gt;&lt;/a&gt;. The &lt;code&gt;parse&lt;/code&gt; method is where you would write code to extract data from the response.&lt;/p&gt;

&lt;p&gt;Here is an example of how to create a &lt;strong&gt;CrawlSpider&lt;/strong&gt; in Scrapy:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.linkextractors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinkExtractor&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.spiders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rule&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyCrawlSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mycrawlspider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;http://example.com&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_item&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# extract data from response
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This spider, named &lt;code&gt;mycrawlspider&lt;/code&gt;, will start by requesting the URL &lt;a href="http://example.com" rel="noopener noreferrer"&gt;&lt;code&gt;http://example.com&lt;/code&gt;&lt;/a&gt;. The &lt;code&gt;rules&lt;/code&gt; list contains one &lt;code&gt;Rule&lt;/code&gt; object that tells the spider to follow all links and call the &lt;code&gt;parse_item&lt;/code&gt; method on each response.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Extending Scrapy with Middlewares 🔗&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Middlewares allow us to &lt;a href="https://docs.scrapy.org/en/latest/topics/architecture.html?ref=blog.apify.com" rel="noopener noreferrer"&gt;extend Scrapy's functionality&lt;/a&gt;. Scrapy comes with several built-in middlewares that can be used out of the box.&lt;/p&gt;

&lt;p&gt;Additionally, we can write our own custom middleware to perform tasks like modifying request headers, logging, or handling exceptions. Let's take a look at some of the most commonly used Scrapy middlewares:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;UserAgentMiddleware:&lt;/strong&gt; This middleware allows you to set a custom User-Agent header for each request. This is useful for avoiding detection by websites that may block scraping bots based on the User-Agent header. To use this middleware, we can set it up in our Scrapy settings file like this:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;DOWNLOADER_MIDDLEWARES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy.downloadermiddlewares.useragent.UserAgentMiddleware&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy.downloadermiddlewares.useragent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we register our replacement user-agent middleware with a priority of &lt;code&gt;500&lt;/code&gt;; the priority value determines where the middleware runs relative to the others in the downloader middleware chain.&lt;/p&gt;

&lt;p&gt;By default, &lt;code&gt;UserAgentMiddleware&lt;/code&gt; sets the &lt;code&gt;User-Agent&lt;/code&gt; header of every request to the value of the &lt;code&gt;USER_AGENT&lt;/code&gt; setting in your Scrapy settings; it does not rotate user agents on its own, which is why a custom middleware is often used instead.&lt;/p&gt;

&lt;p&gt;Note that we first set &lt;code&gt;UserAgentMiddleware&lt;/code&gt; to &lt;code&gt;None&lt;/code&gt; before adding it to the &lt;code&gt;DOWNLOADER_MIDDLEWARES&lt;/code&gt; setting with a different priority.&lt;/p&gt;

&lt;p&gt;This is because the default &lt;code&gt;UserAgentMiddleware&lt;/code&gt; in Scrapy sets a generic user agent string for all requests, which may not be ideal for some scraping scenarios. If we need to use a custom user agent string, we'll need to customize the &lt;code&gt;UserAgentMiddleware&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Therefore, by setting &lt;code&gt;UserAgentMiddleware&lt;/code&gt; to &lt;code&gt;None&lt;/code&gt; first, we're telling Scrapy to remove the default &lt;code&gt;UserAgentMiddleware&lt;/code&gt; from the &lt;code&gt;DOWNLOADER_MIDDLEWARES&lt;/code&gt; setting before adding our own custom instance of the middleware with a different priority.&lt;/p&gt;
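&lt;p&gt;As a sketch of what that custom instance might look like (the class name and the user-agent strings below are hypothetical; you would register the class in &lt;code&gt;DOWNLOADER_MIDDLEWARES&lt;/code&gt; under your project's module path):&lt;/p&gt;

```python
import random

# Hypothetical pool of user-agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a random User-Agent on each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to keep processing the request normally
        return None
```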

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RetryMiddleware:&lt;/strong&gt; Scrapy comes with a &lt;code&gt;RetryMiddleware&lt;/code&gt; that can be used to retry failed requests. By default, it retries requests with HTTP status codes 500, 502, 503, 504, 408, and when an exception is raised. You can customize the behavior of this middleware by specifying the &lt;code&gt;RETRY_TIMES&lt;/code&gt; and &lt;code&gt;RETRY_HTTP_CODES&lt;/code&gt; settings. To use this middleware in its default configuration, you can simply add it to your Scrapy settings:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;DOWNLOADER_MIDDLEWARES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy.downloadermiddlewares.retry.RetryMiddleware&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;550&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
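&lt;p&gt;As an illustrative sketch (the specific values here are ours, not Scrapy's defaults), the retry behavior could then be tuned in &lt;code&gt;settings.py&lt;/code&gt;:&lt;/p&gt;

```python
# settings.py (sketch): retry failed requests up to 5 times instead of
# the default 2, and include 429 (Too Many Requests) in the retried codes
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
```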


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HttpProxyMiddleware:&lt;/strong&gt; This middleware allows you to use proxies to send requests. This is useful for avoiding detection and bypassing IP rate limits. To use this middleware, we can add it to our Scrapy settings file like this:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;DOWNLOADER_MIDDLEWARES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;110&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;myproject.middlewares.ProxyMiddleware&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;PROXY_POOL_ENABLED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will enable the &lt;code&gt;HttpProxyMiddleware&lt;/code&gt; and also enable the &lt;code&gt;ProxyMiddleware&lt;/code&gt; that we define. This middleware will select a random proxy for each request from a pool of proxies provided by the user.&lt;/p&gt;
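&lt;p&gt;Scrapy does not ship a &lt;code&gt;ProxyMiddleware&lt;/code&gt;; the one referenced above is something we write ourselves. A minimal sketch, with placeholder proxy URLs, might look like this:&lt;/p&gt;

```python
import random

# Placeholder proxy pool; replace these with your own proxy URLs
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]


class ProxyMiddleware:
    """Attaches a random proxy from the pool to each outgoing request.

    Scrapy's built-in HttpProxyMiddleware then reads request.meta['proxy']
    and routes the request through that proxy.
    """

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXIES)
```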

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CookiesMiddleware:&lt;/strong&gt; This middleware allows you to handle cookies sent by websites. By default, Scrapy stores cookies in memory, though they can be persisted elsewhere with a custom cookies middleware or a third-party extension. To add &lt;code&gt;CookiesMiddleware&lt;/code&gt; to the &lt;code&gt;DOWNLOADER_MIDDLEWARES&lt;/code&gt; setting, we simply specify the middleware class and its priority. In this case, we're using a priority of &lt;code&gt;700&lt;/code&gt;, which places it after the default &lt;code&gt;UserAgentMiddleware&lt;/code&gt; and &lt;code&gt;RetryMiddleware&lt;/code&gt; but before any custom middleware.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;DOWNLOADER_MIDDLEWARES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scrapy.downloadermiddlewares.cookies.CookiesMiddleware&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;700&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now we can use &lt;code&gt;CookiesMiddleware&lt;/code&gt; to handle cookies sent by the website:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;myspider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://www.example.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;start_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Send an initial request without cookies
&lt;/span&gt;            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract cookies from the response headers
&lt;/span&gt;        &lt;span class="n"&gt;cookies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getlist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Set-Cookie&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cookie&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Send a new request with the cookies received
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.example.com/protected&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cookies&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_protected&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_protected&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Process the protected page here
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;When the spider sends an initial request to &lt;a href="https://www.example.com/" rel="noopener noreferrer"&gt;&lt;code&gt;https://www.example.com/&lt;/code&gt;&lt;/a&gt;, we're not sending any cookies yet. When we receive the response, we extract the cookies from the response headers and send a new request to a protected page with the received cookies.&lt;/p&gt;

&lt;p&gt;These are just a few of the uses for middlewares in Scrapy. The beauty of middlewares is that we can write our own custom middleware to keep extending Scrapy's features and perform additional tasks that fit our specific use cases.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Exporting Scraped Data 📤&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Scrapy provides built-in support for &lt;a href="https://docs.scrapy.org/en/latest/topics/feed-exports.html?ref=blog.apify.com" rel="noopener noreferrer"&gt;exporting scraped data&lt;/a&gt; in different formats, such as CSV, JSON, and XML. You can also create your own custom exporters to store data in different formats.&lt;/p&gt;

&lt;p&gt;Here's an example of how to store scraped data in a CSV file in Scrapy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note that this is a very basic example, and the&lt;/strong&gt; &lt;code&gt;closed&lt;/code&gt; method could be modified to handle errors and ensure that the file is closed properly. Also, the code is merely explanatory, and you will have to adapt it to make it work for your use case.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.exporters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CsvItemExporter&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;example&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://www.example.com&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;//div[@class=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;item&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.//h2/text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.//p/text()&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;closed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w+b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CsvItemExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fields_to_export&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_item&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this example, we define a spider that starts by scraping the "&lt;a href="https://www.example.com" rel="noopener noreferrer"&gt;&lt;strong&gt;https://www.example.com&lt;/strong&gt;&lt;/a&gt;" URL. We then define a &lt;code&gt;parse&lt;/code&gt; method that extracts the title and description of each item on the page. Finally, in the &lt;code&gt;closed&lt;/code&gt; method, which Scrapy calls when the spider finishes, we define a filename for the CSV file and export the scraped data using the &lt;code&gt;CsvItemExporter&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Another way of exporting extracted data in different formats using Scrapy is to use the &lt;code&gt;scrapy crawl&lt;/code&gt; command and specify the desired file format of our output. This can be done by appending the &lt;code&gt;-o&lt;/code&gt; flag followed by the filename and extension of the output file.&lt;/p&gt;

&lt;p&gt;For example, if we want to output our scraped data in JSON format, we would use the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl myspider &lt;span class="nt"&gt;-o&lt;/span&gt; output.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will store the scraped data in a file named &lt;code&gt;output.json&lt;/code&gt; in the same directory where the command was executed. Similarly, if we want to output the data in CSV format, we would use the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl myspider &lt;span class="nt"&gt;-o&lt;/span&gt; output.csv

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will store the scraped data in a file named &lt;code&gt;output.csv&lt;/code&gt; in the same directory where the command was executed.&lt;/p&gt;
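&lt;p&gt;Alternatively, exports can be configured declaratively through the &lt;code&gt;FEEDS&lt;/code&gt; setting (available since Scrapy 2.1); the filenames below are illustrative:&lt;/p&gt;

```python
# settings.py (sketch): write the scraped items to two feeds at once,
# equivalent to passing -o twice on the command line
FEEDS = {
    "output.json": {"format": "json"},
    "output.csv": {"format": "csv"},
}
```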

&lt;p&gt;Overall, Scrapy provides multiple ways to store and export scraped data, giving us the flexibility to choose the most appropriate method for our particular situation.&lt;/p&gt;

&lt;p&gt;Now that we have a better understanding of what is possible with Scrapy, let's explore how we can use this framework to extract data from real websites. We'll do this by building a few small projects, each showcasing a different Scrapy feature.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;🛠 Project: Building a Hacker News Scraper using a basic Spider&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this section, we will learn how to set up a Scrapy project and create a &lt;a href="https://docs.scrapy.org/en/latest/topics/spiders.html?ref=blog.apify.com#scrapy-spider" rel="noopener noreferrer"&gt;basic Spider&lt;/a&gt; to scrape the title, author, URL, and points of all articles displayed on the first page of the Hacker News website.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Creating a Scrapy Project&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before we can generate a Spider, we need to create a new Scrapy project. To do this, we'll use the terminal. Open a terminal window and navigate to the directory where you want to create your project. Start by installing Scrapy:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then run the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject hackernews

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This command will create a new directory called "hackernews" with the basic structure of a Scrapy project.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Creating a Spider&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that we have a Scrapy project set up, we can create a spider to scrape the data we want. In the same terminal window, navigate to the project directory using &lt;code&gt;cd hackernews&lt;/code&gt; and run the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy genspider hackernews_spider news.ycombinator.com

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This command will create a new spider in the &lt;code&gt;spiders&lt;/code&gt; directory of our project. We named the spider &lt;code&gt;hackernews_spider&lt;/code&gt; and set the start URL to &lt;a href="http://news.ycombinator.com" rel="noopener noreferrer"&gt;&lt;code&gt;news.ycombinator.com&lt;/code&gt;&lt;/a&gt;, which is our target website.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Writing the Spider Code&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Next, let's open the &lt;code&gt;hackernews_spider.py&lt;/code&gt; file in the &lt;code&gt;spiders&lt;/code&gt; directory of our project. We'll see a basic template for a Scrapy Spider.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsSpiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hackernews_spider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news.ycombinator.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://news.ycombinator.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Before we move on, let's quickly break down what we're seeing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;name&lt;/code&gt; attribute is the name of the Spider.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;allowed_domains&lt;/code&gt; attribute is a list of domains that the Spider is allowed to scrape.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;start_urls&lt;/code&gt; attribute is a list of URLs that the Spider should start scraping from.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;parse&lt;/code&gt; method is the method that Scrapy calls to handle the response from each URL in the &lt;code&gt;start_urls&lt;/code&gt; list.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cool, now for the fun part. Let's add some code to the &lt;code&gt;parse&lt;/code&gt; method to scrape the data we want.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsSpiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hackernews_spider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news.ycombinator.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://news.ycombinator.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tr.athing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rank::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this code, we use the &lt;code&gt;css&lt;/code&gt; method to extract data from the response. We select all the articles on the page using the CSS selector &lt;code&gt;tr.athing&lt;/code&gt;, and then we extract the &lt;strong&gt;title&lt;/strong&gt; , &lt;strong&gt;URL&lt;/strong&gt; , and &lt;strong&gt;rank&lt;/strong&gt; for each article using more specific selectors. Finally, we use the &lt;code&gt;yield&lt;/code&gt; keyword to return a Python dictionary with the scraped data.&lt;/p&gt;
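Since the real output depends on the live page, here is the same yield-a-dict pattern sketched in plain Python with made-up rows. The helper function and sample values are illustrative, not part of the project code; they just show how the rank cleanup and the generator behave:

```python
def parse_rows(rows):
    # Mimics the spider's parse method: yield one dict per article,
    # stripping the trailing dot from rank text such as "1."
    for rank_text, title, url in rows:
        yield {
            "URL": url,
            "title": title,
            "rank": rank_text.replace(".", ""),
        }

# Hypothetical values, shaped like what response.css("...").get() returns
sample = [("1.", "Example post", "https://example.com")]
items = list(parse_rows(sample))
# items[0] is {"URL": "https://example.com", "title": "Example post", "rank": "1"}
```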
&lt;h3&gt;
  
  
  &lt;strong&gt;Running the Hacker News Spider&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now that our Spider is ready, let's run it and see it in action.&lt;/p&gt;

&lt;p&gt;By default, the data is output to the console, but we can also export it to other formats, such as JSON, CSV, or XML, by specifying the output format when running the scraper. To demonstrate that, let's run our Spider and export the extracted data to a JSON file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl hackernews &lt;span class="nt"&gt;-o&lt;/span&gt; hackernews.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will save the data to a file named &lt;code&gt;hackernews.json&lt;/code&gt; in the root directory of the project. You can use the same command to export the data to other formats by replacing the file extension with the desired format (e.g., &lt;code&gt;-o hackernews.csv&lt;/code&gt; for CSV format).&lt;/p&gt;

&lt;p&gt;That's it for running the spider. In the next section, we'll take a look at how we can use Scrapy's CrawlSpider to extract data from all pages on the Hacker News website.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;🛠 Project: Building a Hacker News Scraper using the CrawlSpider&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The previous section demonstrated how to scrape data from a single page using a basic Spider. While it is possible to write code to paginate through the remaining pages and scrape all the articles on HN using the basic Spider, Scrapy offers us a better solution: the &lt;a href="https://docs.scrapy.org/en/latest/topics/spiders.html?ref=blog.apify.com#crawlspider" rel="noopener noreferrer"&gt;CrawlSpider&lt;/a&gt;. So, without further ado, let's jump straight into the code.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Project Setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To start, let's create a new Scrapy project called &lt;code&gt;hackernews_crawlspider&lt;/code&gt; using the following command in your terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject hackernews_crawlspider

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next, let's create a new spider using the CrawlSpider template. The CrawlSpider is a subclass of the Spider class and is designed for recursively following links and scraping data from multiple pages.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy genspider &lt;span class="nt"&gt;-t&lt;/span&gt; crawl hackernews_spider https://news.ycombinator.com/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This command generates a new spider called "hackernews_spider" in the "spiders" directory of your Scrapy project. It also specifies that the spider should use the CrawlSpider template and start by scraping the homepage of Hacker News.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Code&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our goal with this scraper is to extract the same data from each article that we scraped in the previous section: URL, title, and rank. The difference is that now we will define a set of rules for the scraper to follow when crawling through the website. For example, we will define a rule to tell the scraper where it can find the correct links to paginate through the HN content.&lt;/p&gt;

&lt;p&gt;With this in mind, here's what the final code for our use case looks like:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add imports CrawlSpider, Rule and LinkExtractor 👇
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.spiders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rule&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.linkextractors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinkExtractor&lt;/span&gt;

&lt;span class="c1"&gt;# Change the spider from "scrapy.Spider" to "CrawlSpider"
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hackernews&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news.ycombinator.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/news&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;custom_settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DOWNLOAD_DELAY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# Add a 1-second delay between requests
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Define a rule that should be followed by the link extractor. 
&lt;/span&gt;    &lt;span class="c1"&gt;# In this case, Scrapy will follow all the links with the "morelink" class
&lt;/span&gt;    &lt;span class="c1"&gt;# And call the "parse_article" function on every crawled page
&lt;/span&gt;    &lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news\\.ycombinator\\.com/news$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;restrict_css&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.morelink&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# When using the CrawlSpider we cannot use a parse function called "parse".
&lt;/span&gt;    &lt;span class="c1"&gt;# Otherwise, it will override the default function.
&lt;/span&gt;    &lt;span class="c1"&gt;# So, just rename it to something else, for example, "parse_article"
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tr.athing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rank::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now let's break down the code to understand what the CrawlSpider is doing for us in this scenario.&lt;/p&gt;

&lt;p&gt;You may notice that some parts of this code were already generated by the CrawlSpider, while other parts are very similar to what we did when writing the basic Spider.&lt;/p&gt;

&lt;p&gt;The first distinctive piece of code that may catch your attention is the &lt;code&gt;custom_settings&lt;/code&gt; attribute we have included. This adds a 1-second delay between requests. Since we are now sending multiple requests to access different pages on the website, having this additional delay between the requests can be useful in preventing the target website from being overwhelmed with too many requests at once.&lt;/p&gt;
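Beyond a fixed delay, Scrapy also ships an AutoThrottle extension that adapts the delay to how quickly the server responds. As a sketch of a politer settings dict (the setting names come from Scrapy's settings reference; the specific values are illustrative, not from the project above):

```python
# Illustrative per-spider settings: a fixed minimum delay plus Scrapy's
# AutoThrottle extension, which adjusts the delay based on server load.
custom_settings = {
    "DOWNLOAD_DELAY": 1,           # at least 1 second between requests
    "AUTOTHROTTLE_ENABLED": True,  # let Scrapy tune the delay dynamically
    "AUTOTHROTTLE_START_DELAY": 1,
    "AUTOTHROTTLE_MAX_DELAY": 10,
}
```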

&lt;p&gt;Next, we defined a set of rules to follow when crawling the website using the &lt;code&gt;rules&lt;/code&gt; attribute:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news\\.ycombinator\\.com/news$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;restrict_css&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.morelink&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Each rule is defined using the &lt;code&gt;Rule&lt;/code&gt; class. Here, each &lt;code&gt;Rule&lt;/code&gt; receives a &lt;code&gt;LinkExtractor&lt;/code&gt; instance that defines which links to follow, along with a callback function that will be called to process the response from each crawled page. In this case, we have two rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The first rule&lt;/strong&gt; uses a &lt;code&gt;LinkExtractor&lt;/code&gt; instance with an &lt;code&gt;allow&lt;/code&gt; parameter that matches URLs that end with "&lt;a href="http://news.ycombinator.com/news" rel="noopener noreferrer"&gt;news.ycombinator.com/news&lt;/a&gt;". This will match the first page of news articles on Hacker News. We set the &lt;code&gt;callback&lt;/code&gt; parameter to &lt;code&gt;parse_article&lt;/code&gt;, which is the function that will be called to process the response from each page that matches this rule.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The second rule&lt;/strong&gt; uses a &lt;code&gt;LinkExtractor&lt;/code&gt; instance with a &lt;code&gt;restrict_css&lt;/code&gt; parameter that matches the &lt;code&gt;morelink&lt;/code&gt; class. This will match the "More" link at the bottom of each page of news articles on Hacker News. Again, we set the &lt;code&gt;callback&lt;/code&gt; parameter to &lt;code&gt;parse_article&lt;/code&gt; and the &lt;code&gt;follow&lt;/code&gt; parameter to &lt;code&gt;True&lt;/code&gt;, which tells Scrapy to follow links on this page that match the provided selector.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
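Before running the spider, you can sanity-check the intended &lt;code&gt;allow&lt;/code&gt; pattern of the first rule with Python's &lt;code&gt;re&lt;/code&gt; module (note the single backslashes: in a raw string, &lt;code&gt;\.&lt;/code&gt; matches a literal dot):

```python
import re

# The pattern the first Rule's LinkExtractor is meant to use:
# match URLs ending in "news.ycombinator.com/news"
pattern = re.compile(r"news\.ycombinator\.com/news$")

front_page = pattern.search("https://news.ycombinator.com/news")  # matches
other_page = pattern.search("https://news.ycombinator.com/newest")  # no match
```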

&lt;p&gt;Finally, we defined the &lt;code&gt;parse_article&lt;/code&gt; function, which takes a &lt;code&gt;response&lt;/code&gt; object as its argument. This function is called to process the response from each page that matches one of the rules defined in the &lt;code&gt;rules&lt;/code&gt; attribute.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tr.athing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rank::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this function, we use the &lt;code&gt;response.css&lt;/code&gt; method to extract data from the HTML of the page. Specifically, we look for all &lt;code&gt;tr&lt;/code&gt; elements with the &lt;code&gt;athing&lt;/code&gt; class and extract the URL, title, and rank of each article. We then use the &lt;code&gt;yield&lt;/code&gt; keyword to return a Python dictionary with this data.&lt;/p&gt;

&lt;p&gt;Remember that &lt;code&gt;yield&lt;/code&gt; is used instead of &lt;code&gt;return&lt;/code&gt; because &lt;code&gt;parse_article&lt;/code&gt; is a generator: it can produce many items from a single response, and Scrapy consumes and exports each item as soon as it is produced.&lt;/p&gt;
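The practical difference between returning a list and yielding items can be sketched in plain Python (a toy example, not project code):

```python
def as_list(n):
    # Builds the whole result up front; nothing is available until it finishes
    return [{"rank": str(i)} for i in range(1, n + 1)]

def as_generator(n):
    # Yields each item immediately, which is how Scrapy callbacks work:
    # every scraped item can flow into the export pipeline right away
    for i in range(1, n + 1):
        yield {"rank": str(i)}

gen = as_generator(3)
first = next(gen)  # the first item is available before the rest are computed
```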

&lt;p&gt;It's also worth noting that we've named the function &lt;code&gt;parse_article&lt;/code&gt; instead of the default &lt;code&gt;parse&lt;/code&gt; function that's used in Scrapy Spiders. This is because when you use the &lt;code&gt;CrawlSpider&lt;/code&gt; class, the default &lt;code&gt;parse&lt;/code&gt; function is used to parse the response from the first page that's crawled. If you define your own &lt;code&gt;parse&lt;/code&gt; function in a &lt;code&gt;CrawlSpider&lt;/code&gt;, it will override the default function, and your spider will not work as expected.&lt;/p&gt;

&lt;p&gt;To avoid this problem, it's considered good practice to always name our custom parsing functions something other than &lt;code&gt;parse&lt;/code&gt;. In this case, we've named our function &lt;code&gt;parse_article&lt;/code&gt;, but you could choose any other name that makes sense for your Spider.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Running the CrawlSpider&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Great, now that we understand what's happening in our code, it's time to put our spider to the test by running it with the following command:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl hackernews &lt;span class="nt"&gt;-o&lt;/span&gt; hackernews.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This will start the spider and scrape data from all the news items on all pages of the Hacker News website. The &lt;code&gt;-o&lt;/code&gt; flag again tells Scrapy to write the scraped data to a JSON file, which makes it easier to inspect the results.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;🕸 How to scrape JavaScript-heavy websites&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Scraping JavaScript-heavy websites can be a challenge with Scrapy alone since Scrapy is primarily designed to scrape static HTML pages. However, we can work around this limitation by using a headless browser like Playwright in conjunction with Scrapy to scrape dynamic web pages.&lt;/p&gt;

&lt;p&gt;Playwright is a library that provides a high-level API to control headless Chromium, Firefox, and WebKit. By using Playwright, we can programmatically interact with our target web page to simulate user actions and extract data from dynamically loaded elements.&lt;/p&gt;

&lt;p&gt;To use Playwright with Scrapy, we have to create a custom middleware that initializes a Playwright browser instance and retrieves the HTML content of a web page using Playwright. The middleware can then pass the HTML content to Scrapy for parsing and extraction of data.&lt;/p&gt;

&lt;p&gt;Luckily, the &lt;a href="https://github.com/scrapy-plugins/scrapy-playwright?ref=blog.apify.com" rel="noopener noreferrer"&gt;scrapy-playwright&lt;/a&gt; library lets us easily integrate Playwright with Scrapy. In the next section, we will build a small project using this combo to extract data from a JavaScript-heavy website, Mint Mobile. But before we move on, let's first take a quick look at the target webpage and understand why we wouldn't be able to extract the data we want with Scrapy alone.&lt;/p&gt;

&lt;p&gt;Mint Mobile requires JavaScript to load a considerable part of the content displayed on its product page, which makes it an ideal scenario for using Playwright in the context of web scraping:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mint Mobile product page with JavaScript &lt;em&gt;disabled&lt;/em&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmfjzybwhx7m1stiuzo9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmfjzybwhx7m1stiuzo9.png" alt="https://blog.apify.com/content/images/2022/12/Google_Pixel_7_Pro_Bundle___Mint_Mobile.png" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mint Mobile product page with JavaScript &lt;em&gt;enabled&lt;/em&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5ksc64u8se30m9leetw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5ksc64u8se30m9leetw.png" alt="https://blog.apify.com/content/images/2022/12/Google_Pixel_7_Pro_Bundle___Mint_Mobile-1.png" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, without JavaScript enabled, we would lose a significant portion of the data we want to extract. Since Scrapy cannot load JavaScript, you could think of the first image, with JavaScript disabled, as the "Scrapy view," while the second image, with JavaScript enabled, would be the "Playwright view."&lt;/p&gt;

&lt;p&gt;Cool, now that we know why we need a browser automation library like Playwright to scrape this page, it is time to translate this knowledge into code by building our next project: the Mint Mobile scraper.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;🛠 Project: Building a web scraper using Scrapy and Playwright&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this project, we'll scrape a specific product page from the Mint Mobile website: &lt;a href="https://www.mintmobile.com/product/google-pixel-7-pro-bundle/" rel="noopener noreferrer"&gt;https://www.mintmobile.com/product/google-pixel-7-pro-bundle/&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Project setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We start by creating a directory to house our project and installing the necessary dependencies:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create new directory and move into it&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;scrapy-playwright
&lt;span class="nb"&gt;cd &lt;/span&gt;scrapy-playwright

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Scrapy and scrapy-playwright&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy scrapy-playwright

&lt;span class="c"&gt;# Install the required browsers if you are running Playwright for the first time&lt;/span&gt;
playwright &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next, we start the Scrapy project and generate a spider:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject scrapy_playwright_project
scrapy genspider mintmobile https://www.mintmobile.com/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now let's activate &lt;code&gt;scrapy-playwright&lt;/code&gt; by adding a few lines of configuration to the &lt;code&gt;DOWNLOAD_HANDLERS&lt;/code&gt; setting in our project's &lt;code&gt;settings.py&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scrapy-playwright configuration
&lt;/span&gt;
&lt;span class="n"&gt;DOWNLOAD_HANDLERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
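&lt;p&gt;Depending on your scrapy-playwright version, the library also expects Scrapy to run on the asyncio-based Twisted reactor. If you hit reactor-related errors, adding the following setting should resolve them (check the scrapy-playwright README for the exact requirements of your version):&lt;/p&gt;

```python
# settings.py
# scrapy-playwright needs the asyncio reactor so Playwright's async API
# can run inside Scrapy's Twisted event loop.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```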


&lt;p&gt;Great! We're now ready to write some code to scrape our target website.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Code&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy_playwright.page&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PageMethod&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MintmobileSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mintmobile&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://www.mintmobile.com/product/google-pixel-7-pro-bundle/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="c1"&gt;# Use Playwright
&lt;/span&gt;            &lt;span class="n"&gt;playwright&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Keep the page object so we can work with it later on
&lt;/span&gt;            &lt;span class="n"&gt;playwright_include_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Use PageMethods to wait for the content we want to scrape to be properly loaded before extracting the data
&lt;/span&gt;            &lt;span class="n"&gt;playwright_page_methods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="nc"&gt;PageMethod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wait_for_selector&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;div.m-productCard--device&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.m-productCard__heading h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.composited_product_details_wrapper &amp;gt; div &amp;gt; div &amp;gt; div:nth-child(2) &amp;gt; div.label &amp;gt; span::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pay_monthly_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.composite_price_monthly &amp;gt; span::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pay_today_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.composite_price p.price span.amount::attr(aria-label)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In the &lt;code&gt;start_requests&lt;/code&gt; method, the spider makes a single HTTP request to the mobile phone product page on the Mint Mobile website. We initialize this request using the &lt;code&gt;scrapy.Request&lt;/code&gt; class, passing a &lt;code&gt;meta&lt;/code&gt; dictionary with the options Playwright should use when scraping the page: &lt;code&gt;playwright&lt;/code&gt; set to &lt;code&gt;True&lt;/code&gt; to indicate that Playwright should handle the request, &lt;code&gt;playwright_include_page&lt;/code&gt; set to &lt;code&gt;True&lt;/code&gt; so the page object is kept for later use, and &lt;code&gt;playwright_page_methods&lt;/code&gt; set to a list of &lt;code&gt;PageMethod&lt;/code&gt; objects.&lt;/p&gt;

&lt;p&gt;In this case, there's only one &lt;code&gt;PageMethod&lt;/code&gt; object, which uses Playwright's &lt;code&gt;wait_for_selector&lt;/code&gt; method to wait for a specific CSS selector to appear on the page. This ensures the page has fully loaded before we start extracting its data.&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;parse&lt;/code&gt; method, the spider uses CSS selectors to extract data from the page. Four pieces of data are extracted: the &lt;code&gt;name&lt;/code&gt; of the product, its &lt;code&gt;memory&lt;/code&gt; capacity, the &lt;code&gt;pay_monthly_price&lt;/code&gt;, as well as the &lt;code&gt;pay_today_price&lt;/code&gt;.&lt;/p&gt;
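&lt;p&gt;One caveat: chained calls like &lt;code&gt;.get().strip()&lt;/code&gt; raise an &lt;code&gt;AttributeError&lt;/code&gt; whenever a selector matches nothing, since &lt;code&gt;.get()&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt;. A tiny helper, our own addition rather than part of the spider above, makes the extraction more forgiving:&lt;/p&gt;

```python
def clean_text(value, default=""):
    """Strip whitespace from a selector result, tolerating None.

    `value` is whatever `response.css(...).get()` returned; when the
    selector matched nothing, fall back to `default` instead of crashing.
    """
    if value is None:
        return default
    return value.strip()

# Usage inside parse(), for example:
#   "name": clean_text(response.css("div.m-productCard__heading h1::text").get()),
```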
&lt;h3&gt;
  
  
  &lt;strong&gt;Expected output:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, let's run our spider using the command &lt;code&gt;scrapy crawl mintmobile -o data.json&lt;/code&gt; to scrape the target data and store it in a &lt;code&gt;data.json&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
    {
        "name": "Google Pixel 7 Pro",
        "memory": "128GB",
        "pay_monthly_price": "50",
        "pay_today_price": "589"
    }
]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
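&lt;p&gt;Note that the prices come back as strings. If you plan to do any arithmetic on them downstream, a small normalization step helps; the function below is an illustrative sketch, not part of the spider:&lt;/p&gt;

```python
def parse_price(text):
    """Convert a scraped price string such as "$1,299" or "589" to a float."""
    # Drop currency symbols and thousands separators before converting
    cleaned = text.replace("$", "").replace(",", "").strip()
    return float(cleaned)
```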

&lt;h2&gt;
  
  
  &lt;strong&gt;Deploying Scrapy spiders to the cloud&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Next, we'll learn how to deploy Scrapy spiders to the cloud using Apify. This allows us to configure them to run on a schedule and access many other features of the platform.&lt;/p&gt;

&lt;p&gt;To demonstrate this, we'll use the &lt;a href="https://docs.apify.com/sdk/python/?ref=blog.apify.com" rel="noopener noreferrer"&gt;Apify SDK for Python&lt;/a&gt; and select the Scrapy development template to help us kickstart the setup process. We'll then modify the generated boilerplate code to run our CrawlSpider-based Hacker News scraper. Let's get started.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Installing the Apify CLI&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To start working with the Apify CLI, we first need to install it. There are two ways to do this: via the Homebrew package manager on macOS or Linux, or via NPM, the Node.js package manager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Via Homebrew&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On macOS (or Linux), you can install the Apify CLI via the &lt;a href="https://brew.sh/?ref=blog.apify.com" rel="noopener noreferrer"&gt;Homebrew package manager&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;apify/tap/apify-cli

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Via NPM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Install or upgrade the Apify CLI by running:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="nb"&gt;install &lt;/span&gt;apify-cli

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Creating a new Actor&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once you have the Apify CLI installed on your computer, simply run the following command in the terminal:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apify create scrapy-actor

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then, go ahead and select the &lt;strong&gt;Python Scrapy&lt;/strong&gt; template:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdx5ucznn5iq0zf2ujhg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdx5ucznn5iq0zf2ujhg5.png" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This command will create a new folder named &lt;code&gt;scrapy-actor&lt;/code&gt;, install all the necessary dependencies, and generate boilerplate code that we can use to kickstart our development with Scrapy and the &lt;a href="https://docs.apify.com/sdk/python/?ref=blog.apify.com" rel="noopener noreferrer"&gt;Apify SDK for Python&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, move into the newly created folder and open it in your preferred code editor; in this example, I'm using VS Code.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;scrapy-actor
code &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Configuring the Scrapy Actor template&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The template already creates a fully functional scraper. You can run it using the command &lt;code&gt;apify run&lt;/code&gt;. If you'd like to try it before we modify the code, the scraped results will be stored under &lt;code&gt;storage/datasets&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now that we're familiar with the template, we can modify it to accommodate our Hacker News scraper.&lt;/p&gt;

&lt;p&gt;To make our first adjustment, we need to replace the template code in &lt;code&gt;src/spiders/title_spider.py&lt;/code&gt; with our own code. After the replacement, your code should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnd82rdp8vqdqlghlonh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnd82rdp8vqdqlghlonh3.png" width="800" height="405"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add imports CrawlSpider, Rule and LinkExtractor 👇
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.spiders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rule&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.linkextractors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinkExtractor&lt;/span&gt;

&lt;span class="c1"&gt;# Change the spider from "scrapy.Spider" to "CrawlSpider"
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HackernewsSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hackernews&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news.ycombinator.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/news&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;custom_settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DOWNLOAD_DELAY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;# Add a 1-second delay between requests
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Define a rule that should be followed by the link extractor. 
&lt;/span&gt;    &lt;span class="c1"&gt;# In this case, Scrapy will follow all the links with the "morelink" class
&lt;/span&gt;    &lt;span class="c1"&gt;# And call the "parse_article" function on every crawled page
&lt;/span&gt;    &lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;allow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news\\.ycombinator\\.com/news$&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;restrict_css&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.morelink&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# When using the CrawlSpider we cannot use a parse function called "parse".
&lt;/span&gt;    &lt;span class="c1"&gt;# Otherwise, it will override the default function.
&lt;/span&gt;    &lt;span class="c1"&gt;# So, just rename it to something else, for example, "parse_article"
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tr.athing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rank::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, before running the Actor, we need to make some adjustments to the &lt;code&gt;main.py&lt;/code&gt; file to align it with the modifications we made to the original spider template.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k38m4yoizr5pfkaa0bi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k38m4yoizr5pfkaa0bi.png" width="800" height="322"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.crawler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrawlerProcess&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.utils.project&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_project_settings&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apify&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.pipelines&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ActorDatasetPushPipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.spiders.hackernews_spider&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HackernewsSpider&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;actor_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_input&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="n"&gt;max_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_depth&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start_url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;start_url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;actor_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_urls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/news&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}])]&lt;/span&gt;

        &lt;span class="n"&gt;settings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_project_settings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ITEM_PIPELINES&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;ActorDatasetPushPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DEPTH_LIMIT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_depth&lt;/span&gt;

        &lt;span class="n"&gt;process&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CrawlerProcess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;install_root_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# If you want to run multiple spiders, call `process.crawl` for each of them here
&lt;/span&gt;        &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HackernewsSpider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_urls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;start_urls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Running the Actor locally&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Great! Now we're ready to run our Scrapy Actor. To do so, let's type the command &lt;code&gt;apify run&lt;/code&gt; in our terminal. After a few seconds, the &lt;code&gt;storage/datasets&lt;/code&gt; folder will be populated with the scraped data from Hacker News.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzttvdyjutkpxgnvaudaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzttvdyjutkpxgnvaudaz.png" width="800" height="475"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Deploying the Actor to Apify&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before deploying the Actor to Apify, we need to make one final adjustment. Go to &lt;code&gt;.actor/input_schema.json&lt;/code&gt; and change the &lt;strong&gt;prefill&lt;/strong&gt; URL to &lt;a href="https://news.ycombinator.com/news" rel="noopener noreferrer"&gt;&lt;code&gt;https://news.ycombinator.com/news&lt;/code&gt;&lt;/a&gt;. This prefill value is what Apify Console shows as the default start URL when the scraper runs on the platform.&lt;/p&gt;
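&lt;p&gt;For reference, the relevant part of the schema might look roughly like this (an abridged, hypothetical sketch; your generated &lt;code&gt;.actor/input_schema.json&lt;/code&gt; will contain additional fields):&lt;/p&gt;

```json
{
    "title": "Scrapy Actor input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "editor": "requestListSources",
            "prefill": [{ "url": "https://news.ycombinator.com/news" }]
        }
    }
}
```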

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9oje4vcoxrg8eu7i75qi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9oje4vcoxrg8eu7i75qi.png" width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we know our Actor works as expected, it's time to deploy it to the Apify platform. You will need to &lt;a href="https://console.apify.com/sign-up?ref=blog.apify.com" rel="noopener noreferrer"&gt;sign up for a free Apify account&lt;/a&gt; to follow along.&lt;/p&gt;

&lt;p&gt;Once you have an Apify account, run the command &lt;code&gt;apify login&lt;/code&gt; in the terminal. You will be prompted to provide your Apify API token, which you can find in Apify Console under &lt;a href="https://console.apify.com/account?ref=blog.apify.com#/integrations" rel="noopener noreferrer"&gt;Settings → Integrations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The final step is to run the &lt;code&gt;apify push&lt;/code&gt; command. This will start an Actor build, and after a few seconds, you should be able to see your newly created Actor in Apify Console under &lt;a href="https://console.apify.com/actors?tab=my&amp;amp;ref=blog.apify.com" rel="noopener noreferrer"&gt;Actors → My actors&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpt24pdz8cfzzo3c15uy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpt24pdz8cfzzo3c15uy.png" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Perfect! Your scraper is ready to run on the Apify platform. To begin, click the &lt;strong&gt;Start&lt;/strong&gt; button. Once the run is finished, you can preview and download your data in multiple formats in the &lt;strong&gt;Storage&lt;/strong&gt; tab.&lt;/p&gt;
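&lt;p&gt;Runs don't have to be started from the Console UI: the platform also exposes a REST endpoint for starting Actor runs. As a minimal sketch (the Actor ID and token below are placeholders), the run URL can be built like this:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Placeholder values -- substitute your own "username~actor-name" ID and API token.
ACTOR_ID = "my-username~my-scrapy-actor"
API_TOKEN = "MY_APIFY_TOKEN"

# POSTing to this endpoint with a valid token starts a run,
# much like pressing the Start button in Apify Console.
run_url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs?" + urlencode({"token": API_TOKEN})
print(run_url)
```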


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/alternatives-scrapy-web-scraping/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2F2024%2F08%2Falternatives_to_scrapy.png" height="449" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/alternatives-scrapy-web-scraping/" rel="noopener noreferrer" class="c-link"&gt;
          Scrapy alternatives in 2025
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          A curated list of libraries for web scraping in Python.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw256h256%2F2025%2F07%2Ffavicon.png" width="48" height="48"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Next steps&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you want to take your web scraping projects to the next level with the Apify SDK for Python and the Apify platform, here are some useful resources that might help you:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;More Python Actor templates&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.apify.com/sdk/python/docs/guides/requests-and-httpx?ref=blog.apify.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Scraping with Requests and HTTPX&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.apify.com/sdk/python/docs/guides/beautiful-soup?ref=blog.apify.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Scraping with BeautifulSoup&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.apify.com/sdk/python/docs/guides/playwright?ref=blog.apify.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Scraping with Playwright&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.apify.com/sdk/python/docs/guides/selenium?ref=blog.apify.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Scraping with Selenium&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Web Scraping Python tutorials&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/web-scraping-python/" rel="noopener noreferrer"&gt;&lt;strong&gt;Web scraping with Python&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/web-scraping-with-beautiful-soup/" rel="noopener noreferrer"&gt;&lt;strong&gt;Web scraping with Beautiful Soup and Requests&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/web-scraping-with-selenium-and-python/" rel="noopener noreferrer"&gt;&lt;strong&gt;Web scraping with Selenium and Python&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Web Scraping community on Discord&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, don't forget to join the &lt;strong&gt;Apify &amp;amp; Crawlee&lt;/strong&gt; community on Discord to connect with other web scraping and automation enthusiasts. 🚀&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://discord.com/invite/jyEM2PRvMU?ref=blog.apify.com" rel="noopener noreferrer" class="c-link"&gt;
          Apify &amp;amp; Crawlee
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          This is the official developer community of Apify and Crawlee. | 11719 members
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdiscord.com%2Fassets%2Ffavicon.ico" width="256" height="256"&gt;
        discord.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>python</category>
      <category>webscraping</category>
      <category>scrapy</category>
    </item>
    <item>
      <title>Web scraping with Beautiful Soup and Requests</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Thu, 30 Mar 2023 13:51:03 +0000</pubDate>
      <link>https://dev.to/apify/web-scraping-with-beautiful-soup-and-requests-1j7o</link>
      <guid>https://dev.to/apify/web-scraping-with-beautiful-soup-and-requests-1j7o</guid>
      <description>&lt;h2&gt;
  
  
  Introduction and requirements
&lt;/h2&gt;

&lt;p&gt;The internet is an endless source of information, and for many data-driven tasks, accessing this information is critical. For this reason, &lt;a href="https://blog.apify.com/what-are-web-crawlers-and-how-do-they-work/#web-scraping" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;, the practice of extracting data from websites, has become an increasingly important tool for machine learning developers, data analysts, researchers, and businesses alike.&lt;/p&gt;

&lt;p&gt;One of the most popular web scraping tools is &lt;a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/?ref=blog.apify.com" rel="noopener noreferrer"&gt;Beautiful Soup&lt;/a&gt;, a Python library that allows you to parse HTML and XML documents. Beautiful Soup makes it easy to extract specific pieces of information from web pages, and it can handle many of the quirks and inconsistencies that come with web scraping.&lt;/p&gt;

&lt;p&gt;Another crucial tool for web scraping is &lt;a href="https://github.com/psf/requests?ref=blog.apify.com" rel="noopener noreferrer"&gt;Requests&lt;/a&gt;, a Python library for making HTTP requests. Requests lets you send HTTP requests with very little code and comes with a range of helpful features, including cookie handling and authentication.&lt;/p&gt;
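&lt;p&gt;For example, a &lt;code&gt;requests.Session&lt;/code&gt; object persists cookies and default headers across requests. A minimal sketch (the header value and credentials below are illustrative, not real):&lt;/p&gt;

```python
import requests

# A Session keeps cookies and default headers between requests,
# which becomes useful once a site expects a realistic User-Agent
# or HTTP Basic credentials.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"})
session.auth = ("user", "secret")  # hypothetical Basic auth credentials

# Any cookies a response sets are replayed automatically on later
# requests made through the same session, e.g.:
# response = session.get("https://example.com/")
```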

&lt;p&gt;In this article, we will explore the basics of web scraping with Beautiful Soup and Requests, covering everything from sending HTTP requests to parsing the resulting HTML and extracting useful data. We will also go over how to handle website pagination to extract data from multiple pages. Finally, we will explore a few tricks we can use to &lt;a href="https://blog.apify.com/what-is-ethical-web-scraping-and-how-do-you-do-it/" rel="noopener noreferrer"&gt;scrape the web ethically&lt;/a&gt; while avoiding getting our scrapers blocked by modern &lt;a href="https://blog.apify.com/bypass-antiscraping-protections/" rel="noopener noreferrer"&gt;anti-bot protections&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To demonstrate all of that, we will build a &lt;a href="https://news.ycombinator.com/?ref=blog.apify.com" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt; scraper using the Requests and Beautiful Soup Python libraries to extract the &lt;strong&gt;rank&lt;/strong&gt;, &lt;strong&gt;URL&lt;/strong&gt;, and &lt;strong&gt;title&lt;/strong&gt; from all articles posted on HN. So, without further ado, let's start coding!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsn7h66y4gwhauazobxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsn7h66y4gwhauazobxd.png" alt="https://blog.apify.com/content/images/2023/01/Hacker_News.png" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Initial setup
&lt;/h2&gt;

&lt;p&gt;First, let's create a new directory &lt;code&gt;hacker-news-scraper&lt;/code&gt; to house our scraper, then move into it and create a new file named &lt;code&gt;main.py&lt;/code&gt;. We can either do this manually or straight from the terminal with the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mkdir&lt;/span&gt; &lt;span class="n"&gt;hacker&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;scraper&lt;/span&gt;

&lt;span class="n"&gt;cd&lt;/span&gt; &lt;span class="n"&gt;hacker&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;news&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;scraper&lt;/span&gt;

&lt;span class="n"&gt;touch&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Still in the terminal, let's use pip to install Requests and Beautiful Soup. Finally, we can open our project in our code editor of choice. Since I'm using VS Code, I will run the command &lt;code&gt;code .&lt;/code&gt; to open the current directory in it.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt; &lt;span class="n"&gt;beautifulsoup4&lt;/span&gt;

&lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  How to make an HTTP GET request with Requests
&lt;/h2&gt;

&lt;p&gt;In the &lt;a href="http://main.py" rel="noopener noreferrer"&gt;&lt;code&gt;main.py&lt;/code&gt;&lt;/a&gt; file, we will use Requests to make a GET request to our target website and save the obtained HTML code of the page to a variable named &lt;code&gt;html&lt;/code&gt; and log it to the console.&lt;/p&gt;
&lt;h3&gt;
  
  
  Code
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Output
&lt;/h3&gt;

&lt;p&gt;And here is the result we expect to see after running our script:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw5gfyaanoduhzo338ts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faw5gfyaanoduhzo338ts.png" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Great! Now that we can retrieve the page's HTML, it's time to use Beautiful Soup to parse it and extract the specific data we want.&lt;/p&gt;
&lt;h2&gt;
  
  
  Parsing the data with Beautiful Soup
&lt;/h2&gt;

&lt;p&gt;Next, let's use Beautiful Soup to parse the HTML data and scrape the contents from all the articles on the first page of &lt;a href="https://news.ycombinator.com/news?ref=blog.apify.com" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Before we select an element, let's use the &lt;a href="https://developers.apify.com/academy/web-scraping-for-beginners/data-collection/browser-devtools?ref=blog.apify.com" rel="noopener noreferrer"&gt;developer tools&lt;/a&gt; to inspect the page and find what selectors we need to use to target the data we want to extract.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4ck8s5vq7a7ei6ei29q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4ck8s5vq7a7ei6ei29q.png" alt="https://blog.apify.com/content/images/2023/01/Fullscreen_1_12_23__1_16_PM.png" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When analyzing the website's structure, we can find each article's &lt;strong&gt;rank&lt;/strong&gt; and &lt;strong&gt;title&lt;/strong&gt; by selecting the element containing the class &lt;code&gt;athing&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Traversing the DOM with the BeautifulSoup &lt;em&gt;find&lt;/em&gt; method
&lt;/h2&gt;

&lt;p&gt;Next, let's use Beautiful Soup's &lt;code&gt;find_all&lt;/code&gt; method to select all elements containing the &lt;code&gt;athing&lt;/code&gt; class and save them to a variable named &lt;code&gt;articles&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next, to verify we have successfully selected the correct elements, let's loop through each article and print its text contents to the console.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Loop through the selected elements
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Log each article's text content to the console
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Great! We've managed to access each element's &lt;strong&gt;rank&lt;/strong&gt; and &lt;strong&gt;title&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the next step, we will use BeautifulSoup's &lt;code&gt;find&lt;/code&gt; method to grab the specific values we want to extract and organize the obtained data in a Python dictionary.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;find&lt;/code&gt; method searches an element's descendants and returns the first one that matches the given filter (or &lt;code&gt;None&lt;/code&gt; if nothing matches).&lt;/p&gt;

&lt;p&gt;In the context of our scraper, we can use &lt;code&gt;find&lt;/code&gt; to select specific descendants of each &lt;code&gt;article&lt;/code&gt; element.&lt;/p&gt;
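&lt;p&gt;To see the difference on a toy example: &lt;code&gt;find&lt;/code&gt; returns only the first matching descendant, while &lt;code&gt;find_all&lt;/code&gt; returns all of them. A minimal sketch on a made-up row (the markup is written with square brackets and converted to real tags at runtime):&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# A tiny stand-in for one Hacker News row; square brackets are
# swapped for real angle brackets (chr(60)/chr(62)) at runtime.
sample = (
    "[tr class='athing']"
    "[span class='rank']1.[/span]"
    "[span class='titleline'][a href='https://example.com/']Example[/a][/span]"
    "[/tr]"
).replace("[", chr(60)).replace("]", chr(62))

soup = BeautifulSoup(sample, "html.parser")
row = soup.find(class_="athing")                 # first (and only) matching row
title_link = row.find(class_="titleline").find("a")

print(title_link.get_text())                     # Example
print(title_link.get("href"))                    # https://example.com/
print(row.find(class_="rank").get_text())        # 1.
```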

&lt;p&gt;Returning to the Hacker News website, we can find the selectors we need to extract our target data.&lt;/p&gt;

&lt;p&gt;Here's what our code looks like using the &lt;code&gt;find&lt;/code&gt; method to get each article's &lt;strong&gt;URL&lt;/strong&gt;, &lt;strong&gt;title&lt;/strong&gt;, and &lt;strong&gt;rank&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, to make the data more presentable, let's use the &lt;code&gt;json&lt;/code&gt; library to save our output to a JSON file. Here is what our code looks like:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Extract data from each article on the page
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save scraped data
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Saving output data to JSON file.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;save_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hn_data.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;save_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
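&lt;p&gt;As a side note, the same save step can be written with a context manager, which guarantees the file is closed even if &lt;code&gt;json.dump&lt;/code&gt; raises an exception midway. A minimal sketch, using a small hard-coded sample in place of the scraped data:&lt;/p&gt;

```python
import json

# Sample data in the same shape our scraper produces
output = [{"URL": "https://example.com", "title": "Example", "rank": "1"}]

# "with" closes the file automatically, even if an error occurs mid-write
with open("hn_data.json", "w", encoding="utf-8") as save_output:
    json.dump(output, save_output, indent=6, ensure_ascii=False)
```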


&lt;p&gt;Great! We've just scraped information from all the articles displayed on the first page of Hacker News using Requests and Beautiful Soup. However, it would be even better if we could get the data from all articles on Hacker News, right?&lt;/p&gt;

&lt;p&gt;Now that we know how to get the data from one page, we just have to apply this same logic to all the remaining pages of the website. So, in the next section, we will handle the website's pagination.&lt;/p&gt;
&lt;h2&gt;
  
  
  Handling Pagination
&lt;/h2&gt;

&lt;p&gt;The concept of handling pagination in web scraping is quite straightforward. In short, we need to make our scraper repeat its scraping logic for each page visited until no more pages are left. To do that, we have to find a way to identify when the scraper reaches the last page, so that it can stop scraping and save our extracted data.&lt;/p&gt;

&lt;p&gt;So, let's start by initializing three variables: &lt;code&gt;scraping_hn&lt;/code&gt;, &lt;code&gt;page&lt;/code&gt;, and &lt;code&gt;output&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;scraping_hn&lt;/code&gt; is a Boolean variable that keeps track of whether the script has reached the last page of the website.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;page&lt;/code&gt; is an integer variable that keeps track of the current page number being scraped.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;output&lt;/code&gt; is an empty list that will be populated with the scraped data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, let's create a &lt;code&gt;while&lt;/code&gt; loop that continues scraping until the scraper reaches the last page. Within the loop, we will send a GET request to the current page of Hacker News, extract the &lt;strong&gt;URL, title,&lt;/strong&gt; and &lt;strong&gt;rank&lt;/strong&gt; of each article, and store the data in a dictionary with the keys &lt;strong&gt;"URL"&lt;/strong&gt;, &lt;strong&gt;"title"&lt;/strong&gt;, and &lt;strong&gt;"rank"&lt;/strong&gt;. We will then append the dictionary to the &lt;code&gt;output&lt;/code&gt; list.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Starting Hacker News Scraper...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Continue scraping until the scraper reaches the last page
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;scraping_hn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/?p=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Scraping &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract data from each article on the page
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After extracting data from all articles on the page, we will write an &lt;em&gt;if statement&lt;/em&gt; to check whether there is a &lt;strong&gt;More&lt;/strong&gt; button with the class &lt;code&gt;morelink&lt;/code&gt; on the page. We will check for this particular element because the &lt;strong&gt;More&lt;/strong&gt; button is present on all pages, except the last one.&lt;/p&gt;

&lt;p&gt;So, if the &lt;code&gt;morelink&lt;/code&gt; class is present, the script increments the page variable and continues scraping the next page. If there is no &lt;code&gt;morelink&lt;/code&gt; class, the script sets &lt;code&gt;scraping_hn&lt;/code&gt; to &lt;code&gt;False&lt;/code&gt; and exits the loop.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check if the scraper reached the last page
&lt;/span&gt;    &lt;span class="n"&gt;next_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;morelink&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next_page&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Finished scraping! Scraped a total of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; items.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Putting it all together, here is the code we have so far:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Starting Hacker News Scraper...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Continue scraping until the scraper reaches the last page
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;scraping_hn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/?p=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Scraping &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract data from each article on the page
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if the scraper reached the last page
&lt;/span&gt;    &lt;span class="n"&gt;next_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;morelink&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next_page&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Finished scraping! Scraped a total of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; items.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save scraped data
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Saving output data to JSON file.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;save_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hn_data.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;save_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In conclusion, our script successfully accomplished its goal of extracting data from all articles on Hacker News by using &lt;code&gt;Requests&lt;/code&gt; and &lt;code&gt;BeautifulSoup&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;However, it is important to note that not all websites will be as simple to scrape as Hacker News. Most modern webpages have a variety of anti-bot protections in place to prevent malicious bots from overloading their servers with requests.&lt;/p&gt;

&lt;p&gt;In our situation, we are simply automating a data collection process without any malicious intent against the target website. So, in the next section, we will talk about what measures we can use to reduce the likelihood of our scrapers getting blocked.&lt;/p&gt;
&lt;h2&gt;
  
  
  Avoid being blocked with Requests
&lt;/h2&gt;

&lt;p&gt;Hacker News is a simple website without any aggressive anti-bot protections in place, so we were able to scrape it without running into any major blocking issues.&lt;/p&gt;

&lt;p&gt;Complex websites might employ different techniques to detect and block bots, such as analyzing the data encoded in HTTP requests received by the server, fingerprinting, CAPTCHAS, and more.&lt;/p&gt;

&lt;p&gt;Avoiding all types of blocking can be a very challenging task, and its difficulty varies according to your target website and the scale of your scraping activities.&lt;/p&gt;

&lt;p&gt;Nevertheless, there are some simple techniques, like passing the correct &lt;code&gt;User-Agent&lt;/code&gt; header, that can already help our scrapers pass basic website verifications.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is the User-Agent header?
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;User-Agent&lt;/code&gt; header informs the server about the operating system, vendor, and version of the requesting client. This is relevant because any inconsistencies in the information the website receives may alert it about suspicious bot-like activity, leading to our scrapers getting blocked.&lt;/p&gt;

&lt;p&gt;One of the ways we can avoid this is by passing custom headers to the HTTP request we made earlier using Requests, thus ensuring that the &lt;code&gt;User-Agent&lt;/code&gt; used matches the one from the machine sending the request.&lt;/p&gt;

&lt;p&gt;You can check your own &lt;code&gt;User-Agent&lt;/code&gt; by accessing the &lt;a href="http://whatsmyuseragent.org/" rel="noopener noreferrer"&gt;http://whatsmyuseragent.org/&lt;/a&gt; website. For example, this is my computer's &lt;code&gt;User-Agent&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxweg37lo87nj1pi685tb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxweg37lo87nj1pi685tb.png" alt="https://blog.apify.com/content/images/2023/01/What_s_my_User_Agent_.png" width="800" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this information, we can now pass the &lt;code&gt;User-Agent&lt;/code&gt; header to our Requests HTTP request.&lt;/p&gt;
&lt;h3&gt;
  
  
  How to use the User-Agent header in Requests
&lt;/h3&gt;

&lt;p&gt;In order to verify that Requests is indeed sending the specified headers, let's create a new file named &lt;code&gt;headers-test.py&lt;/code&gt; and send a request to the website &lt;a href="https://httpbin.org/" rel="noopener noreferrer"&gt;https://httpbin.org/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To send custom headers using Requests, we will pass a &lt;code&gt;headers&lt;/code&gt; parameter to the request method:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://httpbin.org/headers&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After running the &lt;code&gt;python3 headers-test.py&lt;/code&gt; command, we can expect to see our request headers printed to the console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1fusfpy25bxdqonm058.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1fusfpy25bxdqonm058.png" width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can verify by checking the &lt;code&gt;User-Agent&lt;/code&gt;, Requests used the custom headers we passed as a parameter to the request.&lt;/p&gt;

&lt;p&gt;In contrast, this is how the &lt;code&gt;User-Agent&lt;/code&gt; for the same request would look if we didn't pass any custom headers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuaph5l9soxpoj30j0fh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbuaph5l9soxpoj30j0fh.png" width="800" height="130"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cool, now that we know how to properly pass custom headers to a Requests HTTP request, we can implement the same logic in our Hacker News scraper.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw1200%2F2024%2F03%2FAvoid-getting-blocked.png" height="449" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer" class="c-link"&gt;
          21 tips on how to crawl a website without getting blocked
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Guide on how to solve or avoid anti-scraping protections.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.apify.com%2Fcontent%2Fimages%2Fsize%2Fw256h256%2F2025%2F07%2Ffavicon.png" width="48" height="48"&gt;
        blog.apify.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



&lt;h3&gt;
  
  
  Required headers, cookies, and tokens
&lt;/h3&gt;

&lt;p&gt;Setting the proper &lt;code&gt;User-Agent&lt;/code&gt; header will help you avoid blocking, but it is not enough to overcome the more sophisticated anti-bot systems used by modern websites.&lt;/p&gt;

&lt;p&gt;There are many other types of information, such as additional headers, cookies, and access tokens, that we might be required to send with our request in order to get to the data we want. If you want to know more about the topic, check out the &lt;a href="https://developers.apify.com/academy/api-scraping/general-api-scraping/cookies-headers-tokens?ref=blog.apify.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Dealing with headers, cookies, and tokens&lt;/strong&gt;&lt;/a&gt; section of the Apify Web Scraping Academy.&lt;/p&gt;
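As a quick illustration (the token and cookie values below are placeholders, not anything a real site expects), a `requests.Session` lets us attach extra headers, cookies, and an access token once, after which they are sent with every request made through that session:

```python
import requests

session = requests.Session()

# Placeholder values for illustration only -- a real site would issue its own
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Authorization": "Bearer YOUR_ACCESS_TOKEN",
})
session.cookies.set("session_id", "YOUR_SESSION_COOKIE")

# Every request made through this session now carries the headers
# and cookies configured above, e.g.:
# response = session.get("https://example.com/protected-page")
print(session.headers["Authorization"])
```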

&lt;h3&gt;
  
  
  Restricting the number of requests sent to the server
&lt;/h3&gt;

&lt;p&gt;Another common strategy employed by anti-scraping protections is to monitor the frequency of requests sent to the server. If too many requests are sent in a short period of time, the server may flag the IP address of the scraper and block further requests from that address.&lt;/p&gt;

&lt;p&gt;An easy way to work around this limitation is to introduce a time delay between requests, giving the server enough time to process the previous request and respond before the next request is sent.&lt;/p&gt;

&lt;p&gt;To do that, we can call the &lt;code&gt;time.sleep()&lt;/code&gt; function before each HTTP request to slow down the frequency of requests to the server. This reduces the chances of being blocked by anti-scraping protections and lets our script scrape the website's data more reliably.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Wait before each request to avoid overloading the server
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://news.ycombinator.com/?p=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Final code
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Starting Hacker News Scraper...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Continue scraping until the scraper reaches the last page
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;scraping_hn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Wait before each request to avoid overloading the server
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://news.ycombinator.com/?p=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Scraping &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Use Beautiful Soup to parse the HTML
&lt;/span&gt;    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;athing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract data from each article on the page
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;titleline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getText&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if the scraper reached the last page
&lt;/span&gt;    &lt;span class="n"&gt;next_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;morelink&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next_page&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;scraping_hn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Finished scraping! Scraped a total of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; items.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save scraped data
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Saving output data to JSON file.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;save_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hn_data.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;save_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
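As a side note, the file-saving step at the end can also be written with a context manager, which closes the file automatically even if `json.dump` raises an exception (the `output` list below just stands in for the scraped data):

```python
import json

# Stand-in for the data collected by the scraper above
output = [{"URL": "https://example.com", "title": "Example article", "rank": "1"}]

# `with` closes the file for us, even if an error occurs mid-write
with open("hn_data.json", "w") as save_output:
    json.dump(output, save_output, indent=6, ensure_ascii=False)
```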

&lt;h3&gt;
  
  
  GitHub repository
&lt;/h3&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/PerVillalva" rel="noopener noreferrer"&gt;
        PerVillalva
      &lt;/a&gt; / &lt;a href="https://github.com/PerVillalva/bs4-hn-scraper" rel="noopener noreferrer"&gt;
        bs4-hn-scraper
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      BeautifulSoup + Requests scraper to extract data from Hacker News
    &lt;/h3&gt;
  &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>python</category>
      <category>beautifulsoup</category>
      <category>requests</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Web scraping with Python</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Tue, 14 Feb 2023 13:21:35 +0000</pubDate>
      <link>https://dev.to/apify/web-scraping-with-python-ojp</link>
      <guid>https://dev.to/apify/web-scraping-with-python-ojp</guid>
      <description>&lt;p&gt;Explore some of the best Python libraries and frameworks available for web scraping and learn how to use them in your projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started with web scraping in Python
&lt;/h2&gt;

&lt;p&gt;Python is one of the most popular programming languages out there and is used across many different fields, such as AI, web development, automation, data science, and data extraction.&lt;/p&gt;

&lt;p&gt;For years, Python has been the go-to language for data extraction, boasting a large community of developers as well as a wide range of web scraping tools to help scrapers extract almost any data they wish from the web.&lt;/p&gt;

&lt;p&gt;This article will explore some of the best libraries and frameworks available for web scraping in Python and provide a quick sample of how to use them in different scraping scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  Requirements
&lt;/h3&gt;

&lt;p&gt;To fully understand the content and code samples showcased in this post, you should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Have Python installed on your computer&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Have a basic understanding of CSS selectors&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Be comfortable navigating the browser DevTools to find and select page elements&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  HTTP Clients
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;In the context of web scraping, HTTP clients are used for sending requests to the target website and retrieving information such as the website's HTML code or JSON payload.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Requests
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51d1874c0hnthjz7b2ao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F51d1874c0hnthjz7b2ao.png" alt="Requests logo" width="195" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://requests.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Requests&lt;/a&gt; is the most popular HTTP library for Python. It is supported by solid documentation and has been adopted by a huge community.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚒️  Main Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep-Alive &amp;amp; Connection Pooling&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Browser-style SSL Verification&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HTTP(S) Proxy Support&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connection Timeouts&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chunked Requests&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
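To sketch a few of these features together (the proxy URL below is a placeholder you would replace with your own), we can reuse connections with a `Session`, set a `timeout`, and optionally route traffic through a proxy:

```python
import requests

# A Session keeps TCP connections alive and pools them across
# requests to the same host (keep-alive & connection pooling)
session = requests.Session()

# Placeholder proxy address -- substitute a real proxy, or omit `proxies`
proxies = {"https": "http://localhost:8080"}

def fetch(url, use_proxy=False):
    """Fetch a URL with a connection timeout, returning None on failure."""
    try:
        return session.get(
            url,
            timeout=5,  # seconds to wait before giving up
            proxies=proxies if use_proxy else None,
        )
    except requests.exceptions.RequestException as err:
        print(f"Request failed: {err}")
        return None
```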

&lt;h3&gt;
  
  
  ⚙️  Installation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;Send a request to the target website, retrieve its HTML code, and print the result to the console.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;requests&lt;/span&gt;

&lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  HTTPX
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fy048ozyykcvxxhitrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3fy048ozyykcvxxhitrg.png" alt="HTTPX" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.python-httpx.org/" rel="noopener noreferrer"&gt;HTTPX&lt;/a&gt; is a fully featured HTTP client library for Python 3, including an integrated command-line client while providing both sync and async APIs.&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚒️  Main Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A broadly requests-compatible API&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;An integrated command-line client&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Standard synchronous interface, but with async support if you need it&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fully type annotated&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  ⚙️  Installation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;httpx

&lt;span class="c"&gt;# For Python 3 macOS users&lt;/span&gt;
pip3 &lt;span class="nb"&gt;install &lt;/span&gt;httpx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;Similar to the &lt;code&gt;Requests&lt;/code&gt; example, we will send a request to the target website, retrieve the HTML of the page, and print it to the console along with the response status code.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;
&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  HTML and XML parser
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;In web scraping, HTML and XML parsers are used to interpret the response we get back from our target website, often in the form of HTML code. A library such as Beautiful Soup will help us parse this response and extract data from websites.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Beautiful Soup
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1cws5eyr648bguyq2iw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1cws5eyr648bguyq2iw.png" alt="Beautiful Soup logo" width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.crummy.com/software/BeautifulSoup/" rel="noopener noreferrer"&gt;Beautiful Soup&lt;/a&gt; (also known as BS4) is a Python library for pulling data out of HTML and XML files with just a few lines of code. BS4 is relatively easy to use and presents itself as a lightweight option for tackling simple scraping tasks with speed.&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚒️ Main features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Provides simple, Pythonic idioms for navigating, searching, and modifying a parse tree.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sits on top of the parser of your choice, such as html.parser, lxml, or html5lib.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automatically converts incoming documents to Unicode and outgoing documents to UTF-8, handling nearly any HTML or XML document.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
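To illustrate the basic workflow on a tiny document, here is a sketch that parses an inline HTML snippet with the standard library's `html.parser` backend:

```python
from bs4 import BeautifulSoup

# A tiny inline document, just for demonstration
html = "<p class='headline'>Hello, <b>world</b></p>"

# html.parser ships with Python; lxml or html5lib can be swapped in
soup = BeautifulSoup(html, "html.parser")

print(soup.find(class_="headline").get_text())  # -> Hello, world
```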
&lt;h3&gt;
  
  
  ⚙️  Installation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;beautifulsoup4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;Let's now see how we can use &lt;strong&gt;Beautiful Soup + HTTPX&lt;/strong&gt; to extract the &lt;strong&gt;title content&lt;/strong&gt;, &lt;strong&gt;rank&lt;/strong&gt;, and &lt;strong&gt;URL&lt;/strong&gt; from all the articles on the first page of &lt;a href="https://news.ycombinator.com/news" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;from bs4 import BeautifulSoup
import httpx

response &lt;span class="o"&gt;=&lt;/span&gt; httpx.get&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"https://news.ycombinator.com/news"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
yc_web_page &lt;span class="o"&gt;=&lt;/span&gt; response.content

soup &lt;span class="o"&gt;=&lt;/span&gt; BeautifulSoup&lt;span class="o"&gt;(&lt;/span&gt;yc_web_page&lt;span class="o"&gt;)&lt;/span&gt;
articles &lt;span class="o"&gt;=&lt;/span&gt; soup.find_all&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"athing"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;article &lt;span class="k"&gt;in &lt;/span&gt;articles:
    data &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"URL"&lt;/span&gt;: article.find&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"titleline"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.find&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"a"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.get&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'href'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;,
        &lt;span class="s2"&gt;"title"&lt;/span&gt;: article.find&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"titleline"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.getText&lt;span class="o"&gt;()&lt;/span&gt;,
        &lt;span class="s2"&gt;"rank"&lt;/span&gt;: article.find&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;class_&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"rank"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;.getText&lt;span class="o"&gt;()&lt;/span&gt;.replace&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"."&lt;/span&gt;, &lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    print&lt;span class="o"&gt;(&lt;/span&gt;data&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A few seconds after running the script, we will see a dictionary for each article, containing its URL, rank, and title, printed to the console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://vpnoverview.com/news/wifi-routers-used-to-produce-3d-images-of-humans/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'WiFi Routers Used to Produce 3D Images of Humans (vpnoverview.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://openjdk.org/jeps/8300786'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'JEP draft: No longer require super() and this() to appear first in a constructor (openjdk.org)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'2'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'item?id=34482433'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Ask HN: Those making $500+/month on side projects in 2023 -- Show and tell'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'3'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.solipsys.co.uk/new/ThePointOfTheBanachTarskiTheorem.html?wa22hn'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The Point of the Banach-Tarski Theorem (solipsys.co.uk)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'4'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://initialcommit.com/blog/git-sim'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Git-sim: Visually simulate Git operations in your own repos (initialcommit.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'5'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.cell.com/cell-reports-medicine/fulltext/S2666-3791(22)00474-8'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Brief structured respiration enhances mood and reduces physiological arousal (cell.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'6'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://en.wikipedia.org/wiki/I,_Libertine'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'I, Libertine (wikipedia.org)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'7'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'item?id=34465956'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Ask HN: Why did BASIC use line numbers instead of a full screen editor?'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'8'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://arxiv.org/abs/2203.03456'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Negative-weight single-source shortest paths in near-linear time (arxiv.org)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'9'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://onesignal.com/careers'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'OneSignal (YC S11) Is Hiring Engineers (onesignal.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'10'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://neelc.org/posts/chatgpt-gmail-spam/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s2"&gt;"Bypassing Gmail's spam filters with ChatGPT (neelc.org)"&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'11'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://cyber.dabamos.de/88x31/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The 88x31 GIF Collection (dabamos.de)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'12'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.middleeasteye.net/opinion/david-graeber-vs-yuval-harari-forgotten-cities-myths-how-civilisation-began'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The Dawn of Everything challenges a mainstream telling of prehistory (middleeasteye.net)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'13'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://blog.thinkst.com/2023/01/swipe-right-on-our-new-credit-card-tokens.html'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Detect breaches with Canary credit cards (thinkst.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'14'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.atlasobscura.com/articles/heritage-appalachian-apples'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Appalachian Apple hunter who rescued 1k '&lt;/span&gt;lost&lt;span class="s1"&gt;' varieties (2021) (atlasobscura.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'15'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.workingsoftware.dev/software-architecture-documentation-the-ultimate-guide/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The Guide to Software Architecture Documentation (workingsoftware.dev)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'16'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://arstechnica.com/tech-policy/2023/01/supreme-court-allows-reddit-mods-to-anonymously-defend-section-230/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Supreme Court allows Reddit mods to anonymously defend Section 230 (arstechnica.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'17'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://neurosciencenews.com/insula-empathy-pain-21818/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'How do we experience the pain of other people? (neurosciencenews.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'18'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://lwn.net/SubscriberLink/920158/313ec4305df220bb/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Nolibc: A minimal C-library replacement shipped with the kernel (lwn.net)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'19'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.economist.com/1843/2017/05/04/the-body-in-the-buddha'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The Body in the Buddha (2017) (economist.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'20'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://simonwillison.net/2023/Jan/13/semantic-search-answers/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'How to implement Q&amp;amp;A against your docs with GPT3 embeddings and Datasette (simonwillison.net)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'21'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://destevez.net/2023/01/decoding-lunar-flashlight/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Decoding Lunar Flashlight (destevez.net)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'22'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.hampsteadheath.net/about'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Hampstead Heath (hampsteadheath.net)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'23'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.otherlife.co/francisbacon/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The violent focus of Francis Bacon (otherlife.co)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'24'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://arstechnica.com/gaming/2019/10/explaining-how-fighting-games-use-delay-based-and-rollback-netcode/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'How fighting games use delay-based and rollback netcode (2019) (arstechnica.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'25'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://essays.georgestrakhov.com/ai-is-not-a-horse/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'AI Is Not a Horse (georgestrakhov.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'26'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://lawliberty.org/features/the-mystery-of-richard-posner/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'The Mystery of Richard Posner (lawliberty.org)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'27'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://rodneybrooks.com/predictions-scorecard-2023-january-01/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Rodney Brooks Predictions Scorecard (rodneybrooks.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'28'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://www.notamonadtutorial.com/how-to-transform-code-into-arithmetic-circuits/'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'How to transform code into arithmetic circuits (notamonadtutorial.com)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'29'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'URL'&lt;/span&gt;: &lt;span class="s1"&gt;'https://github.com/jhhoward/WolfensteinCGA'&lt;/span&gt;, &lt;span class="s1"&gt;'title'&lt;/span&gt;: &lt;span class="s1"&gt;'Wolfenstein 3D with a CGA Renderer (github.com/jhhoward)'&lt;/span&gt;, &lt;span class="s1"&gt;'rank'&lt;/span&gt;: &lt;span class="s1"&gt;'30'&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
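Note that some of the URLs above (for example, the Ask HN entries) are relative paths rather than full links. A minimal standard-library sketch that resolves them against the site root and persists the results as JSON; the sample list and file name are illustrative:

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://news.ycombinator.com/"

# Hypothetical sample mirroring the scraper's output above
articles = [
    {"URL": "https://openjdk.org/jeps/8300786", "title": "JEP draft: ...", "rank": "2"},
    {"URL": "item?id=34482433", "title": "Ask HN: ...", "rank": "3"},
]

# Resolve relative links (Ask HN posts) against the site root;
# absolute URLs pass through urljoin unchanged
for article in articles:
    article["URL"] = urljoin(BASE_URL, article["URL"])

# Persist the cleaned results as JSON for later processing
with open("articles.json", "w", encoding="utf-8") as f:
    json.dump(articles, f, indent=2)

print(articles[1]["URL"])  # → https://news.ycombinator.com/item?id=34482433
```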

&lt;h2&gt;
  
  
  Browser automation tools
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Browser automation libraries and frameworks&lt;/em&gt; have an off-label use for web scraping. Their ability to emulate a real browser is essential for accessing data on websites that require JavaScript to load their content.&lt;/p&gt;
&lt;h3&gt;
  
  
  Selenium
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqnkxjnfbw7yl9sss3st.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqnkxjnfbw7yl9sss3st.png" alt="Selenium logo" width="300" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Selenium is primarily a browser automation framework and ecosystem with an off-label use for web scraping. It uses the WebDriver protocol to control a headless browser and perform actions like clicking buttons, filling out forms, and scrolling.&lt;/p&gt;

&lt;p&gt;Because of its ability to render JavaScript, Selenium can be used to scrape dynamically loaded content.&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚒️ Main features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Browser Support (Firefox, Chrome, Safari, Opera...)&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-Language Compatibility&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automate manual user interactions, such as UI testing, form submissions, and keyboard inputs.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dynamic web elements handling&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  ⚙️  Installation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Selenium&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;selenium

&lt;span class="c"&gt;# We will also need to install webdriver-manager to run the code sample below&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;webdriver-manager
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;To demonstrate some of Selenium's capabilities, let's go to Amazon, scrape &lt;a href="https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0?_encoding=UTF8&amp;amp;qid=1642536225&amp;amp;sr=8-1" rel="noopener noreferrer"&gt;The Hitchhiker's Guide to the Galaxy&lt;/a&gt; product page, and save a screenshot of the accessed page.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;selenium.webdriver.common.by&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;By&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;webdriver_manager.chrome&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChromeDriverManager&lt;/span&gt;

&lt;span class="c1"&gt;# Insert the website URL that we want to scrape
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;driver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;webdriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Chrome&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ChromeDriverManager&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;install&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a dictionary with the scraped data
&lt;/span&gt;&lt;span class="n"&gt;book&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;book_title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;productTitle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSS_SELECTOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.a-link-normal.contributorNameID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;edition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;productSubtitle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CSS_SELECTOR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.a-size-base.a-color-price.a-color-price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Save a screenshot from the accessed page and print the dictionary contents to the console
&lt;/span&gt;&lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;book.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After the script finishes its run, we will see an object containing the book's &lt;strong&gt;title, author, edition,&lt;/strong&gt; and &lt;strong&gt;price&lt;/strong&gt; logged to the console, and a screenshot of the page saved as &lt;code&gt;book.png&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"book_title"&lt;/span&gt;: &lt;span class="s2"&gt;"The Hitchhiker's Guide to the Galaxy: The Illustrated Edition"&lt;/span&gt;,
    &lt;span class="s2"&gt;"author"&lt;/span&gt;: &lt;span class="s2"&gt;"Douglas Adams"&lt;/span&gt;,
    &lt;span class="s2"&gt;"edition"&lt;/span&gt;: &lt;span class="s2"&gt;"Kindle Edition"&lt;/span&gt;,
    &lt;span class="s2"&gt;"price"&lt;/span&gt;: &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$7&lt;/span&gt;&lt;span class="s2"&gt;.99"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Saved image:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2023/01/book_png_-_python-post-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkprheysrihsd6zffkx65.png" width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;
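Scraped fields come back as raw strings, so values like the price usually need light post-processing before analysis. A minimal standard-library sketch; the `parse_price` helper is hypothetical, and the `"$7.99"` value is taken from the output above:

```python
from decimal import Decimal

def parse_price(raw: str) -> Decimal:
    """Strip the currency symbol and thousands separators from a scraped price string."""
    return Decimal(raw.strip().lstrip("$").replace(",", ""))

# "$7.99" is the price string scraped in the example above
print(parse_price("$7.99"))  # → 7.99
```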
&lt;h3&gt;
  
  
  Playwright
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsbdavarevisuvghq81j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsbdavarevisuvghq81j.png" alt="Playwright logo" width="646" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By definition, &lt;a href="https://github.com/microsoft/playwright" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; is an open-source framework for web testing and automation developed and maintained by Microsoft.&lt;/p&gt;

&lt;p&gt;Despite having many features in common with Selenium, Playwright is considered a more modern and capable choice for automation, testing, and web scraping in Python.&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚒️ Main features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto-wait. Playwright, by default, waits for elements to be actionable before performing actions, eliminating the need for artificial timeouts.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-browser support, being able to drive Chromium, WebKit, Firefox, and Microsoft Edge.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cross-platform support. Available on Windows, Linux, and macOS, locally or on CI, headless, or headed.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  ⚙️  Installation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using pip&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;pytest-playwright

&lt;span class="c"&gt;# For Python 3 macOS users&lt;/span&gt;
pip3 &lt;span class="nb"&gt;install &lt;/span&gt;pytest-playwright

&lt;span class="c"&gt;# Install the required browsers&lt;/span&gt;
playwright &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;To highlight Playwright's features as well as its similarities with Selenium, let's go back to Amazon's website and extract some data from &lt;a href="https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/ref=tmm_kin_swatch_0?_encoding=UTF8&amp;amp;qid=1642536225&amp;amp;sr=8-1" rel="noopener noreferrer"&gt;The Hitchhiker's Guide to the Galaxy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Playwright version:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.sync_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sync_playwright&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;sync_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create a dictionary with the scraped data
&lt;/span&gt;    &lt;span class="n"&gt;book&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;book_title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#productTitle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;author&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.author .a-link-normal.contributorNameID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;edition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;#productSubtitle&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.a-size-base.a-color-price.a-color-price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;book&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;book.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After the scraper finishes its run, the Firefox browser controlled by Playwright will close, and the extracted data will be logged to the console.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scrapy: a full-fledged Python web crawling framework
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Scrapy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4ret06q96d9z7bfjmx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4ret06q96d9z7bfjmx4.png" alt="Scrapy logo" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scrapy is a fast high-level web crawling and web scraping framework written with &lt;a href="https://twistedmatrix.com/trac/" rel="noopener noreferrer"&gt;Twisted&lt;/a&gt;, a popular event-driven networking framework, which gives it asynchronous capabilities.&lt;/p&gt;

&lt;p&gt;Unlike the tools mentioned earlier, Scrapy is a full-fledged web crawling framework designed specifically for data extraction, with built-in support for handling requests, processing responses, and exporting data.&lt;/p&gt;

&lt;p&gt;Additionally, Scrapy provides handy out-of-the-box features, such as support for following links, handling multiple request types, and error handling, making it a powerful tool for web scraping projects of any size and complexity.&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚒️ Main features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feed exports in multiple formats, such as JSON, CSV, and XML.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;An interactive shell console for trying out CSS and XPath expressions to scrape data and debug your spiders.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Built-in extensions and middlewares for handling cookies, HTTP authentication, caching, user-agent spoofing, and more.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  ⚙️  Installation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  📁 Project setup
&lt;/h3&gt;

&lt;p&gt;To demonstrate some of Scrapy's features, we will once again extract data from articles displayed on &lt;a href="https://news.ycombinator.com/" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We will start by scraping the top 30 articles and then use Scrapy's &lt;code&gt;CrawlSpider&lt;/code&gt; to follow the available page links and extract data from all the articles on the website.&lt;/p&gt;

&lt;p&gt;To begin, let's create a new directory, install Scrapy, initialize the project, and generate a new spider:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create new directory and move into it&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;scrapy-project
&lt;span class="nb"&gt;cd &lt;/span&gt;scrapy-project

&lt;span class="c"&gt;# Install Scrapy&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy

&lt;span class="c"&gt;# Initialize project&lt;/span&gt;
scrapy startproject scrapydemo

&lt;span class="c"&gt;# Generate spider&lt;/span&gt;
scrapy genspider demospider https://news.ycombinator.com/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After our spider is generated, let's specify the encoding for the output file that will contain the scraped data by adding &lt;code&gt;FEED_EXPORT_ENCODING = "utf-8"&lt;/code&gt; to our &lt;code&gt;settings.py&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2023/01/settings_py_-_python-scrapy.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8bam2lr3s7hraiaei97.png" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;Finally, go to the &lt;code&gt;demospider.py&lt;/code&gt; file and write some code:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DemospiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;demospider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tr.athing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rank::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then, let's use the following command to run the spider and store the scraped data in a &lt;code&gt;results.json&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy crawl demospider &lt;span class="nt"&gt;-o&lt;/span&gt; results.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
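&lt;p&gt;A note on the &lt;code&gt;-o&lt;/code&gt; flag: Scrapy infers the feed format from the file extension, so the same spider can export JSON, CSV, or XML without any code changes. Also, &lt;code&gt;-o&lt;/code&gt; appends to an existing file, while the capital &lt;code&gt;-O&lt;/code&gt; (available since Scrapy 2.1) overwrites it:&lt;/p&gt;

```shell
# Export format is inferred from the file extension
scrapy crawl demospider -o results.json
scrapy crawl demospider -o results.csv
scrapy crawl demospider -o results.xml

# -O (capital) overwrites the output file instead of appending to it
scrapy crawl demospider -O results.json
```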

&lt;h3&gt;
  
  
  🕷️ Using Scrapy's CrawlSpider
&lt;/h3&gt;

&lt;p&gt;Now that we know how to extract data from the articles on the first page of Hacker News, let's use Scrapy's &lt;code&gt;CrawlSpider&lt;/code&gt; to follow the next page links and collect data from all the articles on the website.&lt;/p&gt;

&lt;p&gt;To do that, we will make some adjustments to our &lt;code&gt;demospider.py&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add imports CrawlSpider, Rule and LinkExtractor 👇
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.spiders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Rule&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy.linkextractors&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinkExtractor&lt;/span&gt;

&lt;span class="c1"&gt;# Change the spider from "scrapy.Spider" to "CrawlSpider"
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DemospiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CrawlSpider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;demospider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;allowed_domains&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;news.ycombinator.com&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;start_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://news.ycombinator.com/news?p=1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Define a rule that should be followed by the link extractor.
&lt;/span&gt;    &lt;span class="c1"&gt;# In this case, Scrapy will follow all the links with the "morelink" class
&lt;/span&gt;    &lt;span class="c1"&gt;# And call the "parse_article" function on every crawled page
&lt;/span&gt;    &lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Rule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LinkExtractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;restrict_css&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.morelink&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;parse_article&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;follow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# When using the CrawlSpider we cannot use a parse function called "parse".
&lt;/span&gt;    &lt;span class="c1"&gt;# Otherwise, it will override the default function.
&lt;/span&gt;    &lt;span class="c1"&gt;# So, just rename it to something else, for example, "parse_article"
&lt;/span&gt;    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tr.athing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::attr(href)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.titleline a::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.rank::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, let's add a small delay between each of Scrapy's requests to avoid overloading the server. We can do that by adding &lt;code&gt;DOWNLOAD_DELAY = 0.5&lt;/code&gt; to our &lt;code&gt;settings.py&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2023/01/settings_py_-_python-scrapy-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0mj4uksjnlcvy7peq3y.png" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Great! Now we are ready to run our scraper and get the data from all the articles displayed on Hacker News. Just run the command &lt;code&gt;scrapy crawl demospider -o results.json&lt;/code&gt; and wait for the run to finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2023/01/results_json_-_python-scrapy-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzt5pmakkqfiu7wh2rix.png" width="800" height="665"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  🎭 Using Playwright with Scrapy
&lt;/h2&gt;

&lt;p&gt;Scrapy and Playwright make one of the most efficient combos for modern web scraping in Python.&lt;/p&gt;

&lt;p&gt;This combo allows us to benefit from Playwright's ability to render dynamically loaded content and retrieve the resulting HTML, which we can then parse with Scrapy to extract the data we need.&lt;/p&gt;

&lt;p&gt;To integrate Playwright with Scrapy, we will use the &lt;a href="https://github.com/scrapy-plugins/scrapy-playwright" rel="noopener noreferrer"&gt;scrapy-playwright&lt;/a&gt; library. Then, we will scrape &lt;a href="https://www.mintmobile.com/product/google-pixel-7-pro-bundle/" rel="noopener noreferrer"&gt;&lt;code&gt;https://www.mintmobile.com/product/google-pixel-7-pro-bundle/&lt;/code&gt;&lt;/a&gt; to demonstrate how to extract data from a website using Playwright and Scrapy.&lt;/p&gt;

&lt;p&gt;Mint Mobile requires JavaScript to load most of the content displayed on its product page, which makes it an ideal scenario for using Playwright in the context of web scraping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mint Mobile product page with JavaScript &lt;em&gt;disabled&lt;/em&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/12/Google_Pixel_7_Pro_Bundle___Mint_Mobile.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxmfjzybwhx7m1stiuzo9.png" alt="Mint Mobile JavaScript disabled" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mint Mobile product page with JavaScript &lt;em&gt;enabled&lt;/em&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/content/images/2022/12/Google_Pixel_7_Pro_Bundle___Mint_Mobile-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5ksc64u8se30m9leetw.png" alt="Mint Mobile JavaScript enabled" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  ⚙️  Project setup
&lt;/h3&gt;

&lt;p&gt;Start by creating a directory to house our project and installing the necessary dependencies:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create new directory and move into it&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;scrapy-playwright
&lt;span class="nb"&gt;cd &lt;/span&gt;scrapy-playwright
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Scrapy and scrapy-playwright&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;scrapy scrapy-playwright

&lt;span class="c"&gt;# Install the required browsers if you are running Playwright for the first time&lt;/span&gt;
playwright &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# Or install a subset of the available browsers you plan on using&lt;/span&gt;
playwright &lt;span class="nb"&gt;install &lt;/span&gt;firefox chromium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next, start the Scrapy project and generate a spider:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scrapy startproject pwsdemo
scrapy genspider demospider https://www.mintmobile.com/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now, let's activate &lt;code&gt;scrapy-playwright&lt;/code&gt; by adding &lt;code&gt;DOWNLOAD_HANDLERS&lt;/code&gt; and &lt;code&gt;TWISTED_REACTOR&lt;/code&gt; to the scraper configuration in &lt;code&gt;settings.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# scrapy-playwright configuration
&lt;/span&gt;
&lt;span class="n"&gt;DOWNLOAD_HANDLERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;TWISTED_REACTOR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;twisted.internet.asyncioreactor.AsyncioSelectorReactor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Great! We are now ready to write some code to scrape our target website.&lt;/p&gt;
&lt;h3&gt;
  
  
  💡  Code Sample
&lt;/h3&gt;

&lt;p&gt;So, without further ado, let's use Playwright + Scrapy to extract data from Mint Mobile.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapy_playwright.page&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PageMethod&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DemospiderSpider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;demospider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;scrapy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://www.mintmobile.com/product/google-pixel-7-pro-bundle/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;meta&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="c1"&gt;# Use Playwright
&lt;/span&gt;            &lt;span class="n"&gt;playwright&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Keep the page object so we can work with it later on
&lt;/span&gt;            &lt;span class="n"&gt;playwright_include_page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# Use PageMethods to wait for the content we want to scrape to be properly loaded before extracting the data
&lt;/span&gt;            &lt;span class="n"&gt;playwright_page_methods&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="nc"&gt;PageMethod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wait_for_selector&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;div.m-productCard--device&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.m-productCard__heading h1::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.composited_product_details_wrapper &amp;gt; div &amp;gt; div &amp;gt; div:nth-child(2) &amp;gt; div.label &amp;gt; span::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pay_monthly_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.composite_price_monthly &amp;gt; span::text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pay_today_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.composite_price p.price span.amount::attr(aria-label)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, run the spider using the command &lt;code&gt;scrapy crawl demospider -o results.json&lt;/code&gt; to scrape the target data and store it in a &lt;code&gt;results.json&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;
    &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"Google Pixel 7 Pro"&lt;/span&gt;,
        &lt;span class="s2"&gt;"memory"&lt;/span&gt;: &lt;span class="s2"&gt;"128GB"&lt;/span&gt;,
        &lt;span class="s2"&gt;"pay_monthly_price"&lt;/span&gt;: &lt;span class="s2"&gt;"50"&lt;/span&gt;,
        &lt;span class="s2"&gt;"pay_today_price"&lt;/span&gt;: &lt;span class="s2"&gt;"589"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Learning resources 📚
&lt;/h2&gt;

&lt;p&gt;If you want to dive deeper into some of the libraries and frameworks we presented during this post, here is a curated list of great videos and articles about the topic:&lt;/p&gt;
&lt;h3&gt;
  
  
  General web scraping
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.apify.com/academy/web-scraping-for-beginners" rel="noopener noreferrer"&gt;Web scraping for beginners&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.apify.com/academy/node-js/choosing-the-right-scraper" rel="noopener noreferrer"&gt;How to choose the right scraper for the job&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.apify.com/academy/api-scraping" rel="noopener noreferrer"&gt;API scraping&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.apify.com/academy/advanced-web-scraping/scraping-paginated-sites#how-to-overcome-the-limit" rel="noopener noreferrer"&gt;Scraping websites with limited pagination&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://developers.apify.com/academy/anti-scraping" rel="noopener noreferrer"&gt;Anti-scraping protections&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Beautiful Soup Tutorials
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developers.apify.com/academy/python/scrape-data-python" rel="noopener noreferrer"&gt;How to scrape data in Python using Beautiful Soup&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/gRLHr664tXA"&gt;
&lt;/iframe&gt;
&lt;/p&gt;
&lt;h3&gt;
  
  
  Browser automation tools
&lt;/h3&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Xjv1sY630Uc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/H2-5ecFwHHQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;
&lt;h3&gt;
  
  
  Scrapy
&lt;/h3&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/s4jtkzHhLzY"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/0wO7K-SoUHM"&gt;
&lt;/iframe&gt;
&lt;/p&gt;
&lt;h3&gt;
  
  
  Discord
&lt;/h3&gt;

&lt;p&gt;Finally, don't forget to join the &lt;strong&gt;Apify &amp;amp; Crawlee&lt;/strong&gt; community on Discord to connect with other web scraping and automation enthusiasts. 🚀&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer" class="c-link"&gt;
          Apify &amp;amp; Crawlee
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          This is the official developer community of Apify and Crawlee. | 11719 members
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdiscord.com%2Fassets%2Ffavicon.ico" width="256" height="256"&gt;
        discord.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;



</description>
      <category>python</category>
      <category>tutorial</category>
      <category>webscraping</category>
    </item>
  </channel>
</rss>
