DEV Community: Usama Jamil

Web scraping with Python Requests

Usama Jamil — Wed, 24 May 2023 16:15:56 +0000

Python has become a language widely used by developers because it's easy, efficient, and flexible. Here we're going to learn how to scrape websites with Python. We will learn a few basic concepts, but our main focus will be web scraping in Python with the Requests library.

In simple words, web scraping means getting a website's content and extracting relevant data from that content. Almost all programming languages provide support to automate this process.

Python is no exception: it has libraries that make this process easier and faster.

Why is Python used for web scraping?

There are many reasons why you should use Python for web scraping. Here are a few of them:

Python has an enormous community that provides hundreds of articles and documentation for beginners. This makes things easier for beginners, and they can learn quickly.
Python is renowned for its readability, making it easier to understand and write code.
It has a large standard library and many built-in functions, so we rarely need to write common utilities ourselves. This saves time and makes the process faster.
It's dynamically typed, so we don't have to declare data types: just define a variable and give it a value of any type.
It provides data processing and visualization tools, making data analysis and structuring easier.

What are the best Python tools and libraries for web scraping?

Depending on your needs and use cases, Python has several tools and libraries for web scraping. Let's take a look at a few of them:

Scrapy: A web crawling framework that provides a complete set of tools for web scraping and helps to structure data.
BeautifulSoup: Used for parsing HTML and XML documents. It creates a parsed tree for the web pages and allows us to extract data.
Selenium: A browser automation tool that is also used for web scraping. It supports multiple browsers like Chrome, Firefox, and Edge.
PyQuery: A Python library that uses a jQuery-like syntax. It's built on lxml, allowing us to manipulate and extract data from HTML and XML documents.
Newspaper3k: Explicitly designed to scrape news websites. It's built on top of libraries like BeautifulSoup and lxml. It automatically detects and extracts news content, and it can handle pagination as well.
Requests: A Python library that allows us to make HTTP requests to get and manipulate web pages.

In this article, our main focus will be the Requests library. So, let's start our journey.

Getting started with Python Requests

The Python Requests library makes HTTP requests simple. It's lightweight and efficient, which makes it a great choice for web scraping.

It's one of the most downloaded Python packages, and there's a good reason for that. It does a really good job of taking on tasks that used to be complicated and confusing. That's why the tagline for the Requests library is "HTTP for Humans".

How to set up Python Requests?

Let's install the library so we can use it. Open up your terminal or command line and enter the following command to create a new directory and a Python file.

mkdir pythonRequests
cd pythonRequests
touch main.py

We'll use pip3 to install Requests:

pip3 install requests

This should be a fairly quick download. Once the command is executed, you can confirm the installation by simply running the following command:

pip3 show requests

This should print the name of the library with other information. Once we're done with the installation, we're ready to make our HTTP requests.

💡 Be careful when installing packages through pip. On systems that have both Python 2 and Python 3 installed, plain pip may point to Python 2, so use pip3 (or python3 -m pip) to install packages for Python 3.

How to send HTTP requests with Requests?

With the Requests library, you can not only get information from different websites but also send information to the pages, download images, and perform many other tasks. Let's see what this library offers:

GET: The GET method allows us to get information from a website.
POST: The POST method enables us to send information to the server, like submitting forms.
PUT: The PUT method is used to update existing data.
DELETE: This is simply used to delete data from the server.

These methods work just like having a discussion with the server. In return for these methods, we also get a response from the server. This response comes with different codes called status codes. For example, the server returns a 200 when it has that particular requested thing or file; 404 means it has nothing related to the request, and so on.
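
As a rough illustration of how status codes group into families, the sketch below uses only Python's standard http module, so it runs without network access. The describe_status helper is a hypothetical name introduced for this example; it is not part of Requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short label such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that the standard library does not know about
        phrase = "Unknown"

    if 100 <= code < 200:
        family = "informational"
    elif 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirection"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "not a standard HTTP status"
    return f"{code} {phrase} ({family})"


for code in (200, 404, 500):
    print(describe_status(code))
```

With Requests, you could pass response.status_code to a helper like this one to log what happened.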

Now comes the point where you make HTTP requests. Simply call the function named after the HTTP method you need; it sends the request and returns a Response object. Let's say you simply want to get the data from a website; use requests.get() for this.

import requests
response = requests.get('https://www.lipsum.com/')
print(response.status_code)
print(response.content)

This code sends a GET request to the Lorem Ipsum website and prints the response status code and content.

💡 One thing that needs to be mentioned here is that the Requests library is a great tool for making HTTP requests, but it's not meant to parse the HTML pages and get information. If you want to do that, you need to use it with other libraries, like Beautiful Soup.

Handling HTTP responses with Requests

Once you've got the response, you can read its attributes and call its methods; a few of them are below:

response.status_code: This attribute holds the status code of the response.
response.content: The content attribute gives the body as raw bytes.
response.text: This gives the body decoded as Unicode text.
response.json(): This method is useful when the response is in the JSON format. It parses the body and returns Python objects, typically a dictionary or a list.
response.headers: This attribute gives us the headers, which carry information like the type of data, length, and encoding; it's the metadata, that is, information about the information.

💡 The response.content and response.text attributes give the data in different formats. The response.content is useful when dealing with images or PDF files. The response.text helps us to get text data such as HTML.
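
To make the difference between bytes and text concrete, here is a minimal offline sketch. The raw_body value is a made-up stand-in for response.content, and the decoding step only approximates what response.text does; Requests picks the encoding from the response headers or guesses it from the body.

```python
# Stand-in for response.content: the raw bytes a server might send
raw_body = "<h1>Café</h1>".encode("utf-8")

# Roughly what response.text does: decode the bytes with a text encoding
html_text = raw_body.decode("utf-8")

print(type(raw_body).__name__, len(raw_body))    # bytes, length counted in bytes
print(type(html_text).__name__, len(html_text))  # str, length counted in characters
```

The two lengths differ because the character é takes two bytes in UTF-8 but counts as one character in the decoded string.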

Let's make a request and see whether it's successful.

import requests
response = requests.get("https://www.wikipedia.org")
if response.status_code == 200:
    print("The GET request was successful")
else:
    print("The server returned the status code:", response.status_code)

This code sends a GET request to the website. It prints "The GET request was successful" if the status code is 200. Otherwise, it prints "The server returned the status code:" followed by the status code.

How the Requests library handles different types of data

While extracting data from websites, we may interact with different types of data. Requests specifically provides built-in methods to handle different data types. Here are a few of those with examples.

HTML

After making a request to a website, if the server returns the HTML of the webpage, this is how you can handle it:

import requests
response = requests.get("https://www.wikipedia.org")
htmlResponse = response.text
print(htmlResponse)

The code sends a GET request to the Wikipedia website and reads the response body as HTML text.

JSON

Let's say you're extracting a dataset in JSON: the Requests library provides a method that parses the data into Python objects such as dictionaries and lists.

import requests
response=requests.get("https://jsonplaceholder.typicode.com/posts/1")
jsonResponse = response.json()
print(jsonResponse)

The code sends the GET request to the jsonplaceholder website and returns the data in the form of a Python dictionary.

Binary data

The Requests library provides support to handle binary data or the content in bytes, like images or pdf files. Let's say you want to download the logo of Google:

import requests
response = requests.get("https://www.google.com/images/branding/googlelogo/2x/googlelogo_light_color_272x92dp.png")
googleLogo = response.content
with open("logo.png", "wb") as file:
    file.write(googleLogo)

This code sends a GET request to the provided link and reads the response body as bytes. The built-in open() function opens the file logo.png in wb (write binary) mode, and the bytes are written to that file.

💡 The open() function opens the file provided as an argument. In write mode, if the file is not present, it creates one with that name; if it already exists, its contents are overwritten.
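
For large files, it can help to write the body in chunks rather than holding it all in memory. The sketch below shows one way to do that. The save_chunks helper and the demo chunks are hypothetical names and data for this example; the commented lines show how the chunks would come from Requests via its stream parameter and iter_content method.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    written = 0
    with open(path, "wb") as file:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                file.write(chunk)
                written += len(chunk)
    return written


# With Requests, the chunks would come from a streamed response, for example:
#   with requests.get(url, stream=True) as response:
#       save_chunks(response.iter_content(chunk_size=8192), "logo.png")

# Offline demonstration with made-up chunks instead of a real download
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
demo_size = save_chunks([b"\x89PNG", b"", b"fake-image-bytes"], demo_path)
print(demo_size, os.path.getsize(demo_path))
```

Because the helper accepts any iterable of bytes, it can be tried without a network connection and reused with a real streamed response later.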

How the HTML data is parsed and extracted

As we've already seen, the Requests library only allows us to get the data of a website, but it doesn't provide support to parse that data. So, we need another library for this. In this tutorial, we're going to get help from Beautiful Soup.

To install Beautiful Soup in your project, just enter the following command on the command line and press Enter.

pip3 install beautifulsoup4

Now you're ready to parse and extract data from any website. Let's say you want all the blog titles from the Apify website.

The code for it looks like this:

import requests
from bs4 import BeautifulSoup
response = requests.get("https://blog.apify.com/")
# Create an object of BeautifulSoup by parsing the content
soup = BeautifulSoup(response.content, "html.parser")
# Find all the h2 elements with the class "post-title"
postTitles = soup.find_all("h2", class_="post-title")
# Loop through the list of post titles
for postTitle in postTitles:
    print(postTitle.text.strip())

This code sends a GET request to the Apify website, parses the HTML content with Beautiful Soup, and finds all blog titles.

What are the challenges of web scraping with Python Requests?

With every tool comes functionalities and limitations. Python Requests has limitations as well. Let's go through a few of these.

All the requests are synchronous, which means every request blocks the execution of the program until it completes. This can cause issues when making a large number of requests. To avoid this, you can run requests concurrently with gevent or a thread pool, or use an asynchronous HTTP client such as aiohttp or HTTPX together with asyncio.
The Python Requests library is only designed to make HTTP or HTTPS requests. It doesn't provide support for non-HTTP protocols. To use protocols like FTP or SSH, you need other Python libraries like ftplib or paramiko.
By default, it loads the entire response into memory. This can cause problems when dealing with large files. To download files in chunks, you can pass the stream parameter or use other libraries like wget or urllib3.
It doesn't retry failed requests by default. You can mount an HTTPAdapter configured with urllib3's Retry class on a Session, or write a small retry loop of your own.
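
Since retries are not handled by default, here is a minimal hand-rolled sketch of retrying with exponential backoff. The get_with_retries helper, the FakeResponse class, and the flaky_get function are hypothetical names made up for this example; in real use you would pass requests.get (or a Session's get method) as the first argument.

```python
import time


def get_with_retries(get, url, attempts=3, backoff=0.5, sleep=time.sleep):
    """Call get(url) until it succeeds or the attempts run out.

    get is any callable that returns an object with a status_code attribute,
    for example requests.get or a Session's get method.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            response = get(url)
            if response.status_code < 500:
                return response  # success, or a client error a retry won't fix
            last_error = RuntimeError(f"server returned {response.status_code}")
        except OSError as error:
            # Requests' exceptions derive from OSError, as do socket errors
            last_error = error
        if attempt < attempts - 1:
            sleep(backoff * 2 ** attempt)  # wait 0.5s, then 1s, then 2s, ...
    raise last_error


# Offline demonstration: a fake fetcher that fails twice, then succeeds
class FakeResponse:
    status_code = 200


calls = []


def flaky_get(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated network failure")
    return FakeResponse()


result = get_with_retries(flaky_get, "https://example.com", sleep=lambda s: None)
print(result.status_code, len(calls))
```

Passing the fetch function in as an argument keeps the retry logic separate from the HTTP library, which also makes it easy to try out without touching the network.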

Advanced web scraping techniques

Python Requests provides some advanced features like handling cookies, authentication, session management, etc. They can be used to avoid blocking and improve scraping efficiency.

How to handle Cookies with Python Requests?

A cookie is a small piece of text sent from the website to the user. It is stored in the user's browser to remember information like products in the cart and login information. So, next time you add a product to the cart or sign in somewhere and accidentally close the browser window, you open the website and find the same state. That's done through cookies. They maintain the state of a website. They're also used for tracking users' behavior, such as which pages they visit and how long they stayed on the site.
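
To see what that "small piece of text" looks like, the sketch below parses a made-up Set-Cookie header value with Python's standard http.cookies module. The cookie name cartId and its attributes are assumptions for illustration only; Requests does this parsing for us behind the scenes.

```python
from http.cookies import SimpleCookie

# A made-up Set-Cookie header value, as a server might send it
set_cookie_header = "cartId=abc123; Path=/; Max-Age=3600"

cookie = SimpleCookie()
cookie.load(set_cookie_header)

morsel = cookie["cartId"]
print(morsel.value)       # the stored text
print(morsel["path"])     # the URL path the cookie applies to
print(morsel["max-age"])  # lifetime in seconds, kept as a string
```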

With Python Requests, we can both read the cookies a server sends and send cookies of our own. Let's see how we can access cookies:

import requests
# send the GET request
response = requests.get('https://stackoverflow.com/')
# Get the cookies from the response
cookies = response.cookies
print(cookies)

This code sends a GET request to the StackOverflow website and gets the cookies from it.

Now, say we want to send cookies to the httpbin website. The cookies parameter in the request allows us to send cookies back to the server.

import requests
# Create a dictionary of cookies
cookies = {'exampleCookie': '123456789'}
response = requests.get('http://httpbin.org/cookies', cookies=cookies)
print(response.text)

This code sends a GET request with the cookies passed through the cookies parameter. The server reads the cookies from the request and processes them according to its own implementation; httpbin simply echoes them back in the response.

How to authenticate using Python Requests?

Requests doesn't locate or fill in login forms for us. To log in to a site that uses a form, we send a POST request to the URL the form submits to, passing the credentials under the same field names the form uses. It's that simple.

Here's how it's done:

import requests
# Login credentials
credentials = {
    'email': 'yourEmail',
    'password': 'yourPassword'
}

response = requests.post('https://newsapi.org/login', data=credentials)

if response.status_code == 200:
    homePage = response.text
    print('Authentication Successful!')
else:
    print('Authentication failed!')

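This code sends a POST request to the News API website with the authentication credentials. If the status code is 200, it stores the returned HTML in the homePage variable and prints "Authentication Successful!"; otherwise, it prints "Authentication failed!". Keep in mind that a 200 status only means the server answered normally; some sites return 200 even when the login is rejected, so check the page content if you need to be sure.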

💡 You can still post your credentials by using the GET request and including it in the query string. This is generally not recommended because the query string can be visible to third parties and may be cached by intermediaries such as proxies or caching servers.

How does session management work in Requests?

A session can be used to store users' information on the server throughout their interaction with the website.

Instead of storing the changing information of the user through cookies in the browser, the server gives the browser a unique id and stores some temporary variables. Every time that particular browser makes a request, the server receives the id and retrieves the variables.

In the context of web scraping, sometimes we need to log in to a website and navigate to different routes. In Requests, a Session object keeps cookies between requests and reuses the underlying connection, so the logged-in state carries over from page to page and repeated requests to the same host are faster.

Here's an example in which we'll first log in and then try to access the subscription page.

💡 To make this example executable and to get the desired output, you need to provide the same credentials you provided in the above example. If you don't have an account, you can create one for free.

We'll also extract the current plan from the subscription page.

import requests
from bs4 import BeautifulSoup
# create session object
session = requests.Session()
# Login credentials
credentials = {
    'email': 'yourEmail',
    'password': 'yourPassword'
}

response = session.post('https://newsapi.org/login', data=credentials)
# Show an error if the status_code is not 200
if response.status_code != 200:
    print("Login failed.")
else:
    subscriptionResponse = session.get('https://newsapi.org/account/manage-subscription')
    soup = BeautifulSoup(subscriptionResponse.content, 'html.parser')
    subscriptionPlan = soup.find('div', class_='mb2')
    # Print the plan if the element was found
    if subscriptionPlan is not None:
        print(subscriptionPlan.text.strip())
    else:
        print("Failed to find plan element.")

How to use proxies with Python Requests?

When trying to make multiple HTTP requests to a website through a script, the website may detect it and block us from making more requests. So, in order to avoid this, we can use different proxies. With multiple proxies, the website will think that the request is coming from different sources and won't limit the usage.

Here's an example of using proxies with Requests:

import requests

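# Different proxies (placeholders; replace them with proxies that work for you)
proxy_urls = [
    'http://103.69.108.78:8191',
    'http://198.199.86.11:8080',
    'http://200.105.215.22:33630',
]

# Make four GET requests, rotating through the proxies
for i in range(4):
    proxy_url = proxy_urls[i % len(proxy_urls)]
    # Route both HTTP and HTTPS traffic through the chosen proxy
    proxies = {'http': proxy_url, 'https': proxy_url}
    try:
        response = requests.get('https://httpbin.org', proxies=proxies, timeout=10)
    except requests.exceptions.RequestException:
        # Free proxies are often offline, so treat connection errors as failures
        print("Request failed")
        continue
    # Print "Request successful" if the status code is 200
    if response.status_code == 200:
        print("Request successful")
    # Print "Request failed" otherwise
    else:
        print("Request failed")

This code sends four GET requests to httpbin, rotating through the proxies, and prints a message for each request. A dictionary can hold only one value per key, so the proxies are kept in a list and one of them is picked for each request. You can get different proxies from this Free Proxy List.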

User-agents

Modern-day websites can easily detect requests made by scripts; by default, Requests identifies itself with a User-Agent header that starts with python-requests. To avoid detection, we can set a browser-like User-Agent in the headers of our request to make it look like a real browser request. It's a widely used technique for avoiding blocks.

import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
response = requests.get('https://apify.com/', headers=headers)
if response.status_code == 200:
    print("Request successful")
# Print "Request failed" otherwise
else:
    print("Request failed")

This code sends a GET request to the Apify website with a browser-like User-Agent header and prints a message according to the response. Generate User-agents Online allows you to generate user-agents as well.

Throttling and rate limiting

Throttling and rate limiting are important techniques that limit the number of requests we send and add a delay between them. If we send requests too rapidly, the website may detect our bot and block us. To avoid this, we can pause between requests:

import requests
# Import the time library to add delays between the requests
import time
# Make 10 requests
for i in range(10):
    response = requests.get('https://blog.apify.com/')
    # Add a 1-second delay between requests
    time.sleep(1)
    if response.status_code == 200:
      print("Request successful")
    # Print Request failed otherwise
    else:
      print("Request failed")

This code sends 10 GET requests to Apify with a 1-second delay between each request.
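
A fixed sleep adds the full delay on top of however long each request takes. As a variation, the sketch below enforces a minimum interval between requests instead. The throttled generator and the fake clock are hypothetical names for this example; the commented lines show how it would wrap a loop that calls requests.get.

```python
import time


def throttled(items, min_interval, clock=time.monotonic, sleep=time.sleep):
    """Yield items one by one, at least min_interval seconds apart."""
    last_time = None
    for item in items:
        if last_time is not None:
            remaining = min_interval - (clock() - last_time)
            if remaining > 0:
                sleep(remaining)
        last_time = clock()
        yield item


# With Requests, the loop body would fetch each URL, for example:
#   for url in throttled(urls, min_interval=1.0):
#       response = requests.get(url)

# Offline demonstration with a fake clock, so nothing actually sleeps
fake_now = [0.0]
pauses = []


def fake_sleep(seconds):
    pauses.append(seconds)
    fake_now[0] += seconds


pages = list(throttled(["page-1", "page-2", "page-3"], 1.0,
                       clock=lambda: fake_now[0], sleep=fake_sleep))
print(pages, pauses)
```

If a request already took longer than the interval, the generator does not sleep at all, so slow responses are not penalized twice.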

Testing and debugging web scraping with Requests

The process of web scraping can lead to errors and technical issues. It's good to test the code before deployment to avoid potential errors, but every beginner may get stuck on a few common problems:

What are common errors in web scraping?

Here are a few errors that can affect our scraping scripts.

400 Bad Request: This status code means that the request itself is wrong in a way the server can't process, for example, invalid request message framing, malformed request syntax, or deceptive request routing. We should check that the URL and the headers are correct.
401 Unauthorized: This error occurs when we try to access a resource that requires some authentication. So, if we're using some credentials, we need to double-check them.
403 Forbidden: This error occurs when the server denies access to the user or the script attempting to access the website. We'll see this error many times when scraping. We can use user-agents, request headers, or rotate proxies to avoid this error.
404 Not Found: This error occurs when we try to access a resource that is not available or may have been deleted. This error also occurs when the URL is incorrect.
500 Internal Server Error: This error occurs when the server encounters an unexpected condition that prevents it from fulfilling the request.
501 Not Implemented: This error indicates that the server does not support the functionality that we've asked for in the request.

It's very helpful to have some knowledge about these common errors that may halt our scraping process. Why not begin with our article on how to solve 403 errors?

You can also read about other Python web scraping libraries we mentioned here below.

Web scraping with Cheerio in 2023

Usama Jamil — Fri, 21 Apr 2023 13:59:22 +0000

If you're new to web scraping and need help figuring out where to start, this Cheerio web scraping tutorial is for you. We'll cover each step of web scraping from basics to advanced. Let's start our journey.

Getting started with Cheerio and web scraping

Web scraping is an automated process of extracting data from websites using a bot or other software. Once fetched, the data is stored in a structured form such as a spreadsheet, a JSON or XML file, or a database.

It's an outstanding method for researchers and business people who need to gather large amounts of data quickly and efficiently. For instance, an e-commerce website's owner can use web scraping to keep an eye on the pricing information of a competitor's website.

What is Cheerio?

Now, what is Cheerio all about? Well, Cheerio is a JavaScript library used for web scraping on the server side, and it's designed explicitly for Node.js. It's a lightweight library that allows you to parse web pages and extract data using CSS-style selectors. Cheerio loads HTML from a string and returns an object whose built-in methods extract data.

One of the best things about Cheerio is that it runs directly in Node.js without a browser, making it faster and more efficient than other web scraping tools. It was first released in 2010 by Matt Mueller and has gained significant popularity among developers due to its versatility and ease of use.

Why use Cheerio?

Let's explore the reasons to use Cheerio:

Its jQuery-like syntax makes HTML manipulation and data extraction easier for developers familiar with jQuery.
It's super easy to set up, even for beginner developers.
Its modularity allows us to extend it with Node modules and customize it according to our needs.
The best thing about Cheerio is it doesn't require any browser for data extraction. Because it's designed explicitly for Node.js, we can use it on the server side without any worries.

That's enough for an introduction to web scraping and Cheerio. Let's get our hands dirty and jump into the environment.

Prerequisites

Before diving into the code, there are a few prerequisites for learning Cheerio.

An IDE installed
Basic knowledge of JavaScript and Node.js
Basic knowledge of browser DevTools

Setting up the Cheerio environment

Now we need to set up our development environment. The good news is that it's super easy!

First, we'll need to install Node.js on our local machine. For that, head to the Node.js website and download the appropriate installer for your operating system.

Once Node.js is installed, we'll use a package manager called npm to install and manage all the third-party libraries in Node.js. Luckily, npm comes bundled with Node.js, so we don't need to install anything extra.

We've set up Node.js and npm. Now we can create a new Node.js project using the command line interface. From there, we can add Cheerio as a dependency and start using it to extract data and manipulate HTML easily.

mkdir webscraper         
cd webscraper            
npm init -y               
npm install cheerio

We can use simple commands to ensure we have Node.js and Cheerio installed correctly.

node -v

npm list cheerio

If anything goes wrong, review the installation steps; common causes are downloading the wrong Node.js installer, network connectivity problems, and missing dependencies.

Next, we need to specify the module format. We will use ES modules in this tutorial because they allow us to use modern JavaScript features, such as top-level await, that will be used in a later section of this tutorial. To specify the format, we'll open package.json and add the following field:

"type": "module",

We've completed our installations and made the environment ready. Now let's learn about the Cheerio API.

The Cheerio API

The Cheerio API refers specifically to the set of methods and functions provided by the Cheerio library. These methods help you to extract and modify HTML very quickly.

The selector API in Cheerio can be accessed through the $() method, which has the following structure:

$(selector, [context], [root])

It takes three arguments: the first is compulsory, and the other two are optional.

The selector argument specifies the elements you want to select from the HTML document. It can be a string, DOM element, array of elements, or Cheerio objects.
The optional context argument is actually the scope of where to begin looking for the target elements. It can be a string, DOM element, array of elements, or Cheerio objects. If no context is specified, it searches the entire document.
The optional root argument is usually the markup string you want to traverse or manipulate. It can be used to specify a different root element for the selected elements.

Once the elements have been selected, we can use other functions to manipulate them. For example, use the attr() function to get or set the value of an attribute or the text() function to get or set the text content of an element.

How to Scrape web pages with Cheerio

Now it's time to dive into some practical examples of using Cheerio for web scraping and HTML parsing. For that, create an index.js file in your project directory, either from your editor or from the command line.

touch index.js

First, we'll need to load the HTML or XML document we want to parse using the load function.

Load the HTML

We can load the HTML using the load function. This function takes a string containing the HTML as an argument and returns an object.

// Import the library (the namespace import also works in Cheerio 1.0, which dropped the default export)
import * as cheerio from 'cheerio';
// Load the HTML string.
const $ = cheerio.load(`
  <body>
    <h1>Hello from Cheerio</h1>
  </body>
`);

📔 Note: The resulting object is assigned to the variable $, which is a common convention used to refer to jQuery objects in JavaScript.

After executing this line of code, we can manipulate the HTML by calling methods on the $ object provided by Cheerio.

Cheerio selectors

Cheerio makes it easy to select elements using CSS-style selectors. It allows us to select elements based on tag, class, and attribute values.

Select elements with a tag name

The tagName selector allows us to select elements with a specific tag name. For example, to select all h3 tags in the document, we can use the selector like this:

import * as cheerio from 'cheerio';
const $ = cheerio.load(`
    <body>
       <h3>I'm learning Cheerio</h3>
       <h3>It's super easy</h3>
    </body>
`);
const $headings = $('h3');
console.log($headings.text()); //Output: I'm learning CheerioIt's super easy

The selector 'h3' gets all the <h3> elements from the document and returns a Cheerio object, which we store in the $headings variable. Calling .text() on it joins the text of every matched element without adding a separator.

📔 Note: The .text() method in Cheerio is used to extract the text content of an HTML element. It's used on a Cheerio object representing a single element or a collection of elements.

Select elements with a class name

We can also select elements with their class names using the '.className' selector. Let's take an example to get all the elements with a class name classSelector.

import * as cheerio from 'cheerio';
const $ = cheerio.load(` 
    <body>
        <h3 class="classSelector"> Learning platforms: </h3>
        <ul>
            <li class="classSelector">Apify</li>
            <li>Coursera</li>
            <li class="classSelector">Udemy</li>
            <li>Freecodecamp</li>
        </ul>
    </body>
`);
const $selection = $('.classSelector'); //Select the classSelector class
console.log($selection.text()); //Output: " Learning platforms: ApifyUdemy" (the texts are joined without a separator)

Select elements with an attribute value

We can select an element by its attribute using the '[attribute]' selector. Adding a value, as in '[attribute=value]', narrows the selection further.

import * as cheerio from 'cheerio';
const $ = cheerio.load(`
    <body>
      <h3>Terms and Conditions: </h3>
      <form> 
        <button name="Accept" type="submit">Accept</button>
        <button name="Reject" type="submit">Reject</button>      
      </form>
    </body>
  `);
const $selectedElements = $('[name]'); //Selects both the buttons
console.log($selectedElements.text()); //Output: AcceptReject

📔 Note: To select only the first button, we can specify the value of the attribute like $('[name=Accept]').

The attribute selectors can be used with any HTML attribute, not just data-* attributes. For example, $('img[alt="example"]') will select all img elements with an alt attribute having a value of "example".

Selecting elements with a combination of Cheerio selectors

Combining Cheerio selectors is an effective way to select specific elements from an HTML. Here are a few examples:

Selecting an element with a particular tag and class:

$('h3.class')

Selecting all the elements that are direct children of another element:

$('ul > li')

Selecting an element that immediately follows a sibling:

$('p + ul')

Selecting all elements that match multiple selectors:

$('h1, h2, h3')

Selecting elements based on their position in the document:

$('li:nth-child(odd)') //This selects all odd-numbered li elements in the document.

These are just a few examples of how selectors can be combined to select elements that match multiple criteria. By using various combinations of selectors, we can select very specific elements within an HTML document.

Traversing the DOM

Cheerio provides methods for navigating from the selected elements in any direction, for example, to find the children or the parent of a selected element. Let's discuss them one by one.

find

The find method searches the descendants of the selected elements for matches of a selector. It takes a selector as an argument and returns a new group of elements based on that.

import * as cheerio from 'cheerio';
const $ = cheerio.load(`
  <div class="parent">
    <p>Hello, world!</p>
    <span>How are you?</span>
    <p class="day">Have a nice day!</p>
  </div>
`);
const welcome = $('.parent').find('.day');
console.log(welcome.text()) //Output : Have a nice day!

In the example above, first, we selected the element with the parent class, and then from that element, we found the element with the class name day.

children

The children method allows us to select the children of any selected element. Let's take an example.

import * as cheerio from 'cheerio';
const $ = cheerio.load(`
  <div class="parent">
    <p>Hello, world!</p>
    <span>How are you?</span>
    <p class="day">Have a nice day!</p>
    <div>
      <span class="day">
          Grandchildren of ".parent" class
        </span>
      </div>
  </div>
`);
const welcome = $('.parent').children('span');
console.log(welcome.text()) //Output: How are you?

If there's no argument passed to the children function, it will return all the child elements.

📔 Note: find searches for matching descendant elements at any level below the selected element, whereas children only looks for direct child elements of the selected element.

Traversing functions

There are many other methods, but we'll not go into all the details. Let's just have a quick look at these:

contents: This method works like children, but it also includes text and comment nodes.
parent: It gives us the direct parent of the selected element.
parents: It gives us all the ancestors of the element, up to the root element.
parentsUntil: It gives us the ancestors of the element up to, but not including, the element matched by the given selector, so we can limit how far upwards we go.
closest: It selects the nearest element that matches a given Cheerio selector, starting with the element itself and moving up through its ancestors. For example, $('.child').closest('div');
next and prev: The next method selects the sibling element that follows the selected element, and the prev method selects the sibling that precedes it.

Cheerio provides many other methods. If you're interested in learning more, you can find them in the Cheerio documentation.

How to loop over elements

Cheerio provides methods such as each and map, which work much like JavaScript's forEach and map, to loop over the selected elements and perform specific operations. For example, let's look at the each method and see how it works.

import * as cheerio from 'cheerio';
const $ = cheerio.load(`
  <body>
      <h3 class="classSelector"> Learning platforms: </h3>
      <ul>
          <li class="classSelector">Apify</li>
          <li>Coursera</li>
          <li class="classSelector">Udemy</li>
          <li>Freecodecamp</li>
      </ul>
  </body>` );
const listItems = $('li')
listItems.each((index, element) => {
   console.log($(element).text());
 });

The each method takes a callback function as an argument. The callback receives two parameters: the first one is the index, starting from zero, and the second is the current element.

Selecting elements using regular expressions

To use regular expressions with Cheerio, we can use JavaScript's built-in .match() string method. The following example collects every list item whose text contains an @, which is a very loose check for an email address:

import * as cheerio from 'cheerio';
const $ = cheerio.load(`
    <body>
      <ul class="email">
        <li>Apify@gmail.com</li>
        <li>ondra@gmail.com</li>
        <li>Chris</li>
        <li>amanda@gmail.com</li>
        <li>notAnEmail</li>
        <li>notAnEmailAtAll</li>
        <li>Queue</li>
        <li>jamesbond@gmail.com</li>
      </ul>
    </body>`);
const emails = []
const userNames = $('.email li'); //Get the emails
userNames.each((index,el) => { // Iterating over the emails
  const regex = /@/; // Expression to be matched
  const email = $(el).text().match(regex); // Match the text of each li item with the expression
  if(email){
    emails.push(email.input) //Push, if the return value is not NULL
  }
});
console.log(emails);

📔 Note : The .match() function returns an array with extra properties (such as index and input) if the text matches the expression, or null if it doesn't. That's why we're using email.input to get the original text back.
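To make the return value of .match() concrete, here is a plain JavaScript sketch that needs no Cheerio. The two sample strings are taken from the list above; the filter at the end is an alternative shown only for comparison.

```javascript
const candidates = ['ondra@gmail.com', 'Chris'];

const hit = candidates[0].match(/@/);
const miss = candidates[1].match(/@/);

console.log(hit[0]);     // @
console.log(hit.index);  // 5
console.log(hit.input);  // ondra@gmail.com
console.log(miss);       // null

// When only a yes/no answer is needed, RegExp.prototype.test() is enough
const emails = candidates.filter((text) => /@/.test(text));
console.log(emails);     // [ 'ondra@gmail.com' ]
```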

Filtering elements

With filtering, we can select only the specific elements we want and ignore the rest. Cheerio provides several methods for filtering elements within a selection, like filter, not, has, etc. Let's go through them one by one.

filter

The filter method in Cheerio works just like the filter method of JavaScript. It lets us cherry-pick elements based on a specific selector. It's super handy when we're dealing with large amounts of data and want to narrow down the focus.

Let's look at an example of filtering li with a specific class.

import * as cheerio from 'cheerio';
const $ = cheerio.load(`
<ul class="birds">
  <li class="parrot">
    <span class="sharp">Parrots</span> are beautiful birds with <span class="beaks">sharp beaks</span>
  </li>

  <li class="sharp">
    They are superfast.
  </li>

  <li class="crow">
    Crows are smart
  </li>

</ul>
`);
const $selectedElements = $('li.parrot'); // Get the 'li' with the 'parrot' class
const $parrot = $('span').filter('.sharp'); // Keep only the span elements with the class 'sharp'
console.log($parrot.text()); // Output: Parrots

not

The not method is opposite to the filter method. With this clever tool, we can easily exclude elements we don't want and focus on the important ones. It can save us much time and effort, especially when dealing with extensive data.

Let's take the previous example, but we're not going to select the ones with the sharp class this time.

import * as cheerio from 'cheerio';
const $ = cheerio.load(`
<ul class="birds">
  <li class="parrot">
    <span class="sharp">Parrots</span> are beautiful birds with <span class="beaks">sharp beaks</span>
  </li>

  <li class="sharp">
    They are superfast.
  </li>

  <li class="crow">
    Crows are smart
  </li>

</ul>
`);
const $selectedElements = $('li.parrot'); // Get the 'li' with the 'parrot' class
const $parrot = $('span').not('.sharp'); // Exclude the span elements with the class 'sharp'
console.log($parrot.text()); // Output: sharp beaks

has

While web scraping with Cheerio, you might want to find elements that contain a specific child element, like a span or an image. That's what the has method is for. It takes a Cheerio selector as an argument and keeps only the elements in the selection that have a descendant matching that selector.

For example, we're searching for products that are on a discount.

import * as cheerio from 'cheerio';
const script = `
<div class="product">
  <h2 class="name">Product 1</h2>
  <span class="price">$10</span>
  <span class="discount">20% off</span>
</div>
<div class="product">
  <h2 class="name">Product 2</h2>
  <span class="price">$20</span>
</div>
<div class="product">
  <h2 class="name">Product 3</h2>
  <span class="price">$30</span>
  <span class="discount">10% off</span>
</div>`;
const $ = cheerio.load(script);
// Keep only the .product elements that contain a descendant with the class "discount"
const discountedProducts = $('.product').has('.discount');

console.log(discountedProducts.text())
//Output : Product 1
// $10
// 20% off

// Product 3
// $30
// 10% off

This is handy when web scraping eCommerce websites and looking for discounted products. It allows you to quickly pick out the products that have a discount.

eq

The eq method works like array indexing on the selected elements. It allows us to select the element at a specific index, counting from zero.

Let's take an example where we select the second element from li elements.

import * as cheerio from 'cheerio';
const $ = cheerio.load(`
    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
    </ul>
`);
  const secondItem = $('li').eq(1);
  console.log(secondItem.text()); //Output: Item 2

📔 Note: It doesn't throw an error if we provide an index that's out of bounds; it simply returns an empty selection. For example, if we pass 4 to .eq in the example above, the result contains no elements and its text is an empty string.

first and last

The methods first and last work the same way as accessing elements from the array. The first method returns the first element from the selection, and the last method returns the last element.

Let's take an example where we'll select the last and first element of the ol.

import * as cheerio from 'cheerio';
const $ = cheerio.load(
    `<ol>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ol>`
  );
  const firstItem = $('li').first();
  const lastItem = $('li').last();
  console.log(firstItem.text()," ",lastItem.text()) //Output: Item 1 Item 3

What would happen if the selection contained just one element? The first and last elements would be the same.

So far, we've learned to extract elements from the HTML document, but what if we want the retrieved data to be structured and well formatted? Let's find out how to do that.

How to extract data from HTML tags using Cheerio

We've learned to extract data from HTML documents in a simple way. However, what if the data you need to extract is large and needs to be structured? That's where plain JavaScript objects come in. They allow us to collect the data extracted with Cheerio in a structured way.

You can specify what data you want to extract by adding keys and values to the object. The keys represent the names of the properties you want to create, while the values are produced by the Cheerio selectors you use to extract the data.

Here's an example:

import * as cheerio from 'cheerio';
const $ = cheerio.load(`
    <div class="book-info">
        <h2 class="book-title">The Great Gatsby</h2>
        <p class="author">By F. Scott Fitzgerald</p>
        <p class="release-date">Released: April 10, 1925</p>
        <p class="price">Price: $12.99</p>
    </div>
`);
const structuredData = {
    Book_Title: $('.book-title').text(),
    Author: $('.author').text(),
    Release_Date: $('.release-date').text(),
    Price: $('.price').text()
};
console.log(structuredData);

We're creating a JavaScript object structuredData with the properties Book_Title, Author, Release_Date, and Price. We're using class selectors to select the elements. The output will be an object with the extracted data:

{
  Book_Title: 'The Great Gatsby',
  Author: 'By F. Scott Fitzgerald',
  Release_Date: 'Released: April 10, 1925',
  Price: 'Price: $12.99'
}

What if we have multiple books and want to get all of them from the document? Let's look into that.

import * as cheerio from 'cheerio';
const $ = cheerio.load(`
    <div class="book-info">
        <h2 class="book-title">The Great Gatsby</h2>
        <p class="author">By F. Scott Fitzgerald</p>
        <p class="release-date">Released: April 10, 1925</p>
        <p class="price">Price: $12.99</p>
    </div>
    <div class="book-info">
        <h2 class="book-title">To Kill a Mockingbird</h2>
        <p class="author">By Harper Lee</p>
        <p class="release-date">Released: July 11, 1960</p>
        <p class="price">Price: $10.99</p>
    </div>
    <div class="book-info">
        <h2 class="book-title">1984</h2>
        <p class="author">By George Orwell</p>
        <p class="release-date">Released: June 8, 1949</p>
        <p class="price">Price: $9.99</p>
    </div>
`);
const books = [];
const bookElements = $('.book-info');
bookElements.each((index, book) => {
    const structuredData = {
        Book_Title: $(book).find('.book-title').text(),
        Author: $(book).find('.author').text(),
        Release_Date: $(book).find('.release-date').text(),
        Price: $(book).find('.price').text()
    };
    books.push(structuredData);
});
console.log(books);

The output of the code is an array of objects:

[
  {
    Book_Title: 'The Great Gatsby',
    Author: 'By F. Scott Fitzgerald',
    Release_Date: 'Released: April 10, 1925',
    Price: 'Price: $12.99'
  },
  {
    Book_Title: 'To Kill a Mockingbird',
    Author: 'By Harper Lee',
    Release_Date: 'Released: July 11, 1960',
    Price: 'Price: $10.99'
  },
  {
    Book_Title: '1984',
    Author: 'By George Orwell',
    Release_Date: 'Released: June 8, 1949',
    Price: 'Price: $9.99'
  }
]

This code uses the each method to iterate over the collection of selected .book-info elements. For each element, the code creates an object structuredData that contains the extracted data. The data is extracted using the find method, which searches the descendants of each element for the relevant classes.

Finally, the structuredData object is pushed to the books array using the push method. The result is an array of objects, each representing a book and containing the extracted data.

How to write extracted data in a file

What if we want to store the scraped data on our local disk? We'll use the example above to store the data in a JSON file. We can use the built-in fs module in Node.js, which allows us to interact with the file system on a computer. Here's how you can modify the code to write the books array to a JSON file:

import * as cheerio from 'cheerio';
import fs from 'fs'; // Import the "fs" module

const $ = cheerio.load(`
    <div class="book-info">
        <h2 class="book-title">The Great Gatsby</h2>
        <p class="author">By F. Scott Fitzgerald</p>
        <p class="release-date">Released: April 10, 1925</p>
        <p class="price">Price: $12.99</p>
    </div>
    <div class="book-info">
        <h2 class="book-title">To Kill a Mockingbird</h2>
        <p class="author">By Harper Lee</p>
        <p class="release-date">Released: July 11, 1960</p>
        <p class="price">Price: $10.99</p>
    </div>
    <div class="book-info">
        <h2 class="book-title">1984</h2>
        <p class="author">By George Orwell</p>
        <p class="release-date">Released: June 8, 1949</p>
        <p class="price">Price: $9.99</p>
    </div>
`);
const books = [];
const bookElements = $('.book-info');
bookElements.each((index, book) => {
    const structuredData = {
        Book_Title: $(book).find('.book-title').text(),
        Author: $(book).find('.author').text(),
        Release_Date: $(book).find('.release-date').text(),
        Price: $(book).find('.price').text()
    };
    books.push(structuredData); // Push the object to the array
});
const jsonData = JSON.stringify(books); // Convert the array to a JSON string

fs.writeFile('books.json', jsonData, (err) => { // Write the JSON string to books.json
  if (err) {
    console.error('Could not write file:', err); // The write failed
    return;
  }
  console.log('Data written to file'); // The write succeeded
});

📔 Note : The fs.writeFile takes three arguments:

books.json : The file in which we will write data. If the file already exists, its content is replaced; if it doesn't, the file is created first and then the data is written to it.

jsonData : Data to be written in the file.

callback : A function that is called once the write has finished. If the write failed, it receives the error as its first argument.
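By default, JSON.stringify produces a single line, which is compact but hard to read in an editor. As a small sketch, the optional third argument adds indentation; the one-book array below is sample data, and both forms parse back to the same value.

```javascript
const books = [
  { Book_Title: '1984', Author: 'By George Orwell' },
];

const compact = JSON.stringify(books);
const pretty = JSON.stringify(books, null, 2); // Indent nested levels by 2 spaces

console.log(compact); // [{"Book_Title":"1984","Author":"By George Orwell"}]
console.log(pretty);

// Both strings parse back to the same data
const roundTrip = JSON.parse(pretty);
console.log(roundTrip[0].Book_Title); // 1984
```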

How to handle errors and exceptions with Cheerio

While web scraping with Cheerio, we may encounter errors and exceptions as we work with the code. Don't worry: this is a common experience for most beginners. Let's explore how to handle these errors and exceptions so that you can become a more confident and successful web scraper developer.

To handle errors, we can use try-catch blocks in our code. The try block wraps code that might fail, and if an error is thrown inside it, the catch block executes, allowing us to handle the error appropriately.

Here's an example of how to use try-catch blocks with Cheerio:

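import * as cheerio from 'cheerio';

const html = '<h1>Hello, world!</h1>'; // Sample HTML; in practice this comes from a request
try {
  const $ = cheerio.load(html);
  console.log($('h1').text());
} catch (err) {
  console.error('Failed to load HTML:', err);
}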

This code uses a try-catch block to handle errors that might occur while loading an HTML document. Overall, this code ensures that the program does not crash if an error occurs during the loading of the HTML document and provides a way to handle the error.
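In practice, cheerio.load tends to accept even malformed HTML, so errors often come from the extracted data instead, for example a price that contains no number. The sketch below uses a hypothetical helper, parsePrice, and made-up input strings to show how try-catch lets the loop skip a bad item and continue.

```javascript
// Hypothetical helper: turns a string such as "Price: $12.99" into a number
function parsePrice(text) {
  const match = text.match(/\d+(\.\d+)?/);
  if (!match) {
    throw new Error(`No price found in "${text}"`);
  }
  return Number(match[0]);
}

const rawPrices = ['Price: $12.99', 'Price: unavailable', 'Price: $9.99'];
const prices = [];

for (const raw of rawPrices) {
  try {
    prices.push(parsePrice(raw));
  } catch (err) {
    // Log the problem and keep going with the remaining items
    console.error(`Skipping item: ${err.message}`);
  }
}

console.log(prices); // [ 12.99, 9.99 ]
```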

Congratulations on completing the basics of web scraping! It's been an exciting journey so far, and now we're ready to take the next step and dive into extracting the HTML of a website. But that's just the tip of the iceberg - there's so much more to learn. So let's continue this journey together and see how we can interact with actual websites.

How to use Axios with Cheerio

Have you ever wondered how we get the HTML document of a web page? Axios allows us to make HTTP requests and get the HTML document of a website.

To use Axios in our project, we need to install it in our environment by entering the following command:

npm install axios

If you're interested in learning more about Axios with Cheerio, go to this blog post and learn more.

Let's use the Axios library in the following example.

Scraping multiple pages using Cheerio

It's worth noting that there's a difference between pagination and scraping multiple pages. Pagination refers to dividing content into separate pages, often with navigation links to move between them. On the other hand, scraping multiple pages involves extracting data from many pages of a website, whether they're paginated or not.

Why does this matter? When scraping paginated content, we can use the navigation links to move between pages and extract data from each page. However, when scraping multiple pages that are not paginated, we'll require a different approach to identify and extract the data we need.
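The two approaches can be sketched in plain JavaScript. All URLs and IDs below are hypothetical: the first part resolves the href of a "next" link against the current page, and the second part builds a list of unrelated pages up front.

```javascript
// Hypothetical values: the page we are on and the href of its "next" link
const currentPage = 'https://example.com/catalog/page-1.html';
const nextHref = 'page-2.html';

// new URL(relative, base) resolves the link the same way a browser would
const nextUrl = new URL(nextHref, currentPage).href;
console.log(nextUrl); // https://example.com/catalog/page-2.html

// Pages that are not linked together can be listed up front instead
const ids = [150, 151, 152];
const urls = ids.map((id) => `https://example.com/items/${id}`);
console.log(urls.length); // 3
```

In a real scraper, nextHref would come from a Cheerio selection such as the href attribute of the pagination link.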

Let's see how we can scrape multiple pages with Cheerio.

Here, each actor is at a different link, and the links differ only in their last digit. We can use Cheerio to extract different actors from different pages using a loop. Let's do this.

import axios from 'axios';
import * as cheerio from 'cheerio';
const Actors = []; // Declare an empty array to store the actors' data
for (let i = 0; i < 5; i++) { // Change the last digit of the URL to get the data of another actor
    const url = `https://www.imdb.com/name/nm000015${i}`;
    const response = await axios.get(url); // Plain request without any custom headers
    const $ = cheerio.load(response.data);
    const name = $('h1 .sc-afe43def-1').text(); // Get the name of the actor
    const Date_of_Birth = $('.sc-dec7a8b-2').eq(1).text(); // Get the date of birth
    Actors.push({ name, Date_of_Birth });
}
console.log(Actors);

But when we try to access the website to fetch data, we get a 403 error because the server has detected that the request is coming from an automated script. It's a very common anti-scraping technique used by many modern websites. So, we need to find a way to tackle this problem. One way is to send headers with our Axios request so that it looks like a request from an actual browser. In this example, we'll use a User-Agent header that indicates that the request is made from an actual browser rather than an automated script.

Let's see how we can do this.

import axios from 'axios';
import * as cheerio from 'cheerio';
const headers = {
    // A browser-like User-Agent string; the value of any current browser can be used
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36',
};
const Actors = []; // Declare an empty array to store the actors' data
for (let i = 0; i < 5; i++) { // Change the last digit of the URL to get the data of another actor
    const url = `https://www.imdb.com/name/nm000015${i}`;
    const response = await axios.get(url, { headers }); // Add the headers to the axios request
    const $ = cheerio.load(response.data);
    const name = $('h1 .sc-afe43def-1').text(); // Get the name of the actor
    const Date_of_Birth = $('.sc-dec7a8b-2').eq(1).text(); // Get the date of birth
    Actors.push({ name, Date_of_Birth });
}
console.log(Actors);

Related reading on blog.apify.com: Web scraping: how to solve 403 errors

Challenges of web scraping with Cheerio

Like all web scraping libraries, Cheerio has its pros and cons. We have already discussed the pros of using Cheerio. Now let's discuss what challenges we can face while working with Cheerio.

Dynamic websites and JavaScript

Let's start with one of the biggest challenges of web scraping with Cheerio: dealing with dynamic websites and JavaScript. Nowadays, modern websites use dynamic content and JavaScript to update or modify the page content without needing an entire page reload. That can make it tough to scrape the website with Cheerio alone, because the content may not be fully loaded, or may be loaded asynchronously after the initial page load.

When Cheerio loads an HTML page, it only sees the static HTML content. Any JavaScript or dynamic content that's loaded later is hidden from Cheerio. That can be a letdown if the website you're trying to scrape uses JavaScript to load or modify content.

Anti-scraping measures

Do you know why some scrapers get blocked and don't work as efficiently as they should? If we're smart enough to make scrapers scrape websites, the developers are also ready to block us and stop us from extracting data from their websites. They use different anti-scraping techniques to do this. Let's discuss a few of these:

CAPTCHAs : CAPTCHAs are tests that are designed to differentiate between human users and bots. They require users to complete a task, such as typing in a sequence of letters or identifying objects in an image.
IP blocking : Websites block IP addresses that are associated with web scraping. This means that if a particular IP address is identified as a scraper, the website will block all requests from that IP address.
User-agent detection : User-agent detection is a technique that identifies the type of browser or device that is being used to access the website. Websites can use this information to identify scrapers, as many scrapers use non-standard user agents.
Dynamic web pages : Some websites use dynamic web pages that are generated using JavaScript. These pages can be more difficult to scrape, as the content is generated on the fly and may not be present in the page source.

As a web scraper developer, it's important to be aware of these anti-scraping measures and to take steps to handle them. This may include rotating proxies or user agents, managing cookies, adding delays between requests, and much more. You can learn about these measures here.
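Two of these steps, delays between requests and rotating user agents, can be sketched in plain JavaScript. Everything here is illustrative: sleep, pickUserAgent, and politeLoop are hypothetical helper names, the user agent strings are samples, and the actual request is left as a comment.

```javascript
// Hypothetical list; a real scraper would use current, complete browser strings
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.4 Safari/605.1.15',
];

// Resolve after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Pick the user agent for the n-th request, cycling through the list
const pickUserAgent = (requestNumber) => userAgents[requestNumber % userAgents.length];

async function politeLoop(urls, delayMs) {
  const plan = [];
  for (let i = 0; i < urls.length; i++) {
    const headers = { 'User-Agent': pickUserAgent(i) };
    // A real scraper would send the request here, e.g. axios.get(urls[i], { headers })
    plan.push({ url: urls[i], headers });
    if (i < urls.length - 1) await sleep(delayMs); // Pause between requests
  }
  return plan;
}

politeLoop(['https://example.com/a', 'https://example.com/b', 'https://example.com/c'], 100)
  .then((plan) => console.log(plan.map((entry) => entry.url)));
```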

Performance issues

When working with large amounts of data or complex HTML structures, it's common to store intermediate results, such as a selection, and reuse them later rather than recomputing them each time. This works in Cheerio just as it does in jQuery: a selection saved in a variable can be reused without querying the document again.

The cost to keep in mind is memory. Cheerio parses the whole HTML string into an in-memory document when load is called, so very large documents, or many documents kept around at once, can use a lot of memory in Node.js.

Running the same selector again and again, for example inside a loop, makes Cheerio traverse the document each time. It can be slow and resource-intensive, especially when working with large documents.

Congratulations on completing the basics of web scraping and Cheerio! Now it's time to level up and learn a few advanced concepts of web scraping.

Related reading on blog.apify.com: Web scraping: how to crawl without getting blocked

Scraping websites with dynamic content

Advanced websites load data at runtime, and hence it becomes difficult to extract data using Cheerio. That's where other libraries come in to help Cheerio. Let's take a look at these libraries and see how they help.

Puppeteer

Before we start learning Puppeteer, let's discuss where Cheerio fails. While Cheerio is an excellent tool, it fails when it comes to parsing websites that load data at runtime and need a browser to render them. Puppeteer helps us open the website in a headless browser: a real browser that runs without a visible window. Once the website is loaded using Puppeteer, we can use Cheerio to load its HTML and do whatever we want. Pretty amazing, right?

Related reading on blog.apify.com: How to scrape the web with Puppeteer in 2023

Here are some of the core functions provided by Puppeteer:

puppeteer.launch([options]): This function launches an instance of Chrome or Chromium with a set of options specified in an object. It returns a Promise that resolves to an instance of the Browser class.
browser.newPage(): This function creates a new blank page in the browser instance. It returns a Promise that resolves to an instance of the Page class.
page.goto(URL, [options]): This function navigates to the specified URL. It returns a Promise that resolves when the page is loaded.
page.content(): This function returns the HTML content of the current page. It returns a Promise that resolves to a string.
browser.close(): This function closes the browser instance that was opened with puppeteer.launch(). It terminates the browser process and all of its pages.

These are just some of the functions provided by Puppeteer. There are other functions available as well.

To install Puppeteer in your environment, run the following command:

npm install puppeteer

Let's take a closer look at how this works with an example. We want to extract the Top 250 Movies from the IMDb website in this example.

In the page's HTML, all the movies are in one table body tag <tbody> with the class lister-list, and each movie is a <tr>. Now, let's start using Puppeteer with Cheerio.

import puppeteer from 'puppeteer';
import * as cheerio from 'cheerio';
(async () => {
  const browser = await puppeteer.launch(); // Launch the browser
  const page = await browser.newPage(); // Open a new page
  await page.goto('https://www.imdb.com/chart/top'); // Navigate to the provided URL
  const html = await page.content(); // Get the HTML content of the page
  const $ = cheerio.load(html);
  const movies = [];
  const moviesTr = $('tbody.lister-list tr'); // Select all the table rows using Cheerio selectors
  moviesTr.each((i, movie) => { // Iterate over all the movies one by one
    const title = $(movie).find('.titleColumn a').text();
    const year = $(movie).find('.titleColumn .secondaryInfo').text();
    const rating = $(movie).find('.ratingColumn strong').text();
    movies.push({ title, year, rating }); // Push the extracted data
  });
  console.log(movies);
  await browser.close(); // Close the browser
})();

Playwright

Playwright is also an open-source Node.js library. It was created at Microsoft by engineers who had previously worked on Puppeteer, and it is a powerful and versatile alternative to it. The team wanted a tool that could automate not just Chromium-based browsers but also Firefox and WebKit, the engine behind Safari. So, they developed Playwright, which supports all three.

Related reading on blog.apify.com: How to scrape the web with Playwright in 2023

To install Playwright in your environment, run the following command:

npm install playwright

Let's implement the above code with Playwright.

import { chromium } from 'playwright'; // Import chromium from playwright
import * as cheerio from 'cheerio';

(async () => {
  const browser = await chromium.launch(); // Launch the browser
  const context = await browser.newContext(); // Create a new context
  const page = await context.newPage(); // Create a new page
  await page.goto('https://www.imdb.com/chart/top'); // Navigate to the provided URL
  const html = await page.content(); // Get the page content
  const $ = cheerio.load(html);
  const movies = [];
  const moviesTr = $('tbody.lister-list tr'); // Select all the table rows using Cheerio selectors
  moviesTr.each((i, movie) => { // Iterate over all the movies one by one
    const title = $(movie).find('.titleColumn a').text();
    const year = $(movie).find('.titleColumn .secondaryInfo').text();
    const rating = $(movie).find('.ratingColumn strong').text();
    movies.push({ title, year, rating }); // Push the extracted data
  });
  console.log(movies);
  await browser.close(); // Close the browser
})();

Playwright is a more flexible and powerful automation framework than Puppeteer, which makes it an excellent alternative for advanced web automation.

How to implement authentication

What if a website requires authentication? That's another challenge, but don't worry: we've got you covered. It's not possible to extract data from such websites without logging in. Therefore, we need to find a way to authenticate ourselves through our code to open the page from where we can extract the data.

For this purpose, we need the help of the Puppeteer library, which will do the authentication task. And obviously, our well-known tool, Cheerio, will extract the data. With Puppeteer and Cheerio, we can automate the authentication process and extract the data we need.

Let's look at an example to understand how to implement authentication with Cheerio and Puppeteer. We'll log in to the News API website (newsapi.org).

Let's dive into the code and see how this magic happens:

import puppeteer from 'puppeteer';
import * as cheerio from 'cheerio';
(async () => {
    const browser = await puppeteer.launch(); // Launch the browser
    const page = await browser.newPage(); // Open a new page

    await page.goto('https://newsapi.org/login'); // Go to the login page

    await page.type('#Email', 'your_email'); // Type the email into its input field
    await page.type('#Password', 'your_password'); // Type the password into its input field

    // Click the submit button and wait for the navigation it triggers.
    // Starting both together avoids missing a navigation that finishes quickly.
    await Promise.all([
        page.waitForNavigation(),
        page.click('button[type=submit]'),
    ]);

    await page.goto('https://newsapi.org/account'); // Go to the account page
    // Now we can read the page according to our needs
    const html = await page.content();
    const $ = cheerio.load(html);
    console.log($('.mb2').text());
    await browser.close();
})();

How to handle asynchronous requests, errors, and retries

When web scraping with Cheerio, it's important to consider errors and retries to ensure our code is reliable and robust. As we saw earlier, error handling allows us to detect errors and react to them. When an error occurs while fetching a website, the request often just needs to be retried, and that's what we'll cover now. We'll see how things go from requesting a website to handling errors and trying again. We've already seen the axios library for making HTTP requests. The axios-retry plugin adds automatic retries to it.

To use axios-retry, first we need to install it as a dependency and import it into our file.

npm install axios-retry

Let's see an example.

import axios from 'axios';
import axiosRetry from 'axios-retry';
axiosRetry(axios, {
  retries: 3, // Retry a failed request up to three times
  retryDelay: (retryCount) => {
    return retryCount * 2000; // Wait 2, 4, and then 6 seconds before each retry
  },
});
async function scrapeWebsite() {
  try {
    const response = await axios.get('link-to-the-website');
    console.log(response.status);
  } catch (error) {
    console.error(error);
  }
}
scrapeWebsite();

First, scrapeWebsite is called, and it tries to fetch the website passed to axios.get. If the request fails, axios-retry sends it again. It retries up to three times, and the wait before each retry grows in multiples of 2,000 milliseconds. If the request still fails after the last retry, the error is thrown and handled in the catch block. Note that, by default, axios-retry only retries network errors and 5xx responses to idempotent requests such as GET.
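To see what such a retry policy does under the hood, here is a dependency-free sketch of the same idea. It is not how axios-retry is implemented; withRetry, sleep, and flakyTask are hypothetical names, and the fake task stands in for an HTTP request.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: calls task() and retries it up to `retries` more times
async function withRetry(task, { retries = 3, baseDelayMs = 2000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        // Wait 1x, 2x, 3x ... the base delay before the next attempt
        await sleep((attempt + 1) * baseDelayMs);
      }
    }
  }
  throw lastError; // Every attempt failed
}

// Usage with a fake task that fails twice and then succeeds
let calls = 0;
const flakyTask = async () => {
  calls += 1;
  if (calls < 3) throw new Error(`attempt ${calls} failed`);
  return 'page html';
};

withRetry(flakyTask, { retries: 3, baseDelayMs: 10 })
  .then((result) => console.log(result, calls)); // page html 3
```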

How to use a testing framework with Cheerio

It's important to test your code thoroughly before it goes into production. By testing with various inputs and outputs, we can be confident that the code works correctly and meets our requirements. Deploying code without proper testing can lead to unexpected problems, which can cause delays and additional costs to the project. Therefore, testing the code is a good practice to ensure the quality and reliability of the software.

Let's see how we can use a framework to test our Cheerio code.

Test code with Jest

Jest is a popular framework for testing JavaScript applications. It was developed by Facebook, and one of its key features is that it can run tests in parallel, which makes it fast and efficient. It's also easy to set up.

Let's see how we can add Jest to our environment.

Open the terminal and enter the command below:

npm install --save-dev jest

We pass the optional --save-dev argument to record that this dependency is only needed during development.

Open the package.json and replace:

"test": ""

with the following:

"test": "node --experimental-vm-modules node_modules/.bin/jest"

Write the tests in a .test.js file and enter the command below to see the results:

npm test

How to write a test with Jest

Let's use some code that we wrote earlier in this Cheerio tutorial.

import * as cheerio from 'cheerio';

function getBookInfo(htmlString) {
  const $ = cheerio.load(htmlString);
  const structuredData = {
    bookTitle: $('.book-title').text().trim(),
    author: $('.author').text().trim(),
    releaseDate: $('.release-date').text().trim(),
    price: $('.price').text().trim()
  };
  return structuredData;
}
export default getBookInfo;

We have a function named getBookInfo that accepts an HTML string as input and returns an object named structuredData. In the end, we're exporting the function so that we can test it in our test file.

Let's see how well this code works. Assuming the function above is saved in index.js, we'll open our IDE, create a file named structureData.test.js, and write the following code.

import getBookInfo from './index'; // Import the function getBookInfo
// test() takes two arguments: a description of the test
// and a function that contains the assertions.
test('getBookInfo returns the correct book information', () => {
  // HTML string that will be used for testing
  const htmlString = `
    <div class="book-info">
      <h2 class="book-title">The Great Gatsby</h2>
      <p class="author">By F. Scott Fitzgerald</p>
      <p class="release-date">Released: April 10, 1925</p>
      <p class="price">Price: $12.99</p>
    </div>`;
  const bookInfo = getBookInfo(htmlString);
  // expect() takes the actual result as an argument,
  // and toEqual() compares it with the desired outcome.
  expect(bookInfo).toEqual({
    bookTitle: 'The Great Gatsby',
    author: 'By F. Scott Fitzgerald',
    releaseDate: 'Released: April 10, 1925',
    price: 'Price: $12.99'
  });
});

In the same way, we can write as many Jest tests as we need to cover other cases, such as missing values or extra whitespace.

Finding and fixing bugs in your Cheerio code

While creating scraping scripts and manipulating the DOM, we might encounter some challenges, but we have tools and strategies to make the process easier.

Unit tests: Testing the code before deployment and ensuring it works correctly can save us from major problems. Unit testing verifies that each function returns the expected results. Frameworks like Jest and Mocha are used for this.
console.log: Placing console.log statements at various points in the code is a common way to debug programs. It works just as well for Cheerio scripts, for example to print what a selector actually matched.

Debug using the console.log

Browser developer tools: Browser developer tools, such as Chrome DevTools, can be used to inspect the DOM and check that the selectors in our code match the page's actual structure.

Debug using the inspect element

Best practices for web scraping with Cheerio

The following approaches can be used to optimize our code and improve performance.

Handling dynamic content: As we mentioned earlier, dynamic content creates a lot of issues while extracting data, so always keep in mind that Cheerio alone may not help us in this scenario. Cheerio doesn't execute JavaScript, so we need another tool, such as a headless browser like Puppeteer, to render the page first. We can then load the resulting HTML into Cheerio and work on it as usual.
Handling complex selectors: It can be tricky to work on websites that need complex selectors, i.e., deeply nested selectors. It's recommended to break such selectors down into smaller steps and check what each step selects.
Handling version compatibility issues: A given Cheerio version may not be compatible with specific versions of Node.js or other libraries. Check compatibility before using Cheerio, and update to the latest version if necessary.
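
The version check mentioned above can be automated with a small guard at the start of a script. This is only a minimal sketch: meetsMinimumNode and the minimum version of 18 are assumptions made for illustration, not Cheerio's documented requirement. The real minimum for the release you install is typically listed in the engines field of the package's package.json.

```javascript
// Minimal sketch: compare the running Node.js version with a minimum major version.
// meetsMinimumNode is a hypothetical helper, not part of Cheerio or Node.js.
function meetsMinimumNode(currentVersion, minimumMajor) {
  // process.versions.node looks like "18.17.0"; the major version is the first part
  const major = Number.parseInt(currentVersion.split('.')[0], 10);
  return Number.isInteger(major) && major >= minimumMajor;
}

// Assumption for illustration only; look up the real value for your Cheerio release
const HYPOTHETICAL_MINIMUM_MAJOR = 18;

if (!meetsMinimumNode(process.versions.node, HYPOTHETICAL_MINIMUM_MAJOR)) {
  console.warn(`Node.js ${process.versions.node} may be too old for this Cheerio version.`);
}
```

Running the check before any scraping starts makes a version mismatch show up as a clear warning instead of an obscure import error later on.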

You can visit this great blog post to learn more about scraping websites more efficiently using Cheerio.

Alternatives to Cheerio

Let's explore some of the alternatives to web scraping with Cheerio. We'll look at the pros and cons of these libraries as well.

Library	Advantages	Disadvantages	Maintenance status
Cheerio	Lightweight, fast, and easy to use	Limited functionality for dynamic web pages	Maintained and up-to-date
Puppeteer	Full-fledged automation tool with Chrome DevTools integration	Requires more setup than Cheerio, slower, resource-intensive	Maintained and up-to-date
JSDOM	Lightweight, allows for easy DOM manipulation	Limited functionality for dynamic web pages	Maintained and up-to-date
NightmareJS	High-level API for browser automation	Slower than Puppeteer, built on Electron only	Outdated and not maintained
PhantomJS	Lightweight headless browser	WebKit only, development has stopped	Outdated and not maintained
Playwright	Multi-browser support, faster than Puppeteer	Requires more setup than Cheerio, resource-intensive	Maintained and up-to-date
node-html-parser	Fast HTML parser, easy to use	Parsing only, no JavaScript rendering or web automation	Maintained and up-to-date
got-scraping	Lightweight and fast, developed specifically to address drawbacks of modern scraping tools	HTTP client only: needs a separate parser and doesn't render JavaScript	Maintained and up-to-date

Conclusion

Cheerio is a robust and adaptable library with an easy-to-use API for parsing and manipulating HTML. With its jQuery-like syntax, extracting data from web pages and manipulating the HTML to meet your needs is simple.

Although Cheerio is an excellent scraping tool in many cases, it does have its challenges, such as anti-scraping measures, dynamic websites, and performance issues. Yet, refined approaches and tools can assist you in overcoming these challenges and achieving your web scraping objectives.

As you may have observed, this blog post by no means tries to pitch Cheerio as the ultimate scraping tool. Cheerio has its fair share of shortcomings and areas to improve. Having said that, Cheerio still looks pretty promising. Also, it's always easy to switch to alternative tools after mastering it.

As the saying goes, Perpetuam Uitae Doctrina (lifelong learning). Complementing this tutorial with practice is essential. Also, a good tutorial is neither perfect nor the only source of information on the subject. So, if you have any feedback for improvement, please let us know. Happy learning!

Frequently asked questions

What is Cheerio js?

Cheerio is a lightweight library for parsing HTML and extracting data in Node.js. We load HTML into it as a string, and it returns an object that we can query with CSS-style, jQuery-like selectors. Cheerio doesn't download or crawl pages itself, so the HTML has to be fetched with another tool first.

What are the benefits of Cheerio?

Cheerio's jQuery-like syntax is a big advantage for many developers. Setting it up is easy compared to other tools. Even beginners can get started with just a little bit of configuration. Cheerio is also modular and can be extended with Node.js modules. Another big advantage of Cheerio is that it requires no browser.

How do you install Cheerio in Node.js?

Open a terminal in your working directory and create a new Node.js project using the command line interface. Then enter the command npm install cheerio to add Cheerio as a dependency. From there, you can start using it to extract data and manipulate HTML.

How do you scrape data from a single web page using Cheerio?

Load the page's HTML into Cheerio, then describe the data you want as an object. The keys of the object are the names of the properties you want in the result, while the values come from the Cheerio selectors that locate each piece of data on the page.

How do you use Puppeteer with Cheerio?

Use Puppeteer to open the website in a headless browser. Once the website is loaded using Puppeteer, you can use Cheerio to load its HTML.

How do you scrape data from multiple pages using Cheerio and pagination?

When scraping paginated content, you can use the navigation links to move between pages and extract data from each one. However, when scraping multiple pages that aren't paginated, you'll need to use a different approach to identify and extract the data you need.
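
Following navigation links can be sketched as a simple loop. This is a minimal illustration under stated assumptions: scrapeAllPages and fetchPage are hypothetical names, and fetchPage stands for a function you would write yourself that downloads one page, parses it (for example, with Cheerio), and resolves to an object with the extracted items and the URL of the next page.

```javascript
// Minimal sketch: follow "next page" links until there are none left.
// fetchPage is a hypothetical function supplied by the caller. It should
// resolve to { items, nextUrl }, where nextUrl is null on the last page.
async function scrapeAllPages(startUrl, fetchPage, maxPages = 50) {
  const allItems = [];
  const visited = new Set(); // guards against pagination links that loop back
  let url = startUrl;

  while (url && !visited.has(url) && visited.size < maxPages) {
    visited.add(url);
    const { items, nextUrl } = await fetchPage(url);
    allItems.push(...items);
    url = nextUrl; // null or undefined ends the loop
  }

  return allItems;
}
```

The maxPages limit and the visited set are design choices that keep the loop from running forever if a site's navigation links form a cycle.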

How do you handle errors while scraping with Cheerio?

To handle errors, we can use try-catch blocks in our code. We put the code that might fail, such as a network request or a parsing step, inside the try block. If that code throws an error, execution jumps to the catch block, where we can handle the error appropriately, for example by logging it, retrying, or skipping the page.
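
As a rough illustration of this pattern, the sketch below wraps a page download in a try-catch block. fetchHtmlSafely is a hypothetical helper name, and the code assumes the global fetch function available in Node.js 18 and later. If you use another HTTP client, you can pass in your own function that returns a fetch-style response.

```javascript
// Minimal sketch: wrap a request in try/catch so one bad page doesn't crash the scraper.
// fetchHtmlSafely is a hypothetical helper; fetchFn defaults to the global fetch (Node.js 18+).
async function fetchHtmlSafely(url, fetchFn = fetch) {
  try {
    const response = await fetchFn(url);
    if (!response.ok) {
      // fetch doesn't throw on HTTP errors such as 404, so we raise one ourselves
      throw new Error(`Request failed with status ${response.status}`);
    }
    return await response.text();
  } catch (error) {
    console.error(`Could not scrape ${url}: ${error.message}`);
    return null; // the caller can skip this page or retry later
  }
}
```

Returning null instead of rethrowing is one possible choice. It lets a loop over many URLs continue, at the cost of requiring the caller to check for null before passing the HTML to Cheerio.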

How do you write scraped data to a local file using Cheerio?

Store the data in a JSON file. You can use the built-in fs module in Node.js, which lets you interact with the file system on a computer.
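
A minimal sketch of this step is shown below. The helper names toJson and saveAsJson, the file name, and the shape of the records are assumptions for illustration; only JSON.stringify and writeFile from Node's fs/promises module are real APIs.

```javascript
// Minimal sketch: serialize scraped records and write them to a JSON file.
// toJson and saveAsJson are hypothetical helper names.
function toJson(records) {
  return JSON.stringify(records, null, 2); // 2-space indentation keeps the file readable
}

async function saveAsJson(filePath, records) {
  // A dynamic import works from both ES modules and CommonJS files
  const { writeFile } = await import('node:fs/promises');
  await writeFile(filePath, toJson(records), 'utf8');
}
```

For example, calling saveAsJson('books.json', [bookInfo]) from an async function would write the object returned by getBookInfo to a file named books.json in the current directory.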
