Web Scraping in Node.js Using Axios, Cheerio, and Json2csv

Web scraping is a powerful technique used to extract data from websites. In this tutorial, we'll explore how to perform web scraping using Node.js: Axios for making HTTP requests, Cheerio for parsing HTML content, and json2csv for converting JSON data to CSV. We'll scrape product data from a sample website, https://scrapeme.live/shop/.

Prerequisites

To effectively follow this tutorial on web scraping in Node.js, ensure the following requirements are met:

  • Node.js Installation: You must have Node.js installed on your machine. If not, you can download it from the official website.

  • JavaScript and Node.js Knowledge: Understanding JavaScript and Node.js is crucial for comprehending the code examples and implementing web scraping scripts.

  • DOM (Document Object Model) Understanding: Familiarity with how the Document Object Model (DOM) works is important, as web scraping often involves interacting with HTML elements in a structured document.

  • Code Editor: Have a code editor installed on your system. Visual Studio Code is recommended for its versatility and features, providing an excellent environment for running and debugging Node.js code.

  • Optional: jQuery Knowledge: While not mandatory, familiarity with jQuery can be advantageous. Cheerio, a library used for HTML parsing in Node.js, shares a similar syntax to jQuery, making the learning curve smoother.

What Is Web Scraping?

Web scraping is the process of extracting data from websites. It involves fetching and parsing HTML content to gather information, providing a structured and accessible format for further analysis.

While web scraping is a powerful tool for data extraction, it's essential to be mindful of ethical considerations and respect the terms of service of the websites being scraped.

How It Works

Web scraping involves a series of well-defined steps for gathering data from websites efficiently. The key steps in the process are listed below, followed by a short code sketch showing how they fit together:

  1. Send HTTP Request: Initiate a request to the target URL to retrieve the HTML content of the webpage.

  2. Receive Server Response: The server responds with the HTML content of the webpage, containing the structure and data to be scraped.

  3. Parse HTML Content: Utilize parsing libraries such as Cheerio (for JavaScript) to convert the raw HTML into a structured format that can be easily navigated.

  4. Extract Data: Identify and extract specific data elements by selecting and manipulating HTML elements using parsing library methods.

  5. Save Extracted Data: Store the extracted data in a chosen format, such as JSON, CSV, or a database, to facilitate further analysis or usage.
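Here is a minimal sketch of how these five steps map to code, using the same libraries (Axios, Cheerio, and Node's built-in fs module) we'll install later in this tutorial:

const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");

async function scrapeOnce() {
  // 1 & 2. Send the HTTP request and receive the server's HTML response
  const response = await axios.get("https://scrapeme.live/shop/");

  // 3. Parse the raw HTML into a structure we can query
  const $ = cheerio.load(response.data);

  // 4. Extract the data we care about via CSS selectors
  const titles = $(".woocommerce-loop-product__title")
    .map((index, element) => $(element).text().trim())
    .get();

  // 5. Save the extracted data (here, as a JSON file)
  fs.writeFileSync("./titles.json", JSON.stringify(titles, null, 2));
}

scrapeOnce().catch(console.error);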

WebHarvy, in one of their articles, provides a great illustration of how web scraping works:

image showing how web scraping works

Web Scraping Use Cases

Below are some use cases of web scraping:

  • Competitive Intelligence: Web scraping allows businesses to continuously track competitors' pricing, product catalogs, marketing messaging, and technical capabilities. For example, a retailer can scrape competitors' product pages to monitor real-time pricing changes. If a rival lowers prices, they can quickly adjust their pricing to stay competitive.

  • E-Commerce Data Monitoring: Any data scraped from e-commerce platforms and marketplaces is considered e-commerce data. In practice, scraping data from e-commerce platforms like eBay or Amazon is more complex than it may seem due to anti-scraping mechanisms, dynamic platform changes, cloaking, etc.

  • Financial Data Monitoring: Web scraping helps analyze current market conditions, uncover market changes, calculate possible risks, and monitor local and global news for stock market insights.

  • Price and Product Intelligence: Businesses often use web scraping to gather information about competitors' prices and products, such as available stock or product descriptions.

Now that we understand what web scraping is, how it works, and its use cases, let's move on to our project: scraping data from a website and saving it in CSV format.

Building the Project

To scrape a website, we need to familiarize ourselves with selectors. They play a fundamental role in identifying specific elements within the HTML structure of a webpage; during the scraping process, these elements are targeted for extraction and manipulation.

To demonstrate this concept, let's look at how CSS selectors work. Open the link from which we'll be scraping data in your browser: https://scrapeme.live/shop/. Right-click on the element you wish to select and choose "Inspect" from the context menu. Alternatively, you can open the Developer Tools by pressing Ctrl+Shift+I (or Cmd+Option+I on Mac).

In the Elements panel of the Developer Tools, locate the element you intend to select. The chosen element will be highlighted in the Elements panel and webpage.

Right-click on the selected element in the Elements panel. Hover over the "Copy" option in the context menu, and you'll see various options to copy the selector, including "Copy selector". Click on "Copy selector" to copy the selector to your clipboard like so:

Copying the selector of the element you want to scrape

The image above shows copying the selector for the site header element we want to scrape. Upon copying the selector, we get something like #masthead > div.col-full > div.site-branding > div. This selector is valid and works well; the only issue is that this method creates a brittle selector.

A brittle selector means that changes in the layout would require you to adjust the selector, as it is highly specific — tied to the exact position and structure of the HTML element. Sometimes, this specificity makes it challenging to maintain the code.

An effective way to solve this is to use less specific selectors, which makes them more robust. By selecting elements based on their classes or attributes rather than their precise location in the HTML structure, you reduce the likelihood of selector breakage when the layout or structure of the web page changes.

Alternatively, a third-party tool such as the SelectorGadget extension for Chrome can be used to quickly create selectors. To learn more about selectors, W3Schools has a good CSS selector reference page.
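To see the difference in practice, here is a small sketch using Cheerio (which we'll install in the next step) on a made-up HTML fragment loosely modeled on the shop page's product markup; the fragment and its values are purely illustrative:

const cheerio = require("cheerio");

// A made-up fragment standing in for part of the shop page
const html = `
  <ul id="main">
    <li class="product">
      <h2 class="woocommerce-loop-product__title">Sample Product</h2>
      <span class="price">£10.00</span>
    </li>
  </ul>`;

const $ = cheerio.load(html);

// Brittle: tied to the exact nesting, so it breaks if a wrapper element is added
console.log($("#main > li.product > h2").text()); // "Sample Product"

// Robust: targets the element by its class, regardless of where it sits
console.log($(".woocommerce-loop-product__title").text()); // "Sample Product"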

Let's get started with creating our project. Navigate to the folder where you want to build your project and create a package.json file.

npm init -y

This command creates a package.json file in your project directory, which keeps track of important project information and dependencies. The -y flag tells npm to use default values for all the options, which can save time when you're setting up a simple project and don't need to customize these details.
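With the defaults, the generated package.json looks roughly like this (the exact fields can vary between npm versions, and the name is taken from your project folder, so web-scraper here is only a placeholder):

{
  "name": "web-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}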

Next, install the Node.js packages needed for this project. For web scraping in Node.js, we rely on a few packages, also known as libraries.

npm install axios cheerio json2csv

This npm command installs the three packages into your project. Let's break down what each one is for:

Axios

Axios is a popular JavaScript library for making HTTP requests. It simplifies the process of sending asynchronous HTTP requests to APIs or fetching data from web pages.

In the code, Axios is used to make an HTTP request to the specified URL and retrieve the HTML content of the web page.
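As a quick standalone sketch, this is all it takes to fetch the shop page with Axios and peek at the returned HTML:

const axios = require("axios");

// Fetch the shop page and log the first 200 characters of the returned HTML
axios
  .get("https://scrapeme.live/shop/")
  .then((response) => console.log(response.data.slice(0, 200)))
  .catch((error) => console.error(error.message));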

Cheerio

Cheerio is a fast and flexible jQuery implementation for parsing HTML content on the server side. It provides a convenient way to traverse and manipulate the HTML structure using a familiar syntax similar to jQuery.

In the code, Cheerio loads and parses the HTML content obtained from the Axios request. It allows the code to select and extract specific elements (product titles and prices) from the HTML.
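Here is a minimal, self-contained sketch of Cheerio in action on an inline HTML string (the markup is made up for illustration):

const cheerio = require("cheerio");

// Load a small HTML string and query it with jQuery-style selectors
const $ = cheerio.load('<ul><li class="product">Widget</li><li class="product">Gadget</li></ul>');

console.log($("li.product").length);         // 2
console.log($("li.product").first().text()); // "Widget"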

json2csv

json2csv is a package that facilitates converting JSON (JavaScript Object Notation) data into CSV (Comma-Separated Values) format. This is useful when storing or exporting structured data in a CSV file.

After scraping and collecting product data in JSON format, json2csv is used to convert this data into a CSV format. The resulting CSV data is then written to a file for further analysis or storage.
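As a small sketch of the conversion step on its own, with sample objects that are purely illustrative:

const j2csv = require("json2csv").Parser;

// Convert an array of objects into CSV text
const parser = new j2csv({ fields: ["title", "price"] });
const csv = parser.parse([
  { title: "Sample Product", price: "£10.00" },
  { title: "Another Product", price: "£12.50" },
]);

console.log(csv);
// "title","price"
// "Sample Product","£10.00"
// "Another Product","£12.50"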

The first thing to do is define the constants that will hold references to axios, cheerio, and the other packages. Create a file (e.g., scrape.js) and paste the following code into it:

   const cheerio = require("cheerio");
   const axios = require("axios");
   const j2csv = require("json2csv").Parser;
   const fs = require("fs");

Here, we import the required Node.js modules (cheerio, axios, json2csv, and fs).

Next up, define the target URL and initialize an empty array for the product data:

   const url = "https://scrapeme.live/shop/";
   const product_data = [];

We are setting the target URL and also creating an array to store the scraped product data.

Let's define an asynchronous function named getProducts. It takes a pageCount parameter, which tracks how many pages we have scraped so far, and a pageUrl parameter (defaulting to our target URL), which tells the function which page to fetch. Within the function, we'll enclose our core logic in a try-catch block:

async function getProducts(pageCount, pageUrl = url) {
  try {
    // Your code logic goes here
  } catch (error) {
    console.error(error);
  }
}

Writing the logic inside the try-catch block helps us handle errors gracefully.
Note: It is good practice to use console.error for errors and console.log for other messages.

Next, let's write the actual logic. Inside the try block, paste this:

    const response = await axios.get(pageUrl);
    const $ = cheerio.load(response.data);

    const products = $("li.product"); // Select all product elements

    products.each(function (index, element) {
      const title = $(element)
        .find(".woocommerce-loop-product__title")
        .text()
        .trim();
      const price = $(element).find(".price").text().trim();

      // Push each product as an object with title and price to the product_data array
      product_data.push({ title, price });
    });

In the code above, an HTTP GET request is made to the page URL using Axios, fetching the HTML content of the web page. Cheerio is then used to parse the HTML and extract product information, specifically titles and prices, from the elements matching the CSS selector.

The extracted data is organized into objects and stored in an array named product_data. While this code works if we run it, its only limitation is that it scrapes a single page.

Sometimes we need to scrape more than one page. To achieve this, we'll introduce pagination: a mechanism that allows us to navigate through the multiple pages across which our desired data is spread.

The primary objective is to manage the pagination aspect of web scraping. It checks if there are more pages to scrape (based on a page count limit and the presence of a "Next" page link) and, if so, proceeds to the next page. This ensures that the script systematically scrapes data from each page, collecting all the required information.

Pagination on websites is often achieved by providing a "Next" button on each page, except for the last page. To implement pagination in your own web scraping project, you'll typically need to identify the selector for the "Next" button link. If the selector matches an element, the script extracts the "href" attribute of the link, which contains the URL of the next page.

This URL is used to navigate to the subsequent page and continue the data extraction process. This process continues until the page count limit is reached or there are no more "Next" buttons, at which point the script finalizes the scraping.

However, it's worth noting that handling pagination can be more complex on certain websites. Some websites use numbered pagination with changing URLs, while others use asynchronous calls to an API to get more content. In such cases, the approach to handling pagination may need to be adjusted accordingly.
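For example, when a site exposes numbered page URLs, a simple loop over the page numbers can replace the "Next" link approach. Below is a rough sketch of that alternative, reusing the selectors and product_data array from this tutorial; the /page/N/ URL pattern and the page limit are assumptions for illustration, not something every site follows:

// Sketch: pagination by looping over numbered page URLs instead of following a "Next" link
async function getProductsByPageNumber(maxPages) {
  for (let page = 1; page <= maxPages; page++) {
    // Assumed URL pattern; adjust it to match the site you are scraping
    const pageUrl =
      page === 1
        ? "https://scrapeme.live/shop/"
        : `https://scrapeme.live/shop/page/${page}/`;

    const response = await axios.get(pageUrl);
    const $ = cheerio.load(response.data);

    $("li.product").each(function (index, element) {
      const title = $(element)
        .find(".woocommerce-loop-product__title")
        .text()
        .trim();
      const price = $(element).find(".price").text().trim();
      product_data.push({ title, price });
    });
  }
}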

To add pagination to our project, start by defining a base URL at the top of our code, right after our const url:

const baseUrl = "https://scrapeme.live/shop/";

Now, let's add the following code snippet, still inside the try block and right after the products.each loop, to manage pagination:

    if (
      pageCount < 5 && // Limit how many pages are scraped
      $(".next.page-numbers").length > 0
    ) {
      const nextHref = $(".next.page-numbers").attr("href");
      // The href may be absolute or relative; only prepend the base URL when it's relative
      const next_page = nextHref.startsWith("http")
        ? nextHref
        : baseUrl + nextHref;
      await getProducts(pageCount + 1, next_page);
    } else {
      const parser = new j2csv();
      const csv = parser.parse(product_data);
      fs.writeFileSync("./product.csv", csv);
      console.log(product_data);
    }

In the code above, the script checks whether the current page count is below the limit and whether the "next page" navigation link is present in the HTML, indicating that more pages are available. If both conditions are met, the script recursively calls itself with the next page's URL and continues the web scraping process. When the pagination limit is reached or there is no next page, the collected product data is transformed into CSV format using the j2csv library and saved to a file named product.csv.

Finally, after the function definition, add this line of code:

getProducts(0).then(() => console.log("Scraping completed."));

This initiates the web scraping process by calling getProducts with an initial pageCount of 0 (the pageUrl parameter defaults to our target URL).

Here's the code in full:

const cheerio = require("cheerio");
const axios = require("axios");
const j2csv = require("json2csv").Parser;
const fs = require("fs");

const url = "https://scrapeme.live/shop/";
const baseUrl = "https://scrapeme.live/shop/";
const product_data = [];

async function getProducts(pageCount, pageUrl = url) {
  try {
    const response = await axios.get(pageUrl);
    const $ = cheerio.load(response.data);

    const products = $("#main > ul > li.product"); 

    products.each(function (index, element) {
      const title = $(element)
        .find(".woocommerce-loop-product__title")
        .text()
        .trim();
      const price = $(element).find(".price").text().trim();

      product_data.push({ title, price });
    });

    if (pageCount < 5 && $(".next.page-numbers").length > 0) {
      const nextHref = $(".next.page-numbers").attr("href");
      // The href may be absolute or relative; only prepend the base URL when it's relative
      const next_page = nextHref.startsWith("http") ? nextHref : baseUrl + nextHref;
      await getProducts(pageCount + 1, next_page);
    } else {
      const parser = new j2csv();
      const csv = parser.parse(product_data);
      fs.writeFileSync("./product.csv", csv);
      console.log(product_data);
    }
  } catch (error) {
    console.error(error);
  }
}
getProducts(0).then(() => console.log("Scraping completed."));

Save the file and run the script using:

node scrape.js

The script will scrape product data from the specified website, handle pagination, and store the results in a new CSV file named product.csv, which can be viewed with any spreadsheet program, e.g., Microsoft Excel.

Data scraped to CSV

Congratulations! You've successfully implemented web scraping using Node.js, Axios, Cheerio, and json2csv.

Conclusion

In this tutorial, we explored web scraping using Node.js, Axios, Cheerio, and json2csv.

Using Axios, we were able to make HTTP calls to a target website and retrieve HTML content. Cheerio, a server-side jQuery-like implementation, made it simple to parse and traverse the HTML structure. This combination identified and retrieved specific information, such as product titles and prices.

The provided code demonstrated how to scrape product data from an e-commerce site, handle pagination, and export the obtained data to a CSV file using the json2csv tool.

Remember to be knowledgeable about legal and ethical aspects, be conscious of the impact on the target website, and maintain compliance with relevant regulations when you embark on web scraping projects. When done correctly, web scraping offers a world of possibilities for data-driven insights and automation.
