Timilehin Okunola

How To Scrape Web Applications Using Puppeteer

Introduction

Website scraping offers a pool of possibilities for extracting data from websites for various purposes, such as analysis and content monitoring, web archiving and preservation, and research. Web scraping is an automated task, and Puppeteer, a popular Node.js library for headless Chrome/Chromium browser control, is a powerful tool.

Scraping multiple web pages simultaneously might be difficult, so we will also use the Puppeteer-Cluster package.

In this tutorial, we will use the popular scraping package Puppeteer to scrape the website books.toscrape.com, which was built for scraping purposes. We will use the puppeteer-cluster package to scrape the details of the first 100 books on this website.

Prerequisites

To follow along with this tutorial, you need the following installed on your machine:
Node.js >= version 16
npm
A code editor.

You also need to have a basic knowledge of JavaScript.

Set Up Puppeteer

Install the package Puppeteer by running the command below.

npm install puppeteer

Now, create a file called index.js and paste the code below into it to set up Puppeteer and take a screenshot of the website's homepage.


const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ protocolTimeout: 600000 });

  const page = await browser.newPage();

  await page.goto(`https://books.toscrape.com/index.html`, {
    timeout: 60000,
  });

  // Set the viewport size, then take a screenshot of the page
  await page.setViewport({ width: 1080, height: 1024 });
  await page.screenshot({ path: "homepage.png" });

  await browser.close();
})();



Now run the command below in your terminal to see the result.

node index

When the code executes, you will see that a new image file called homepage.png has been created in the project's root folder. It contains a screenshot of the website's landing page.

Now, let us scrape the website properly.

How To Grab Selectors From a Website

To scrape the website, you must first grab selectors pointing to each element you want to scrape data from.

To do this,

  • Open your browser
  • Navigate to the webpage you want to scrape data from; for this tutorial, we will visit the Books to Scrape website.
  • Right-click on the item you wish to scrape and click Inspect, as shown below.

Image describing how to select attributes of an item on a webpage

  • This opens the developer tools, displaying the web page's HTML source and highlighting the inspected element.
  • In the dev tools, right-click the element you wish to scrape data from. This opens a context menu.
  • Hover over the Copy option, and a submenu pops up beside it. Select Copy selector.

Image describing how to select attributes of an item on a webpage

  • This copies the exact path to the element. You can then trim the path based on your understanding of the page’s HTML document, for example:
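Below, the first selector is what the dev tools copy for the first book on the homepage. The second is a hypothetical hand-trimmed alternative; it should match the same element, assuming the page keeps its current structure.

// Selector copied straight from the dev tools for the first book card
const copiedSelector =
  "#default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article";

// A hypothetical hand-trimmed version that targets the same element
const trimmedSelector = "section ol > li:nth-child(1) > article";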

Scrape The First Book On the Page

To scrape the first book, grab the selector for the first book's article element. Then, grab the element's content using the $eval method. This method takes two arguments: the element's selector and a callback function in which you return the property you need.

Below is a demo of implementing the $eval method.

const firstBook = await page.$eval(
    "#default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article",
    (e) => e.innerHTML
  );

  console.log(firstBook);

Add this snippet to the script we wrote earlier, just before the browser.close call. When you run the scraper again in the terminal, the HTML inside the first book's article element is printed to the console.
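For clarity, here is a minimal sketch of the earlier script with the $eval snippet slotted in before browser.close:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ protocolTimeout: 600000 });
  const page = await browser.newPage();

  await page.goto(`https://books.toscrape.com/index.html`, { timeout: 60000 });

  // Grab the first book's article element and log its inner HTML
  const firstBook = await page.$eval(
    "#default > div > div > div > div > section > div:nth-child(2) > ol > li:nth-child(1) > article",
    (e) => e.innerHTML
  );
  console.log(firstBook);

  await browser.close();
})();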

Scrape Multiple Books

The $$eval method makes it possible to scrape multiple elements at once, such as the li items inside an ol. It takes two arguments: a selector that matches every element you want to scrape (here, each li inside the parent ol) and a callback function that receives the matched elements as an array, which you can map over to grab the data you need from each one. The method returns an array of whatever the callback produces for each element.

Below is a demo of how to do that with the books on the first page of the Books to Scrape website.

const booksArray = await page.$$eval(
  "#default > div > div > div > div > section > div:nth-child(2) > ol> li",
  (elements) =>
    elements.map((el) => {
      // Return each book's title attribute
      return el.querySelector("h3> a").getAttribute("title");
    })
);

console.log(booksArray);


Scrape Data From The First 100 Books on the Website

In this section, we will scrape the first 100 books on this website. This website has 50 pages, and each page contains 20 books. This means we will be scraping through the first 5 pages of the website.

To do this, replace the contents of index.js with the code below.

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  let flattenedArray;
  const bookDataArray = [];
  for (let index = 1; index <= 5; index++) {
    if (index === 1) {
      // Navigate the page to a URL
      await page.goto(`https://books.toscrape.com/index.html`, {
        timeout: 60000,
      });

      //Take screenshot of each page
      await page.screenshot({ path: `images/page-${index}.png` });
    } else {
      // Navigate the page to a URL
      await page.goto(
        `https://books.toscrape.com/catalogue/page-${index}.html`,
        {
          timeout: 60000,
        }
      );

      await page.screenshot({ path: `images/page-${index}.png` });
    }

    const booksArray = await page.$$eval(
      "#default > div > div > div > div > section > div:nth-child(2) > ol> li",
      (elements) =>
        elements.map((el, i) => {
          const bookTitle = el.querySelector("h3> a").getAttribute("title");
          const bookPrice = el.querySelector("p.price_color").innerText;
          const imageLink = el.querySelector("img").getAttribute("src");
          const inStock = el.querySelector("p.availability").innerText;

          const bookDetailsLink = el
            .querySelector("h3> a")
            .getAttribute("href");

          const data = {
            i,
            title: `${bookTitle}`,
            detailsLink: `${bookDetailsLink}`,
            price: `${bookPrice}`,
            image: `https://books.toscrape.com/${imageLink}`,
            availability: `${inStock}`,
          };

          return data;
        })
    );

    //Add an index number to each book detail.
    const updatedBookNoInDataArray = booksArray.map((e) => {
      return {
        ...e,
        i: index == 1 ? e.i + 1 : (index - 1) * 20 + e.i + 1,
      };
    });

    bookDataArray.push(updatedBookNoInDataArray);

    //Flatten out the array here
    flattenedArray = [].concat(...bookDataArray);

  }

  await browser.close(); 
})();

In the code snippet above, we first declare a flattenedArray variable and a bookDataArray to store the data we scrape. bookDataArray will hold an array of arrays, which we then flatten into the flattenedArray variable.

We then loop over the first five pages, building each page's URL from the loop index. The first page lives at index.html, so we handle it separately; every other page follows the catalogue/page-${index}.html pattern, with the page number filled in as the loop runs.

Then, on each page, we use the $$eval function to grab the array of books. For each book item, we get the following data: the title, the price, the link to the cover image, the link to the description page, and the availability of the book.

So, each page yields 20 items in booksArray at the end of each loop iteration. We then map over booksArray to give every item a sequential index based on the page it came from, using (index - 1) * 20 + e.i + 1; for example, the fifth book (e.i = 4) on page 3 gets the index (3 - 1) * 20 + 4 + 1 = 45.

Each booksArray is then pushed into bookDataArray, so after the loop bookDataArray holds five arrays of 20 book items each. Finally, we flatten it into a single array, flattenedArray.

If you log flattenedArray to the console and run the script, you should see a single array of 100 items. Each item is an object with the keys i, title, detailsLink, price, image, and availability, and the i values run from 1 to 100.
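For reference, a single entry in flattenedArray looks roughly like the object below. The exact title, price, and image path come from the live site, so treat these values as illustrative.

{
  i: 1,
  title: "A Light in the Attic",
  detailsLink: "catalogue/a-light-in-the-attic_1000/index.html",
  price: "£51.77",
  // the image URL points into the site's media/cache folder
  image: "https://books.toscrape.com/media/cache/...",
  availability: "In stock"
}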

Scrape The Book Description Data For Each of The 100 Books

In this section, we will scrape the book description data for each of the 100 books using the details link. To do this, we will be using another puppeteer package called puppeteer-cluster.

To get started, install the package by running the command below in your terminal.

npm install puppeteer-cluster

Next, import the package into your index file.

const { Cluster } = require("puppeteer-cluster");

Now, at the bottom of the script, just before the browser.close call, declare a new array that will store the data we scrape from each book's details page.

//some code

const addedData = [];

Initialize the cluster instance by pasting the code below in the script.

const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 100,
    timeout: 10000000,
  });

The code snippet above sets the concurrency to Cluster.CONCURRENCY_PAGE. With this setting, the cluster shares a single browser instance and gives each worker its own page inside it, which allows tasks to run in parallel on different web pages.

The maxConcurrency option is set to 100, meaning up to 100 workers can run at the same time. We set it that high because we intend to work with 100 different pages.

The timeout option caps how long a worker may spend on a single task before it is considered timed out and potentially restarted. The value is specified in milliseconds; 10,000,000 ms is 10,000 seconds (nearly three hours), which is deliberately generous.
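Opening 100 pages at once can be heavy on a modest machine. A lower maxConcurrency works just as well, because queued URLs simply wait for a free worker. The values below are a hypothetical, lighter-weight alternative, not part of the original script:

const cluster = await Cluster.launch({
  concurrency: Cluster.CONCURRENCY_PAGE,
  // process 10 pages at a time; the remaining queued URLs wait for a free worker
  maxConcurrency: 10,
  // two minutes per task is usually plenty for a single book page
  timeout: 120000,
});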

Next, register an event listener for task errors. This callback handles any error that occurs while the cluster is processing a page and logs the error message to the console.

//Catch any error that occurs when you scrape a particular page and log it to the console.
 cluster.on("taskerror", (err, data) => {
    console.log(`Error Crawling ${data}: ${err.message}`);
  });

Next, write the function you need the scraper to execute on each page by pasting the code below into the main scraper script.

//Describe what you want the scraper to do on each page here
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url, { timeout: 100000 });
    const details = await page.$eval("#content_inner > article > p", (el) => {
      if (el === undefined) {
        return "";
      } else {
        return el.innerText;
      }
    });

    const tax = await page.$eval(
      "#content_inner > article > table > tbody > tr:nth-child(5) > td",
      (el) => {
        if (el === undefined) {
          return "";
        } else {
          return el.innerText;
        }
      }
    );
    const noOfleftInStock = await page.$eval(
      "#content_inner > article > table > tbody > tr:nth-child(6) > td",
      (el) => {
        if (el === undefined) {
          return "";
        } else {
          return el.innerText;
        }
      }
    );
    addedData.push({ details, noOfleftInStock, tax });
  });

Inside each task, we navigate to the book's details page, use $eval to pull the description, the tax, and the number left in stock, and push them into addedData as one object. Note that if a selector matches nothing, $eval rejects instead of passing undefined to the callback, so a missing element surfaces as a task error and is reported by the taskerror listener we registered above.
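If you would rather have missing elements fall back to an empty string than fail the task, a small helper built on page.$ (which resolves to null when nothing matches) can do that. This is a minimal sketch of one possible approach, not part of the original script:

// Returns the element's innerText, or "" if the selector matches nothing
const safeText = async (page, selector) => {
  const handle = await page.$(selector);
  if (!handle) return "";
  return handle.evaluate((el) => el.innerText);
};

// Usage inside cluster.task:
// const details = await safeText(page, "#content_inner > article > p");

With the task defined, queue the details-page URL for every book in flattenedArray: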

for (const url of flattenedArray) {
    if (url.detailsLink.startsWith("catalogue/")) {
      await cluster.queue(`https://books.toscrape.com/${url.detailsLink}`);
    } else {
      await cluster.queue(
        `https://books.toscrape.com/catalogue/${url.detailsLink}`
      );
    }
  }

Here, we loop over every book item in the flattened array. If the details link already begins with “catalogue/”, we queue the URL by concatenating it directly with the root URL; otherwise, we insert catalogue/ between the root URL and the link, because every book details page lives under the catalogue path. Each resulting URL is passed to cluster.queue, which hands it to the next free worker.

Next, add the lines below to the code.

await cluster.idle();
await cluster.close();

The idle method makes the cluster wait for every queued and currently running task to finish. This ensures that all scraping activities initiated by the queue method are completed before proceeding.

The close method terminates the cluster entirely, gracefully shutting down all browser instances associated with the cluster's workers and releasing any resources allocated to the cluster.

Then, we merge the data retrieved from each details page into our flattened array using the code snippet below.

const finalbookDataArray = flattenedArray.map((e, i) => {
    return {
      ...e,
      bookDescription: addedData[i].details,
      tax: addedData[i].tax,
      noOfleftInStock: addedData[i].noOfleftInStock,
    };
  });
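One caveat: with many workers running at once, tasks can finish in any order, so addedData is not guaranteed to line up index-for-index with flattenedArray. A more robust variant, sketched below rather than taken from the original script, queues each book's index alongside its URL and writes every result into a fixed slot:

// When queueing, pass an object instead of a bare URL, e.g.:
// await cluster.queue({ url: bookUrl, index: book.i - 1 });

await cluster.task(async ({ page, data: { url, index } }) => {
  await page.goto(url, { timeout: 100000 });
  const details = await page.$eval(
    "#content_inner > article > p",
    (el) => el.innerText
  );
  // Store the result at the book's own position so order no longer matters
  addedData[index] = { details };
});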

Finally, let us write all the scraped data into a JSON file using Node's built-in fs module, as shown below.

//Import the package at the top of the file
const fs = require("fs");

const bookDataArrayJson = JSON.stringify(finalbookDataArray, null, 2);
  fs.writeFileSync("scraped-data.json", bookDataArrayJson);

The final code should look like this.

const puppeteer = require("puppeteer");
const { Cluster } = require("puppeteer-cluster");
const fs = require("fs");

(async () => {
  const browser = await puppeteer.launch({ protocolTimeout: 600000 });

  const page = await browser.newPage();

  let flattenedArray;
  const bookDataArray = [];
  for (let index = 1; index <= 5; index++) {
    if (index === 1) {
      // Navigate the page to a URL
      await page.goto(`https://books.toscrape.com/index.html`, {
        timeout: 60000,
      });

      await page.screenshot({ path: `images/page-${index}.png` });
    } else {
      // Navigate the page to a URL
      await page.goto(
        `https://books.toscrape.com/catalogue/page-${index}.html`,
        {
          timeout: 60000,
        }
      );

      await page.screenshot({ path: `images/page-${index}.png` });
    }

    const booksArray = await page.$$eval(
      "#default > div > div > div > div > section > div:nth-child(2) > ol> li",
      (elements) =>
        elements.map((el, i) => {
          const bookTitle = el.querySelector("h3> a").getAttribute("title");
          const bookPrice = el.querySelector("p.price_color").innerText;
          const imageLink = el.querySelector("img").getAttribute("src");
          const inStock = el.querySelector("p.availability").innerText;

          const bookDetailsLink = el
            .querySelector("h3> a")
            .getAttribute("href");

          const data = {
            i,
            title: `${bookTitle}`,
            detailsLink: `${bookDetailsLink}`,
            price: `${bookPrice}`,
            image: `https://books.toscrape.com/${imageLink}`,
            availability: `${inStock}`,
          };

          return data;
        })
    );

    //Add an index number to each book detail.
    const updatedBookNoInDataArray = booksArray.map((e) => {
      return {
        ...e,
        i: index == 1 ? e.i + 1 : (index - 1) * 20 + e.i + 1,
      };
    });

    bookDataArray.push(updatedBookNoInDataArray);

    //Flatten out the array here
    flattenedArray = [].concat(...bookDataArray);
  }

  const addedData = [];

  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 100,
    timeout: 10000000,
  });

  //Catch any error that occurs when you scrape a particular page and log it to the console.
  cluster.on("taskerror", (err, data) => {
    console.log(`Error Crawling ${data}: ${err.message}`);
  });

  //Describe what you want the scraper to do on each page here
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url, { timeout: 100000 });
    const details = await page.$eval("#content_inner > article > p", (el) => {
      if (el === undefined) {
        return "";
      } else {
        return el.innerText;
      }
    });

    const tax = await page.$eval(
      "#content_inner > article > table > tbody > tr:nth-child(5) > td",
      (el) => {
        if (el === undefined) {
          return "";
        } else {
          return el.innerText;
        }
      }
    );
    const noOfleftInStock = await page.$eval(
      "#content_inner > article > table > tbody > tr:nth-child(6) > td",
      (el) => {
        if (el === undefined) {
          return "";
        } else {
          return el.innerText;
        }
      }
    );

    // console.log({details, noOfleftInStock, tax})
    addedData.push({ details, noOfleftInStock, tax });
  });

  for (const url of flattenedArray) {
    if (url.detailsLink.startsWith("catalogue/")) {
      await cluster.queue(`https://books.toscrape.com/${url.detailsLink}`);
    } else {
      await cluster.queue(
        `https://books.toscrape.com/catalogue/${url.detailsLink}`
      );
    }
  }

  await cluster.idle();
  await cluster.close();

  const finalbookDataArray = flattenedArray.map((e, i) => {
    return {
      ...e,
      bookDescription: addedData[i].details,
      tax: addedData[i].tax,
      noOfleftInStock: addedData[i].noOfleftInStock,
    };
  });

  const bookDataArrayJson = JSON.stringify(finalbookDataArray, null, 2);
  fs.writeFileSync("scraped-data.json", bookDataArrayJson);

  await browser.close();
})();


Now, create a folder named images in the project root (the screenshots are saved there) and execute the scraper by running the command below in the terminal.

node index

When the scraper finishes executing, you should have 5 images in your images folder and a file named scraped-data.json containing the scraped data in JSON format.

Wrapping Up

So far in this tutorial, we have learned how to scrape data from a website using Puppeteer and how to scrape multiple pages at once using the puppeteer-cluster package. You can get the full code from my repo here.

You can sharpen your skills by scraping e-commerce or real estate websites. You can also use puppeteer-cluster to build a scraper that compares data across two or more websites.

To learn more about Puppeteer, check out the documentation here. You can also check out the puppeteer-cluster package here.

In the next part of my Puppeteer series, I will discuss how to use Puppeteer for integration testing in web applications.
Till then, you can connect with me on GitHub | X.
