Have you ever wanted to scrape data but didn’t know where to start? Or have you struggled with websites that hide extra data behind interactions like a "Load More" button? Don’t worry; web scraping doesn’t have to be complicated, and you don’t need to be a pro to get started.
In this tutorial, you’ll learn how to scrape data in Node.js. We’ll use tools like Puppeteer and Cheerio to extract data from a demo e-commerce website.
Prerequisites
To follow along with this tutorial, you’ll need to meet the following prerequisites:
- Basic knowledge of JavaScript and Node.js
- Node.js installed
- A code editor
- Basic knowledge of web scraping
Project Scope
For this project, we’ll scrape products, along with the necessary information about each product, from this demo website. You can check it out to get familiar with its layout before you start.
Below are the libraries needed for scraping the data in this project:
- Puppeteer: Used to scrape data and simulate browser interactions, such as clicking the “Load More” button.
- Cheerio: Used to parse the scraped HTML and extract data from it.
- csv-writer: Used to export the extracted data into a structured CSV file.
Set Up the Project
Open your terminal and navigate to the folder where you want this project set up. Initialize a new Node.js project by running this command:
npm init -y
Next, install the required libraries:
npm install puppeteer cheerio csv-writer
Access the Content
First, check out the site from which you want to extract the data for this project. Below is an image of what the webpage you’ll extract data from looks like:
You’ll notice that 12 products are displayed on the webpage. We’re focusing on extracting these 12 products first, and we’ll do that using CSS selectors.
Go back to the demo webpage. Right-click on an element and select Inspect to open the Developer Tools. Alternatively, you can open the Developer Tools by pressing Ctrl+Shift+I (or Cmd+Option+I on Mac).
Locate the container for all products. In this case, it’s a div with the ID product-grid. Copy the #product-grid selector.
The selector you just copied will be used to access the content.
Next, Puppeteer will be used to navigate to the demo webpage and extract the product grid’s HTML structure using the #product-grid selector.
In your project folder, create a file called scrape.js. Inside it, add the following code:
const puppeteer = require("puppeteer");

(async () => {
  const url = "https://www.scrapingcourse.com/button-click";
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Navigate to the demo page and wait for the product grid to be available
  await page.goto(url, { waitUntil: "domcontentloaded" });
  await page.waitForSelector("#product-grid");

  // Extract the full HTML of the product grid
  const productGridHTML = await page.evaluate(() => {
    return document.querySelector("#product-grid").outerHTML;
  });

  console.log("Initial product grid HTML:", productGridHTML);

  await browser.close();
})();
The await page.waitForSelector call ensures the #product-grid element has loaded, guaranteeing the content is available before scraping. The script then uses page.evaluate to extract the HTML structure of #product-grid and logs the result to the console.
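As a side note, Puppeteer also offers page.$eval, which queries a selector and runs a callback on the matched element in one call. The line below is an optional, more compact equivalent of the page.evaluate block above, not a change you need to make:

// Optional alternative: page.$eval queries "#product-grid" and passes the element to the callback.
const productGridHTML = await page.$eval("#product-grid", (el) => el.outerHTML);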
Run your code using this command:
node scrape.js
You’ll see an output in your terminal printing the HTML structure of #product-grid, which contains the first 12 products:
Load More Products
Each time you click the “Load More” button, it displays more products. This means there are still products that have yet to be scraped. To retrieve them all, you need to simulate clicking the button multiple times.
You first need to identify its selector, so go back to the demo webpage. Right-click on the “Load More” button and select Inspect. The HTML code for the button will then be highlighted:
In the Developer Tools, you’ll see the button's structure:
<button id="load-more-btn" data-offset="10">Load More</button>
This tells you two important things: First, you can uniquely identify the button using its id attribute, which gives you the #load-more-btn selector.
Secondly, the data-offset attribute keeps track of the number of products loaded. After each click, the data-offset value changes to reflect the updated count.
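If you want to confirm this behavior from your script rather than the Developer Tools, you can read the attribute with Puppeteer’s page.$eval. This is just a quick, optional check, placed inside the async function after the page has loaded; it is not part of the final scraper:

// Optional check: read the button's current data-offset value from the page.
const offset = await page.$eval("#load-more-btn", (btn) =>
  btn.getAttribute("data-offset")
);
console.log("Current data-offset:", offset);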
Select Copy > Copy selector to use the selector in your code.
To ensure new products are fully loaded before scraping, we’ll use await page.waitForFunction. This waits for the “Load More” button’s data-offset value to update. Update the logic in the async function as follows:
(async () => {
  const url = "https://www.scrapingcourse.com/button-click";
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: "domcontentloaded" });
  await page.waitForSelector("#load-more-btn");

  for (let i = 0; i < 5; i++) {
    const currentOffset = await page.evaluate(() => {
      return document
        .querySelector("#load-more-btn")
        .getAttribute("data-offset");
    });

    await page.click("#load-more-btn");

    await page.waitForFunction(
      (prevOffset) => {
        const button = document.querySelector("#load-more-btn");
        return button && button.getAttribute("data-offset") !== prevOffset;
      },
      { timeout: 5000 },
      currentOffset,
    );
  }

  const productGridHTML = await page.evaluate(() => {
    return document.querySelector("#product-grid").outerHTML;
  });

  console.log("Updated product grid HTML:", productGridHTML);

  await browser.close();
})();
The main logic introduced here handles the dynamic loading of products. Each time the button is clicked, its data-offset attribute is tracked to verify that new content has been loaded.

You’ll also notice a for loop in the logic. It is used to click the button multiple times; after each click, await page.waitForFunction waits for the data-offset value to update before the next click. Five clicks are enough to load the remaining products on this demo page.
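The fixed count of five clicks works for this demo, but if you don’t know in advance how many clicks a page needs, one variation is to keep clicking until the button disappears or stops loading anything new. The loop below is a sketch of that idea, meant to replace the for loop inside the async function; it assumes that either the button is removed once everything is loaded or that the timeout fires when nothing new appears:

// A sketch of an open-ended version: click until the "Load More" button disappears
// or stops producing new products.
while (await page.$("#load-more-btn")) {
  try {
    const prevOffset = await page.$eval("#load-more-btn", (btn) =>
      btn.getAttribute("data-offset")
    );
    await page.click("#load-more-btn");
    // Wait for the offset to change; if it never does, assume nothing more will load.
    await page.waitForFunction(
      (offset) => {
        const btn = document.querySelector("#load-more-btn");
        return !btn || btn.getAttribute("data-offset") !== offset;
      },
      { timeout: 5000 },
      prevOffset
    );
  } catch (err) {
    break; // Click failed or timed out: stop clicking.
  }
}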
For additional techniques on handling dynamic content, check out how to do web scraping with Puppeteer and NodeJS.
Your code should now look like this:
const puppeteer = require("puppeteer");

(async () => {
  const url = "https://www.scrapingcourse.com/button-click";
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: "domcontentloaded" });
  await page.waitForSelector("#load-more-btn");

  for (let i = 0; i < 5; i++) {
    const currentOffset = await page.evaluate(() => {
      return document
        .querySelector("#load-more-btn")
        .getAttribute("data-offset");
    });

    await page.click("#load-more-btn");

    await page.waitForFunction(
      (prevOffset) => {
        const button = document.querySelector("#load-more-btn");
        return button && button.getAttribute("data-offset") !== prevOffset;
      },
      { timeout: 5000 },
      currentOffset,
    );
  }

  const productGridHTML = await page.evaluate(() => {
    return document.querySelector("#product-grid").outerHTML;
  });

  console.log("Updated product grid HTML:", productGridHTML);

  await browser.close();
})();
Re-run the code, and you’ll see all the products loaded and displayed in your terminal:
Parse Product Information
Now that the full HTML of the products has been captured, the next step is to parse the product details. This is where Cheerio comes in. It will be used to parse the HTML of the product grid and extract specific details for each product, such as name, price, and image URL, into a JSON file. The .product-item selector will be used to locate individual products.
Import the Cheerio and fs libraries:
const cheerio = require("cheerio"); // Import Cheerio for parsing HTML
const fs = require("fs"); // Import the file system module for saving data
Update the logic in the async function to this:
(async () => {
  const url = "https://www.scrapingcourse.com/button-click";
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: "domcontentloaded" });
  await page.waitForSelector("#load-more-btn"); // Wait for the "Load More" button to be available

  for (let i = 0; i < 5; i++) {
    const currentOffset = await page.evaluate(() => {
      return document
        .querySelector("#load-more-btn")
        .getAttribute("data-offset");
    });

    await page.click("#load-more-btn");

    await page.waitForFunction(
      (prevOffset) => {
        const button = document.querySelector("#load-more-btn");
        return button && button.getAttribute("data-offset") !== prevOffset;
      },
      { timeout: 5000 },
      currentOffset,
    );
  }

  const productGridHTML = await page.evaluate(() => {
    return document.querySelector("#product-grid").outerHTML;
  });

  const $ = cheerio.load(productGridHTML);
  const products = [];

  $(".product-item").each((_, element) => {
    const name = $(element).find(".product-name").text().trim();
    const price = $(element).find(".product-price").text().trim();
    const image = $(element).find(".product-image").attr("src");
    const url = $(element).find("a").attr("href");

    const fullImageUrl = image.startsWith("http")
      ? image
      : `https://www.scrapingcourse.com${image}`;
    const fullUrl = url.startsWith("http")
      ? url
      : `https://www.scrapingcourse.com${url}`;

    products.push({
      name,
      price,
      image: fullImageUrl,
      url: fullUrl,
    });
  });

  // Save the product details to a JSON file
  fs.writeFileSync("products.json", JSON.stringify(products, null, 2));
  console.log("Product data saved to products.json");

  await browser.close();
})();
Since the full HTML of the product grid has already been scraped into productGridHTML, all Cheerio needs to do is process productGridHTML by converting it into a DOM-like structure in memory. This lets you query and manipulate the markup much as you would with CSS selectors in a browser. Cheerio uses .product-item to locate each product container, and the .product-name, .product-price, .product-image, and <a> elements to retrieve the specific details.
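To see what a DOM-like structure in memory means in isolation, here is a small, self-contained sketch that runs the same selectors against a made-up snippet of HTML (the markup is hypothetical, but the class names match the ones used in this project). It also falls back to empty strings when an attribute is missing, so startsWith never runs on undefined, which is a useful habit if you adapt the script to other pages:

const cheerio = require("cheerio");

// Hypothetical markup for illustration only; the real page uses the same class names.
const sampleHTML = `
  <div id="product-grid">
    <div class="product-item">
      <a href="/products/example">
        <img class="product-image" src="/images/example.jpg" />
        <span class="product-name">Example Product</span>
        <span class="product-price">$10.00</span>
      </a>
    </div>
  </div>`;

const $ = cheerio.load(sampleHTML);

$(".product-item").each((_, element) => {
  // Guard the attribute lookups so startsWith never runs on undefined.
  const image = $(element).find(".product-image").attr("src") || "";
  const url = $(element).find("a").attr("href") || "";

  console.log({
    name: $(element).find(".product-name").text().trim(),
    price: $(element).find(".product-price").text().trim(),
    image: image.startsWith("http") ? image : `https://www.scrapingcourse.com${image}`,
    url: url.startsWith("http") ? url : `https://www.scrapingcourse.com${url}`,
  });
});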
Your code should now look like this:
const puppeteer = require("puppeteer");
const cheerio = require("cheerio");
const fs = require("fs");

(async () => {
  const url = "https://www.scrapingcourse.com/button-click";
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: "domcontentloaded" });
  await page.waitForSelector("#load-more-btn"); // Wait for the "Load More" button to be available

  for (let i = 0; i < 5; i++) {
    const currentOffset = await page.evaluate(() => {
      return document
        .querySelector("#load-more-btn")
        .getAttribute("data-offset");
    });

    await page.click("#load-more-btn");

    await page.waitForFunction(
      (prevOffset) => {
        const button = document.querySelector("#load-more-btn");
        return button && button.getAttribute("data-offset") !== prevOffset;
      },
      { timeout: 5000 },
      currentOffset,
    );
  }

  const productGridHTML = await page.evaluate(() => {
    return document.querySelector("#product-grid").outerHTML;
  });

  const $ = cheerio.load(productGridHTML);
  const products = [];

  $(".product-item").each((_, element) => {
    const name = $(element).find(".product-name").text().trim();
    const price = $(element).find(".product-price").text().trim();
    const image = $(element).find(".product-image").attr("src");
    const url = $(element).find("a").attr("href");

    const fullImageUrl = image.startsWith("http")
      ? image
      : `https://www.scrapingcourse.com${image}`;
    const fullUrl = url.startsWith("http")
      ? url
      : `https://www.scrapingcourse.com${url}`;

    products.push({
      name,
      price,
      image: fullImageUrl,
      url: fullUrl,
    });
  });

  fs.writeFileSync("products.json", JSON.stringify(products, null, 2));
  console.log("Product data saved to products.json");

  await browser.close();
})();
Re-run the code, and you’ll have a products.json file displaying the details like this:
You’ve successfully scraped a dynamic website in Node.js using Puppeteer and Cheerio. Now, you can decide to stop here or go further and convert the JSON file to CSV. If you decide to stop here, no problem, but if you decide to continue, then let’s get on with it!
Export Product Information to CSV
Import the csv-writer library:
const createCsvWriter = require("csv-writer").createObjectCsvWriter;
Next, add this code inside the async function, right before await browser.close():
const csvWriter = createCsvWriter({
  path: "products.csv",
  header: [
    { id: "name", title: "Name" },
    { id: "price", title: "Price" },
    { id: "image", title: "Image URL" },
    { id: "url", title: "Product URL" },
  ],
});

const jsonData = JSON.parse(fs.readFileSync("products.json", "utf-8"));
await csvWriter.writeRecords(jsonData);
console.log("Product data successfully written to products.csv");
Now your code looks like this:
const puppeteer = require("puppeteer");
const cheerio = require("cheerio");
const fs = require("fs");
const createCsvWriter = require("csv-writer").createObjectCsvWriter;

(async () => {
  const url = "https://www.scrapingcourse.com/button-click";
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  await page.goto(url, { waitUntil: "domcontentloaded" });
  await page.waitForSelector("#load-more-btn");

  for (let i = 0; i < 5; i++) {
    const currentOffset = await page.evaluate(() => {
      return document
        .querySelector("#load-more-btn")
        .getAttribute("data-offset");
    });

    await page.click("#load-more-btn");

    await page.waitForFunction(
      (prevOffset) => {
        const button = document.querySelector("#load-more-btn");
        return button && button.getAttribute("data-offset") !== prevOffset;
      },
      { timeout: 5000 },
      currentOffset
    );
  }

  const productGridHTML = await page.evaluate(() => {
    return document.querySelector("#product-grid").outerHTML;
  });

  const $ = cheerio.load(productGridHTML); // Load the HTML into Cheerio for parsing
  const products = [];

  $(".product-item").each((_, element) => {
    const name = $(element).find(".product-name").text().trim();
    const price = $(element).find(".product-price").text().trim();
    const image = $(element).find(".product-image").attr("src");
    const url = $(element).find("a").attr("href");

    const fullImageUrl = image.startsWith("http")
      ? image
      : `https://www.scrapingcourse.com${image}`;
    const fullUrl = url.startsWith("http")
      ? url
      : `https://www.scrapingcourse.com${url}`;

    products.push({
      name,
      price,
      image: fullImageUrl,
      url: fullUrl,
    });
  });

  fs.writeFileSync("products.json", JSON.stringify(products, null, 2));
  console.log("Product data saved to products.json");

  const csvWriter = createCsvWriter({
    path: "products.csv",
    header: [
      { id: "name", title: "Name" },
      { id: "price", title: "Price" },
      { id: "image", title: "Image URL" },
      { id: "url", title: "Product URL" },
    ],
  });

  const jsonData = JSON.parse(fs.readFileSync("products.json", "utf-8"));
  await csvWriter.writeRecords(jsonData);
  console.log("Product data successfully written to products.csv");

  await browser.close();
})();
Re-run your code, and you’ll see that a products.csv file has been created. It contains all the data that was scraped.
Now we’re finally done!
That’s How To Scrape Data in Node.js using Puppeteer and Cheerio
In this tutorial, we explored how to scrape data in Node.js using Puppeteer and Cheerio. With the right tools, web scraping can be effective and approachable, even for beginners.
Remember to always respect the ethical boundaries of web scraping and adhere to the terms of service of the websites you work with. To ensure you’re scraping responsibly, check out Web Scraping Best Practices and Tools 2024 for more information.