Web scraping is a powerful tool for gathering information from websites. When web scraping in NodeJS, we can choose from an array of libraries; for this tutorial, we'll use Puppeteer to handle the web scraping.
In this tutorial, we'll walk through a simple, step-by-step guide to scraping data from a website with dynamic content. We'll cover making requests using NodeJS, loading additional content, parsing information, and exporting this information to a CSV document.
Let's get right to it!
Step 1: Prerequisites
In this section, we’ll go through all the steps and installations required for this tutorial. To follow along with the rest of this guide, you will need the following:
- NodeJS: You can download and install a NodeJS version with Long-Term Support (LTS) from the NodeJS download page. This installation adds NodeJS to your machine and allows you to install dependencies with the Node Package Manager (NPM). You can verify the installation as shown after this list.
- A code editor of choice.
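If you're not sure whether NodeJS is already installed, you can check the installed versions of NodeJS and NPM from the terminal:
node -v
npm -v
Both commands should print a version number (for example, an LTS release such as v20.x.x).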
With the environment setup complete, you can start setting up NodeJS for your project. To begin, create a new folder in a directory of your choice and create a scrape.js file inside it.
Next, you must initialize NodeJS in the newly created folder. To initialize NodeJS, run the following command at the root of the folder's directory:
npm init -y
This command creates a package.json file that keeps track of our project's dependencies (the node_modules folder is created later, when packages are installed). Next, we need to download some NPM packages to scrape the website. These packages are:
- puppeteer: Puppeteer is a powerful NodeJS library that automates tasks in a headless browser. It allows interaction with page elements, making it helpful in scraping dynamic content.
- json2csv: A fast and configurable JSON to CSV converter.
All required packages are available on the NPM package registry. To download them to your project, run the following command in the terminal at the root of your project's directory:
npm install puppeteer json2csv
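After the installation completes, your package.json should list both packages under dependencies, similar to the snippet below (the exact version numbers will depend on when you run the install):
"dependencies": {
  "json2csv": "^5.0.7",
  "puppeteer": "^22.0.0"
}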
Step 2: Get Access to the Content
After installing all required dependencies, the next step is to access the HTML content on the page. To do this, you’ll use Puppeteer to send the request to the URL and then log the returned response to the console as shown:
// scrape.js
const puppeteer = require("puppeteer");

const url = "https://www.scrapingcourse.com/button-click";

const scrapeFunction = async () => {
  // Launch the browser and open a new blank page
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the target URL
  await page.goto(url);

  // Get the URL's HTML content
  const content = await page.content();
  console.log(content);

  // Close the browser and all of its pages
  await browser.close();
};

scrapeFunction();
You can run the code in the scrape.js file using NodeJS by running the following command in the terminal:
node scrape.js
The output of the code returns the full HTML content of the website in the terminal window:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Load More Button Challenge to Learn Web Scraping - ScrapingCourse.com</title>
<!-- Bootstrap CSS -->
</head>
<body itemscope itemtype="http://schema.org/WebPage">
<main>
<!-- ... -->
</main>
<!-- Bootstrap and jQuery libraries -->
<script src="https://code.jquery.com/jquery-3.5.1.slim.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@popperjs/core@2.5.2/dist/umd/popper.min.js"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/js/bootstrap.min.js"></script>
</body>
</html>
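By default, puppeteer.launch() runs the browser in headless mode, so nothing is displayed on screen. While debugging, it can help to watch the browser do its work; one way to do that (a small optional tweak, not required for this tutorial) is to launch it in headful mode:
// Launch a visible browser window instead of a headless one
const browser = await puppeteer.launch({ headless: false });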
Step 3: Load More Products
Now that you have access to the URL content, you can interact with the page's content to load more products. For this tutorial, you’ll simulate an interaction with the "Load More" button at the bottom of the page.
To do this, you will use Puppeteer's page.click() and page.waitForSelector() methods. The page.click() method accepts the selector of a DOM element, scrolls it into view if required, and then simulates a mouse click at the center of the element. The page.waitForSelector() method, on the other hand, pauses the script until a node that matches the given selector appears on the page. This method is helpful when waiting for a dynamic element on a page.
You must retrieve the selector of the "Load More" button (button#load-more-btn) on the target page before using the page.click() method. Also, for this tutorial, we'll use a for loop to repeat the button click until there are at least 48 product cards on the target page.
// scrape.js

// Click the "Load More" button five times
for (let i = 0; i < 5; i++) {
  await page.click("button#load-more-btn");
}

// Wait for the 48th product card
await page.waitForSelector(".product-grid .product-item:nth-child(48)");
The page.waitForSelector() method accepts the selector of the 48th product item inside the product-grid container. This ensures that the rest of the program runs only after the 48th item is rendered.
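If you'd rather not depend on the :nth-child(48) selector, Puppeteer's page.waitForFunction() method offers an alternative: it waits until a JavaScript condition evaluated in the page becomes truthy. Here's a minimal sketch that waits for at least 48 product cards:
// Optional alternative: wait until at least 48 product cards are in the DOM
await page.waitForFunction(
  () => document.querySelectorAll(".product-grid .product-item").length >= 48
);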
At this point, your scrape.js file should look like this:
// scrape.js
const puppeteer = require("puppeteer");

const url = "https://www.scrapingcourse.com/button-click";

const scrapeFunction = async () => {
  // Launch the browser and open a new blank page
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the target URL
  await page.goto(url);

  // Click the "Load More" button a fixed number of times
  for (let i = 0; i < 5; i++) {
    await page.click("button#load-more-btn");
  }

  // Wait for the 48th product card
  await page.waitForSelector(".product-grid .product-item:nth-child(48)");

  // Get the URL's HTML content
  const content = await page.content();
  console.log(content);

  // Close the browser and all of its pages
  await browser.close();
};

scrapeFunction();
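One optional improvement worth noting: if any of the awaited steps throws (for example, if the selector never appears), the script exits before browser.close() runs and leaves a browser process behind. Wrapping the body in a try...finally block guarantees the cleanup; here is a minimal sketch of the same structure:
const scrapeFunction = async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // ... click, wait, and log the content as shown above
  } finally {
    // Always close the browser, even if a step above throws
    await browser.close();
  }
};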
Again, you can run the NodeJS script in your terminal using the earlier command. The output of the code returns the full-page HTML content of the website in the terminal window:
<!DOCTYPE html><html lang="en"><head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Load More Button Challenge to Learn Web Scraping - ScrapingCourse.com</title>
<!-- Bootstrap CSS -->
</head>
<body itemscope itemtype="http://schema.org/WebPage">
<main>
<!-- ... -->
</main>
<!-- Bootstrap and jQuery libraries -->
<script src="https://code.jquery.com/jquery-3.5.1.slim.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@popperjs/core@2.5.2/dist/umd/popper.min.js"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/js/bootstrap.min.js"></script>
</body></html>
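If you want to double-check that the extra products really rendered before parsing them, you can optionally capture a full-page screenshot right after the wait (the file name below is just an example):
// Optional: save a screenshot of the page after loading more products
await page.screenshot({ path: "products-page.png", fullPage: true });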
Step 4: Parse Product Information
You can parse the product information since you now have the raw HTML data from the target page. Doing this converts the raw HTML data into a more readable format.
To begin, you'll use Puppeteer's page.evaluate() method. This method allows you to run JavaScript functions within the context of the target page and returns a result. Then, create an empty array to collect the extracted data and retrieve the selector of the container element, as shown:
// extract product information and return an array of products
const products = await page.evaluate(() => {
  const productList = [];
  const productElements = document.querySelectorAll(".product-grid .product-item");

  // parsing function goes here

  return productList;
});
Next, loop through each product container returned to scrape through the data using each element’s unique selector. Then, push the data into the array created:
// scrape.js

// extract product information and return an array of products
const products = await page.evaluate(() => {
  const productList = [];
  const productElements = document.querySelectorAll(".product-grid .product-item");

  // loop through each product to extract the data
  productElements.forEach((product) => {
    const name = product.querySelector("div span.product-name").textContent;
    const imageLink = product
      .querySelector("img.product-image")
      .getAttribute("src");
    const price = product.querySelector("div span.product-price").textContent;
    const url = product.querySelector("a").getAttribute("href");

    // push the extracted data to the array created
    productList.push({ name, imageLink, price, url });
  });

  return productList;
});
Finally, use NodeJS's file system module (fs) to create a new products.json file and write the array content to it. You'll also use JavaScript's JSON.stringify() method to convert the array data to formatted JSON for readability.
// scrape.js
const fs = require("fs");
// create a new JSON file and parse all the data to the file
fs.writeFileSync("products.json", JSON.stringify(products, null, 2));
console.log("Data saved to products.json");
At this point, your scrape.js file should look like this:
// scrape.js
const puppeteer = require("puppeteer");
const fs = require("fs");

const url = "https://www.scrapingcourse.com/button-click";

const scrapeFunction = async () => {
  // Launch the browser and open a new blank page
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the target URL
  await page.goto(url);

  // Click the "Load More" button a fixed number of times
  for (let i = 0; i < 5; i++) {
    await page.click("button#load-more-btn");
  }

  // Wait for the 48th product card
  await page.waitForSelector(".product-grid .product-item:nth-child(48)");

  // Extract product information and return an array of products
  const products = await page.evaluate(() => {
    const productList = [];
    const productElements = document.querySelectorAll(".product-grid .product-item");

    // Loop through each product to extract the data
    productElements.forEach((product) => {
      const name = product.querySelector("div span.product-name").textContent;
      const imageLink = product
        .querySelector("img.product-image")
        .getAttribute("src");
      const price = product.querySelector("div span.product-price").textContent;
      const url = product.querySelector("a").getAttribute("href");

      // Push the extracted data to the array created
      productList.push({ name, imageLink, price, url });
    });

    return productList;
  });

  // Create a new JSON file and write all the data to the file
  fs.writeFileSync("products.json", JSON.stringify(products, null, 2));
  console.log("Data saved to products.json");

  // Close the browser and all of its pages
  await browser.close();
};

scrapeFunction();
The output of the code returns structured JSON in the products.json file with the product data:
[
  {
    "name": "Chaz Kangeroo Hoodie",
    "imageLink": "https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg",
    "price": "$52",
    "url": "https://scrapingcourse.com/ecommerce/product/chaz-kangeroo-hoodie"
  },
  {
    "name": "Teton Pullover Hoodie",
    "imageLink": "https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh02-black_main.jpg",
    "price": "$70",
    "url": "https://scrapingcourse.com/ecommerce/product/teton-pullover-hoodie"
  },
  {
    "name": "Bruno Compete Hoodie",
    "imageLink": "https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh03-black_main.jpg",
    "price": "$63",
    "url": "https://scrapingcourse.com/ecommerce/product/bruno-compete-hoodie"
  },
  {
    "name": "Ajax Full-Zip Sweatshirt",
    "imageLink": "https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh12-green_main.jpg",
    "price": "$69",
    "url": "https://scrapingcourse.com/ecommerce/product/ajax-full-zip-sweatshirt"
  },
  ...
  {
    "name": "Mars HeatTech™ Pullover",
    "imageLink": "https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mj10-red_main.jpg",
    "price": "$66",
    "url": "https://scrapingcourse.com/ecommerce/product/mars-heattech™-pullover"
  }
]
Step 5: Export Product Information to CSV
Alternatively, you can export the product data to a CSV file. This is an easy step since you already learned how to parse the information earlier. To do this, you will use the json2csv package installed at the beginning of the tutorial.
// scrape.js
const json2csv = require("json2csv").Parser;
const fs = require("fs");

// . . .

// initialize the parser
const parser = new json2csv();

// create a new CSV file and write the data to the file
const productsCSV = parser.parse(products);
fs.writeFileSync("products.csv", productsCSV);
console.log("Data saved to products.csv");
At this point, your scrape.js file should look like this:
// scrape.js
const puppeteer = require("puppeteer");
const fs = require("fs");
const json2csv = require("json2csv").Parser;

const url = "https://www.scrapingcourse.com/button-click";

const scrapeFunction = async () => {
  // Launch the browser and open a new blank page
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the target URL
  await page.goto(url);

  // Click the "Load More" button a fixed number of times
  for (let i = 0; i < 5; i++) {
    await page.click("button#load-more-btn");
  }

  // Wait for the 48th product card
  await page.waitForSelector(".product-grid .product-item:nth-child(48)");

  // Extract product information and return an array of products
  const products = await page.evaluate(() => {
    const productList = [];
    const productElements = document.querySelectorAll(".product-grid .product-item");

    // Loop through each product to extract the data
    productElements.forEach((product) => {
      const name = product.querySelector("div span.product-name").textContent;
      const imageLink = product
        .querySelector("img.product-image")
        .getAttribute("src");
      const price = product.querySelector("div span.product-price").textContent;
      const url = product.querySelector("a").getAttribute("href");

      // Push the extracted data to the array created
      productList.push({ name, imageLink, price, url });
    });

    return productList;
  });

  // Initialize the CSV parser
  const parser = new json2csv();

  // Create a new CSV file and write the data to the file
  const productsCSV = parser.parse(products);
  fs.writeFileSync("products.csv", productsCSV);
  console.log("Data saved to products.csv");

  // Close the browser and all of its pages
  await browser.close();
};

scrapeFunction();
Again, you can run the NodeJS script in your terminal using the earlier command. The output of the code returns a structured CSV in the products.csv file with the product data:
"name","imageLink","price","url"
"Chaz Kangeroo Hoodie","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg","$52","https://scrapingcourse.com/ecommerce/product/chaz-kangeroo-hoodie"
"Teton Pullover Hoodie","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh02-black_main.jpg","$70","https://scrapingcourse.com/ecommerce/product/teton-pullover-hoodie"
"Bruno Compete Hoodie","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh03-black_main.jpg","$63","https://scrapingcourse.com/ecommerce/product/bruno-compete-hoodie"
"Frankie Sweatshirt","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh04-green_main.jpg","$60","https://scrapingcourse.com/ecommerce/product/frankie--sweatshirt"
"Hollister Backyard Sweatshirt","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh05-white_main.jpg","$52","https://scrapingcourse.com/ecommerce/product/hollister-backyard-sweatshirt"
"Stark Fundamental Hoodie","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh06-blue_main.jpg","$42","https://scrapingcourse.com/ecommerce/product/stark-fundamental-hoodie"
"Hero Hoodie","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh07-gray_main.jpg","$54","https://scrapingcourse.com/ecommerce/product/hero-hoodie"
"Oslo Trek Hoodie","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh08-brown_main.jpg","$42","https://scrapingcourse.com/ecommerce/product/oslo-trek-hoodie"
"Kenobi Trail Jacket","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mj04-black_main.jpg","$47","https://scrapingcourse.com/ecommerce/product/kenobi-trail-jacket"
"Jupiter All-Weather Trainer","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mj06-blue_main.jpg","$56.99","https://scrapingcourse.com/ecommerce/product/jupiter-all-weather-trainer"
"Orion Two-Tone Fitted Jacket","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mj07-red_main.jpg","$72","https://scrapingcourse.com/ecommerce/product/orion-two-tone-fitted-jacket"
"Lando Gym Jacket","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mj08-gray_main.jpg","$99","https://scrapingcourse.com/ecommerce/product/lando-gym-jacket"
"Taurus Elements Shell","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mj09-yellow_main.jpg","$65","https://scrapingcourse.com/ecommerce/product/taurus-elements-shell"
"Mars HeatTech™ Pullover","https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mj10-red_main.jpg","$66","https://scrapingcourse.com/ecommerce/product/mars-heattech™-pullover"
Conclusion
In this tutorial, we’ve covered the basics of web scraping:
- Sending requests to access content on a webpage.
- Loading additional content in a dynamic webpage.
- Parsing information retrieved from the webpage.
- Exporting information to JSON and CSV files.
Whether you’re using open-source libraries or a web scraping service, these steps will help you get started with scraping dynamic content efficiently. Happy scraping!