App development is no walk in the park. Figuring out the right themes, frameworks, or libraries to use when building an app can be a tedious process, and sourcing the right APIs to build on can be quite challenging, especially when the one you need is not available.
In that case, scraping the data yourself can be all you need to provide the right data for your app.
In this blog post, we are going to be learning how to use Puppeteer to scrape data from the web and then export it in JSON format.
By the end, you will know how to scrape data with Puppeteer and apply it to any project of your choice.
Prerequisites
A basic knowledge of Node.js and JavaScript is required to be able to follow along.
Setup
To get started, create a folder and name it puppeteer, then open it as a workspace in VS Code as indicated below.
Press Ctrl + ` (backtick) to open the terminal in VS Code.
You should already have Node.js installed on your system, but if you are unsure, run the command below.
node -v
This will return the version currently installed on your system:
// v21.1.0
I am currently running version v21.1.0, but depending on when you are reading this or what version you have installed, yours may differ.
Next, generate a package.json file. Type the command below and hit Enter.
npm init -y
This will create and set up a package.json
file where all dependencies and devDependencies will be saved.
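The generated file should look roughly like this (the name and other defaults depend on your folder name and npm configuration):
{
  "name": "puppeteer",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}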
Now that we have that figured out, let's install Puppeteer.
npm i puppeteer
The above command will install Puppeteer if your default package manager is npm, but if you are using yarn or pnpm, run the matching command below instead.
# using yarn
yarn add puppeteer
# using pnpm
pnpm i puppeteer
Once the installation finishes, your terminal and project folder should show the installed node_modules as follows.
By default, Puppeteer will install node_modules and package-lock.json and update the package.json file with the latest version of Puppeteer. At the time of writing, I have 21.5.0 installed; yours should be similar or higher. Use npm i puppeteer@21.5.0 to install the exact version used in this tutorial.
Entry Point Setup
Create an index.js file. This will serve as the entry point for our application.
const puppeteer = require('puppeteer');
Open the index.js file and type the code above. The syntax is CommonJS, which is what we'll be using in this tutorial.
(async () => {
// your code should come right here
})();
The code above is an immediately invoked function expression (IIFE). It allows the code inside the function to execute immediately, and it is combined with the async/await syntax for handling asynchronous operations.
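If you prefer, the same thing can be written as a named async function that you call at the end; the IIFE is simply a more compact version of this:
// Equivalent to the IIFE above: declare an async function, then call it.
async function run() {
  // your code should come right here
}

run();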
Now that we have set the function boilerplate, let's create a headless browser instance with Puppeteer.
(async () => {
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
await browser.close();
})();
Here, we create a browser instance by calling puppeteer.launch(). This opens the browser.
The { headless: "new" } option tells the browser to operate in headless mode, that is, to run in the background without displaying its UI.
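While developing, it can help to watch what Puppeteer is actually doing. Passing headless: false (and optionally slowMo) opens a visible browser window; this is just an optional debugging tweak, not something this tutorial requires:
// Optional: launch a visible browser and slow each operation down by 100ms while debugging.
const browser = await puppeteer.launch({ headless: false, slowMo: 100 });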
Next, browser.newPage() opens a new tab and assigns it to the page variable.
Then the await browser.close()
method closes the browser when the function stops running.
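One caveat: if anything between launch() and close() throws an error, the browser process is left running. A common pattern, optional here but worth knowing, is to wrap the work in try...finally so the browser always closes:
(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  try {
    const page = await browser.newPage();
    // scraping code goes here
  } finally {
    // Runs even if the code above throws, so the browser never stays open.
    await browser.close();
  }
})();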
Scraping web page data
To begin scraping data, you have to decide which platform's data you want to scrape. While scraping itself is easy, some platforms block access to scraping their data, and you may have to use a third-party tool like ZenRows or Oxylabs, to mention a few.
In this example, I will be making use of Techmeme.com for scraping.
(async () => {
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
await page.goto("https://www.techmeme.com/");
await browser.close();
}
)();
Here, the goto() method is used to specify the web URL we want to scrape.
Once the URL has been specified, visit and inspect the website you want to scrape to get an overview of its HTML structure. In this case, we will visit techmeme.com and inspect the page to figure out which elements to grab and return.
Right-click on the page to open a context menu, then click Inspect, or simply hit the F12 key. See the example below.
Your DevTools should open up as indicated below.
From the above screenshot, we have identified the element we want to scrape, which is the div with the class of clus. The parent of clus has an id of topcol1. So we will be grabbing this with Puppeteer.
Try the code sample below.
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
await page.goto("https://www.techmeme.com/");
const techNewsApis = await page.evaluate(() =>
Array.from(document.querySelectorAll("#topcol1 .clus"), (e) => ({
// Some code example here.
}))
);
Here, the Array.from() method is used to grab all the elements with the class of clus that sit inside the element with the id of topcol1; this creates a shallow-copied Array instance from the returned elements.
Also, notice we are using the evaluate() method, which takes a callback function. Puppeteer runs that function inside the page and returns a Promise that resolves with whatever the function returns. In this case, the resolved value is assigned to the techNewsApis variable.
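As a quick illustration (not part of our scraper), a minimal evaluate() call could return the page title:
// evaluate() runs this callback inside the page and resolves with its return value.
const title = await page.evaluate(() => document.title);
console.log(title);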
We have seen the basic example of how to grab elements, but this is not all; so far we only have raw elements and have not told Puppeteer what to extract from them.
What we want to do is return each headline on the webpage and the link associated with it. We also want to attach a unique id to the data.
From the above screenshot, I inspected the first <a> tag and will return its name attribute value as an id.
Similarly, I inspected the <a> tag with the class of ourh, which is a child of the clus element, and will return both the headline and the link associated with it.
See the code below
const techNewsApis = await page.evaluate(() =>
Array.from(document.querySelectorAll("#topcol1 .clus"), (e) => ({
id: e.querySelector(".clus > a").name,
title: e.querySelector("a.ourh").innerText,
url: e.querySelector("a.ourh").href,
}))
);
Here, the querySelector() method is used to access the elements, and the following properties are read to return the values associated with them:
- name returns the string value specified in the name attribute of the <a> tag
- innerText returns the text content between the opening and closing <a> tags
- href returns the link specified on the <a> tag
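Note that if one of these elements is missing from a block, querySelector() returns null and the property access will throw inside evaluate(). A slightly more defensive variation, optional and not required for this tutorial, uses optional chaining:
const techNewsApis = await page.evaluate(() =>
  Array.from(document.querySelectorAll("#topcol1 .clus"), (e) => ({
    // Fall back to null if an expected element is not found.
    id: e.querySelector(".clus > a")?.name ?? null,
    title: e.querySelector("a.ourh")?.innerText ?? null,
    url: e.querySelector("a.ourh")?.href ?? null,
  }))
);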
Now that we have successfully scraped the data, let's return it and see what the data looks like.
const techNewsApis = await page.evaluate(() =>
Array.from(document.querySelectorAll("#topcol1 .clus"), (e) => ({
id: e.querySelector(".clus > a").name,
title: e.querySelector("a.ourh").innerText,
url: e.querySelector("a.ourh").href,
}))
);
console.log(techNewsApis);
await browser.close();
Pass the resolved value to the console.log() method.
node index.js
Go to the terminal and run the above command.
The scraped data is now logged to the console as shown above, with the id, title, and url specified respectively.
Exporting scraped data in a JSON format
Logging the data to the terminal is not very useful on its own, since we cannot access or use it directly from there.
What to do next is save it in the local directory of the project we are working on so that we can make use of it.
const puppeteer = require('puppeteer');
const fs = require('fs');
With the fs (file system) module required, we can write the scraped data to a file we can actually use in an application.
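Since we are already inside an async function, you could also use Node's promise-based file API instead of the callback style used below; both approaches work, and this tutorial sticks with the callback:
// Optional alternative: await the write directly with fs/promises (inside the async IIFE).
const fsPromises = require('fs/promises');
await fsPromises.writeFile("techNewsApis.json", JSON.stringify(techNewsApis));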
const puppeteer = require('puppeteer');
const fs = require('fs');
(async () => {
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
await page.goto("https://www.techmeme.com/");
const techNewsApis = await page.evaluate(() =>
Array.from(document.querySelectorAll("#topcol1 .clus"), (e) => ({
id: e.querySelector(".clus > a").name,
title: e.querySelector("a.ourh").innerText,
url: e.querySelector("a.ourh").href,
}))
);
//Save data to JSON file
fs.writeFile("techNewsApis.json", JSON.stringify(techNewsApis), (error) => {
if (error) throw error;
console.log(`techNewsApis is now saved on your project folder`);
});
console.log(techNewsApis);
await browser.close();
}
)();
Here, the fs module is used as follows:
- fs.writeFile("techNewsApis.json", ...) sets the name of the file to be saved
- JSON.stringify(techNewsApis) converts the scraped data into a valid JSON string
- if (error) throw error throws an error if one occurs
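If you want the saved file to be human-readable, JSON.stringify() also accepts indentation arguments; this is optional formatting and the data itself stays the same:
// The third argument (2) pretty-prints the JSON with two-space indentation.
fs.writeFile("techNewsApis.json", JSON.stringify(techNewsApis, null, 2), (error) => {
  if (error) throw error;
  console.log(`techNewsApis is now saved on your project folder`);
});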
Next, go to the terminal and run the command as shown below.
node index.js
This will save the exported data as a valid JSON file for external use.
From the above screenshot, the exported data is saved directly in the project folder.
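From here, any other script in the project can consume the file. For example, a hypothetical readData.js could simply require it:
// readData.js - a hypothetical example of using the scraped data elsewhere.
const techNews = require('./techNewsApis.json');

console.log(`Loaded ${techNews.length} headlines`);
console.log(techNews[0]); // { id: ..., title: ..., url: ... }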
Conclusion
Scraping can be very useful for getting specific data in situations where the data we need is not readily available through an API but can be found on another website; in such cases, we can scrape that data and use it in our project.
However, this comes with some caveats, as some website owners do not grant access for the scraping of their sites.