
collegewap

Posted on • Originally published at codingdeft.com

Web Scraping React Application using Node.js

If you have searched for web scraping in Node.js, you have probably found solutions that use Cheerio together with axios or fetch.

The problem with this approach is that Cheerio only parses the HTML returned by the server and does not execute JavaScript, so it cannot scrape dynamically rendered (client-side rendered) web pages.

To scrape such web pages, we need to load them in a browser that runs their JavaScript and wait for the content to finish rendering.
In this article, we will see how to wait for a particular section to appear on the page and then access that element.
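
To make the limitation concrete, here is a minimal sketch of what a plain HTTP request typically receives from a client-side rendered app: an HTML shell with an empty mount point. The sample HTML is an assumption modelled on a typical React build, not the actual response of any particular site, and containsTag is a hypothetical helper that does a crude string check in place of Cheerio.

```javascript
// Sample of the static HTML a client-side rendered app might serve.
// Assumption: this string is illustrative, not copied from a real site.
const staticHtml = `<!DOCTYPE html>
<html>
  <head><title>React App</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>`

// Hypothetical helper: crude check for an opening tag in an HTML string.
function containsTag(html, tagName) {
  const pattern = new RegExp(`<${tagName}[\\s>]`, "i")
  return pattern.test(html)
}

console.log(containsTag(staticHtml, "div")) // true
console.log(containsTag(staticHtml, "h1")) // false
```

The h1 only exists after the script has run in a browser, which is why a parser working on the raw response cannot find it.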

Initial setup

Consider the page https://cra-crawl.vercel.app.
Here, we have a title and a list of fruits.

[Image: inspect element]

If you inspect the page, you will see that the heading is inside an h1 tag and the list has a class named fruits-list.
We will use these two selectors to access the heading and the list of fruits.

Creating Node project

Create a directory called node-react-scraper and run the command npm init -y inside it. This will initialize an npm project.

Now install the package puppeteer using the following command:

npm i puppeteer

Puppeteer is a Node.js library that controls a browser (Chrome by default), usually in headless mode (without a UI), so that we can load and interact with web pages programmatically.

Create a file called index.js inside the root directory.

Reading the heading

We can use Puppeteer in index.js as follows:

const puppeteer = require("puppeteer")

// starting Puppeteer
puppeteer
  .launch()
  .then(async browser => {
    const page = await browser.newPage()
    await page.goto("https://cra-crawl.vercel.app/")
    // Wait for the h1 element to be rendered
    await page.waitForSelector("h1")

    let heading = await page.evaluate(() => {
      const h1 = document.body.querySelector("h1")

      return h1.innerText
    })

    console.log({ heading })

    // closing the browser
    await browser.close()
  })
  .catch(function (err) {
    console.error(err)
  })

In the above code, we wait for the h1 tag to appear on the page and only then access it.
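
Two details are worth keeping in mind here. waitForSelector rejects if the element does not appear within its timeout (30 seconds by default), and in the code above an error thrown before browser.close() leaves the browser process running. The following is a minimal sketch of one way to handle both, not the only way to structure it. The helper name scrapeHeading and its parameters are hypothetical; it takes the puppeteer module as an argument so that it can also be exercised with a stand-in object.

```javascript
// Hypothetical helper: reads the text of the first element matching
// `selector`, and always closes the browser, even if a step fails.
async function scrapeHeading(puppeteer, url, selector, timeoutMs = 10000) {
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    await page.goto(url)
    // Rejects with a timeout error if the element never appears
    await page.waitForSelector(selector, { timeout: timeoutMs })
    // $eval runs the callback in the page against the first match
    return await page.$eval(selector, element => element.innerText)
  } finally {
    // Runs on success and on failure, so no browser process is left behind
    await browser.close()
  }
}
```

With the real library, it could be called as scrapeHeading(require("puppeteer"), "https://cra-crawl.vercel.app/", "h1") and the returned promise handled with then and catch as before.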

You can run the code using the command node index.js.

Accessing the list of fruits

If you want to access the list of fruits, you can do so by using the following code:

const puppeteer = require("puppeteer")

// starting Puppeteer
puppeteer
  .launch()
  .then(async browser => {
    const page = await browser.newPage()
    await page.goto("https://cra-crawl.vercel.app/")
    // Wait for the fruits list to be rendered
    await page.waitForSelector(".fruits-list")

    let heading = await page.evaluate(() => {
      const h1 = document.body.querySelector("h1")

      return h1.innerText
    })

    console.log({ heading })

    let allFruits = await page.evaluate(() => {
      const fruitsList = document.body.querySelectorAll(".fruits-list li")

      let fruits = []

      fruitsList.forEach(value => {
        fruits.push(value.innerText)
      })
      return fruits
    })

    console.log({ allFruits })
    // closing the browser
    await browser.close()
  })
  .catch(function (err) {
    console.error(err)
  })

Here we use the querySelectorAll API to get the list of nodes containing the fruits. Once we have the list, we loop through the nodes and read the text inside each one.
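
Puppeteer also provides page.$$eval, which runs querySelectorAll in the page and passes the matched elements to a callback as an array. The sketch below shows a more compact alternative built on it. The helper name readListItems is hypothetical, and it assumes it is given a page that has already navigated to the target URL.

```javascript
// Hypothetical helper: waits for the list, then returns the text of each item.
async function readListItems(page, listSelector) {
  await page.waitForSelector(listSelector)
  // $$eval passes an array of matching elements to the callback,
  // so Array.prototype.map can be used directly
  return page.$$eval(`${listSelector} li`, items =>
    items.map(item => item.innerText)
  )
}
```

Inside the then callback of the earlier code, const allFruits = await readListItems(page, ".fruits-list") could take the place of the second page.evaluate call.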

Source code

You can view the complete source code here.
