Harsh Singh


Web Scraping with Node and Puppeteer

In this post, we'll be making a small web scraping app.

Before we get started, let's talk briefly about what web scraping is. The simplest definition is "extracting data from websites", which is somewhat implied by the name. Its legality has always been something of a grey area. A legal discussion is beyond the scope of this article, though I recommend this blog post, which goes into deeper detail about that.

So, to introduce today's project: we'll be building a simple GitHub follower counter that reports, from the terminal, how many followers a user has on GitHub.

Initialising

First, let's make a directory for this repository.

mkdir github-follower-counter

cd github-follower-counter

Open it in your code editor. If you're using Visual Studio Code, you can simply run code . in the terminal. Then create a file named index.js; all of the code below goes into that file.

Initialise yarn (or npm)

yarn init -y

# For NPM
# npm init -y

Install puppeteer

yarn add puppeteer 

# For NPM
# npm i puppeteer

Getting started with the code

First off, let's import puppeteer into our project.

const puppeteer = require('puppeteer')

Now, let's get the terminal arguments from the user. To do this, we can use process.argv, where index 2 holds the first argument typed after the script name.

const username = process.argv[2]

if (username == null) {
   console.log('Error! Please specify a user!')
   process.exit(1)
}
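
To make the argument handling concrete: process.argv holds the path of the Node binary at index 0, the script path at index 1, and the user's arguments from index 2 onward. The sketch below wraps that lookup in a small helper and adds a basic sanity check. The helper name parseUsername and the validation pattern are assumptions for illustration (the pattern reflects my understanding of GitHub's username rules: letters, digits, and single hyphens, up to 39 characters); they are not part of the original script.

```javascript
// Hypothetical helper: pull the username out of an argv-style array.
// argv[0] is the node binary, argv[1] is the script path, argv[2] is the first user argument.
function parseUsername(argv) {
  const username = argv[2]
  if (username == null || username.trim() === '') return null

  // Assumed pattern for GitHub usernames: alphanumerics separated by single hyphens, max 39 chars.
  const looksValid = /^[a-z\d](?:[a-z\d]|-(?=[a-z\d])){0,38}$/i.test(username)
  return looksValid ? username : null
}

console.log(parseUsername(['node', 'index.js', 'octocat'])) // octocat
console.log(parseUsername(['node', 'index.js']))            // null
```

In index.js this could be called as parseUsername(process.argv), treating a null result as the "please specify a user" case.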

Next, let's create our getFollowers function.

const getFollowers = async(user=`https://github.com/${username}`) => {

}

Inside it, let's launch the browser, open a new tab, and navigate to the URL.

   let browser = await puppeteer.launch()
   let page = await browser.newPage()
   await page.goto(user)

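Next, still inside the function, let's evaluate the page. The callback we pass to page.evaluate runs inside the browser page, which is why it can use document.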

   let githubFollowers = await page.evaluate(() => {

   })
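
One detail worth spelling out: the function passed to page.evaluate is serialized and executed in the browser page, not in Node. That is why it can use document, and also why it cannot see variables from the surrounding Node scope. Values from Node have to be passed as extra arguments to page.evaluate. A minimal sketch, where the helper name readText is hypothetical and page is assumed to be a Puppeteer Page:

```javascript
// Hypothetical helper: read the text of the first element matching `selector`.
// `page` is assumed to be a Puppeteer Page (anything with a compatible evaluate() works).
async function readText(page, selector) {
  // The callback runs in the browser, so `selector` must be passed in as an argument;
  // referencing a Node-side variable directly inside the callback would not work.
  return page.evaluate((sel) => {
    const element = document.querySelector(sel)
    // Return null instead of throwing when nothing matches.
    return element ? element.textContent.trim() : null
  }, selector)
}
```

Inside getFollowers this could be called as await readText(page, 'span.text-bold').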

Now, let's get the follower count. If we navigate over to GitHub and right click > View page source (or Ctrl+U), we can see the code of the website.

In there, we can see that the span element with the classes text-bold and text-gray-dark holds the current follower count. GitHub's markup changes over time, so check the class names against the live page.

Back in our code, inside the page.evaluate callback, let's select that element:

      const followerCount = document.querySelector('span.text-bold').innerHTML
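
The value read this way is a string, and GitHub appears to abbreviate large counts (for example "1.2k"). If a number is needed, a small parser can convert it. This is a sketch under that assumption; the function name parseCount is hypothetical, and the suffixes it handles (k and m) are a guess at what the page may display.

```javascript
// Hypothetical helper: convert display text such as "987", "1,234" or "1.2k" into a number.
// Assumes only "k" (thousand) and "m" (million) suffixes; returns null for anything else.
function parseCount(text) {
  const match = /^(\d[\d.,]*)\s*([km])?$/i.exec(text.trim())
  if (!match) return null

  const value = Number(match[1].replace(/,/g, ''))
  if (Number.isNaN(value)) return null

  const multipliers = { k: 1e3, m: 1e6 }
  const suffix = match[2] ? match[2].toLowerCase() : null
  // Math.round guards against floating point noise such as 1.1 * 1000.
  return Math.round(value * (suffix ? multipliers[suffix] : 1))
}

console.log(parseCount('1.2k'))     // 1200
console.log(parseCount('optional')) // null
```

Note that an abbreviated value is already rounded by the site, so the parsed number is only approximate for large accounts.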

Now, let's output the results. There is a catch, however: if the user does not exist, the follower count comes back as "optional". To handle this, we can do:

      if (followerCount == 'optional') return('Error! Incorrect username, make sure to double check your spelling.')
      else return(`That user has a total of ${followerCount} followers!`)
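
Matching on the text "optional" ties the check to whatever the error page happens to contain. An alternative sketch is to look at the HTTP status of the navigation itself: page.goto resolves to the main response, and its status() can be inspected. This assumes GitHub answers with 404 for profiles that do not exist; the helper name openProfile is hypothetical.

```javascript
// Hypothetical helper: navigate to a profile and report whether GitHub found it.
// `page` is assumed to be a Puppeteer Page.
async function openProfile(page, username) {
  const url = `https://github.com/${encodeURIComponent(username)}`
  // goto() resolves to the main HTTP response (it can be null in some edge cases).
  const response = await page.goto(url)
  const status = response ? response.status() : null

  // Assumption: a missing user produces a 404 status.
  return { url, status, notFound: status === 404 }
}
```

The caller can then print the "incorrect username" message when notFound is true and only evaluate the page otherwise.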

Next, back in our function, let's log the result.

   let githubFollowers = await page.evaluate(() => {
      const followerCount = document.querySelector('span.text-bold').innerHTML

      if (followerCount == 'optional') return('Error! Incorrect username, make sure to double check your spelling.')
      else return(`That user has a total of ${followerCount} followers!`)
   })

   console.log(githubFollowers)

Make sure to close the browser at the end of the function as well; otherwise the Node process can keep running after the result is printed.

await browser.close()
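
If anything between launch and close throws (a failed navigation, a selector that matches nothing), browser.close() is never reached. One common way to guard against that is try/finally. The sketch below takes the Puppeteer module as a parameter so that it stays self-contained; the name countFollowers is hypothetical, and the selector is the one used in this post.

```javascript
// Hypothetical wrapper: `puppeteerLib` is assumed to be the object returned by require('puppeteer').
async function countFollowers(puppeteerLib, username) {
  const browser = await puppeteerLib.launch()
  try {
    const page = await browser.newPage()
    await page.goto(`https://github.com/${username}`)
    // Runs in the page; same selector as in the article.
    return await page.evaluate(
      () => document.querySelector('span.text-bold').innerHTML
    )
  } finally {
    // Runs whether the code above returned or threw.
    await browser.close()
  }
}
```

A possible call site would be countFollowers(require('puppeteer'), username).then(console.log).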

At the bottom of the file, don't forget to call this function.

getFollowers()

And you should be good to go! Run node index.js followed by a username (for example, node index.js octocat) to test it out!

Note: a far better way to do this is to use the GitHub API. This post was primarily about how to select and read specific elements on a page; if you're looking to build an actual project with this, then the GitHub API is the way to go!
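
For comparison, here is roughly what the API route could look like. It is a sketch assuming Node 18 or newer (for the built-in fetch) and GitHub's public REST endpoint https://api.github.com/users/<username>, whose JSON response includes a followers field. The function name, the User-Agent value, and the injectable fetchImpl parameter are hypothetical, and unauthenticated requests are rate limited.

```javascript
// Hypothetical helper: fetch the follower count from GitHub's REST API.
// `fetchImpl` defaults to the global fetch available in Node 18+.
async function getFollowersViaApi(username, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(
    `https://api.github.com/users/${encodeURIComponent(username)}`,
    {
      headers: {
        Accept: 'application/vnd.github+json',
        'User-Agent': 'github-follower-counter' // placeholder value
      }
    }
  )

  if (response.status === 404) return null // no such user
  if (!response.ok) throw new Error(`GitHub API responded with ${response.status}`)

  const user = await response.json()
  return user.followers
}
```

Calling getFollowersViaApi(username).then(console.log) would print an exact number, with no browser to launch or close.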

Thanks for reading, Happy Thanksgiving.
