Guide to web scraping with Node

Harsh Singh (harshhhdev) ・3 min read

In this post, we'll build our first small web scraping app.

Before we get started, let's talk a little about what web scraping is. The simplest definition is "extracting data from websites", which is somewhat implied by the name. Web scraping has always been something of a legal grey area. A legal discussion is beyond the scope of this post, though I recommend this blog post, which goes into deeper detail.

So, to introduce today's project: we'll be building a simple GitHub follower counter that shows, from the terminal, how many followers a user has on GitHub.

Initialising

First, let's make a directory for this repository.

$ mkdir github-follower-count

Let's navigate to it.

$ cd github-follower-count

Open it in your code editor. If you're using Visual Studio Code, you can simply run code . in this directory.

Initialise npm.

$ npm init -y

Create the starting file.

$ touch index.js

Install puppeteer.

$ npm i --save puppeteer

You can learn more about puppeteer here.

Getting started with the code

First things first, let's require puppeteer.

const puppeteer = require('puppeteer')

Let's now set up a way for the script to take a command-line argument, so that it can output the followers for any user.

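// process.argv[0] is the node binary and [1] is the script path,
// so the first user-supplied argument is at index 2
const username = process.argv[2]

// Stop early when no username was given (a top-level return works because this file is a CommonJS module)
if (username == null) return console.log('Error! Please specify a user!')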
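As a small variation on the check above, the argument handling can be pulled into a helper so that it is easy to test on its own. This is only a sketch: readUsername is a hypothetical name that is not part of the original script, and treating a whitespace-only argument as missing is an added assumption.

```javascript
// Hypothetical helper (not in the original script): reads the username
// from an argv-style array such as process.argv.
function readUsername(argv) {
  // argv[0] is the node binary and argv[1] is the script path,
  // so the first user-supplied argument lives at index 2.
  const raw = argv[2]

  // Assumption: a missing or whitespace-only argument means "no username".
  if (typeof raw !== 'string' || raw.trim() === '') return null

  return raw.trim()
}

// Simulates: node index.js octocat
console.log(readUsername(['node', 'index.js', 'octocat'])) // octocat

// Simulates: node index.js
console.log(readUsername(['node', 'index.js'])) // null
```

In index.js the call would be readUsername(process.argv), with the error message printed when the result is null.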

Next, let's create our function.

async function getFollowers(user=`https://github.com/${username}`) {

}
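The default parameter above builds the profile URL from the username with a template literal. Here is a minimal sketch of the same idea as a standalone function. profileUrl is a hypothetical name, and escaping the username with encodeURIComponent is an added assumption, so that stray characters in the argument cannot change the path.

```javascript
// Hypothetical helper: turns a username into a GitHub profile URL.
function profileUrl(username) {
  // encodeURIComponent escapes characters such as "/" or "?" that would
  // otherwise be interpreted as part of the URL structure.
  return `https://github.com/${encodeURIComponent(username)}`
}

console.log(profileUrl('octocat')) // https://github.com/octocat
console.log(profileUrl('a/b'))     // https://github.com/a%2Fb
```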

Inside it, let's launch the browser, open a new tab, and navigate to the URL.

   let browser = await puppeteer.launch()
   let page = await browser.newPage()
   await page.goto(user)
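page.goto resolves with the HTTP response for the main document, which gives a way to notice a failed navigation before reading anything from the page. The sketch below assumes that GitHub answers with status 404 for a profile that does not exist. openProfile is a hypothetical helper, and the browser is passed in as a parameter so that it works with whatever puppeteer.launch() returned.

```javascript
// Hypothetical helper: opens `url` in a new tab and reports the HTTP status.
// `browser` is expected to be the object returned by puppeteer.launch().
async function openProfile(browser, url) {
  const page = await browser.newPage()

  // page.goto resolves with the main response; it can be null in some
  // cases, so the status is read defensively.
  const response = await page.goto(url)
  const status = response ? response.status() : null

  // Assumption: a missing GitHub profile is served with a 404 status,
  // so only 2xx and 3xx answers count as "found".
  const found = status !== null && status >= 200 && status < 400

  return { page, status, found }
}
```

Inside getFollowers, the result could be used to return the "incorrect username" message early when found is false.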

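Next, still inside the function, let's evaluate the page. The callback passed to page.evaluate runs in the page itself, so it has access to document.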

   let githubFollowers = await page.evaluate(() => {

   })

Inside the callback, let's get the follower count. If we navigate over to GitHub and right click > View page source (or press Ctrl+U), we can see the code of the website.

In the source, we can see that the span element with the classes text-bold and text-gray-dark holds the current follower count.

Back in our code, let's select that element and read its contents:

      var followerCount = document.querySelector('span.text-bold').innerHTML
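The value read from the span is text, not a number, and it may be abbreviated on profiles with many followers (for example 1.2k). If a numeric value is needed, the text can be converted after scraping. The following is a rough sketch: parseFollowerCount is a hypothetical helper, and the k/m suffix handling is an assumption about how the count is displayed.

```javascript
// Hypothetical helper: converts scraped follower text into a number.
// Returns null when the text does not look like a count at all.
function parseFollowerCount(text) {
  // Drop surrounding whitespace and thousands separators: " 3,104 " -> "3104"
  const cleaned = text.trim().replace(/,/g, '').toLowerCase()

  // A number with an optional decimal part and an optional k or m suffix.
  const match = /^(\d+(?:\.\d+)?)([km]?)$/.exec(cleaned)
  if (match === null) return null

  // Assumption: "k" means thousands and "m" means millions.
  const multipliers = { '': 1, k: 1000, m: 1000000 }
  return Math.round(Number(match[1]) * multipliers[match[2]])
}

console.log(parseFollowerCount('42'))       // 42
console.log(parseFollowerCount('1.2k'))     // 1200
console.log(parseFollowerCount('optional')) // null
```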

Now, let's output the results. There is one catch, however: if a user does not exist, the follower count comes back as "optional". To handle this, we can do...

      if (followerCount == 'optional') return('Error! Incorrect username, make sure to double check your spelling.')
      else return(`That user has a total of ${followerCount} followers!`)
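Note that document.querySelector returns null when nothing matches, in which case reading .innerHTML throws before the "optional" check is ever reached. One way to cover both cases is to return the raw text (or null) from page.evaluate and build the message in Node. A sketch, where describeFollowers is a hypothetical helper rather than part of the original script:

```javascript
// Hypothetical helper: builds the output message from the scraped text.
// `text` is whatever the page yielded: a string, or null when the
// selector matched nothing.
function describeFollowers(text) {
  if (text === null || text.trim() === 'optional') {
    return 'Error! Incorrect username, make sure to double check your spelling.'
  }
  return `That user has a total of ${text.trim()} followers!`
}

// Inside page.evaluate, the lookup could then be written as:
//   const el = document.querySelector('span.text-bold')
//   return el ? el.textContent : null

console.log(describeFollowers('25'))
// That user has a total of 25 followers!
console.log(describeFollowers(null))
// Error! Incorrect username, make sure to double check your spelling.
```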

Next, back in our function, let's log the result.

   let githubFollowers = await page.evaluate(() => {
      var followerCount = document.querySelector('span.text-bold').innerHTML

      if (followerCount == 'optional') return('Error! Incorrect username, make sure to double check your spelling.')
      else return(`That user has a total of ${followerCount} followers!`)
   })

   console.log(githubFollowers)

Make sure to close the browser as well, at the end of the function.

await browser.close()
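If anything between launch and close throws (a failed navigation, a missing element), browser.close() is never reached and the headless browser keeps running. A common way around this is try/finally. The sketch below is one way to arrange it: withBrowser is a hypothetical helper, and the launcher is passed in as a parameter (in index.js it would be the puppeteer module) so the function does not depend on how puppeteer was imported.

```javascript
// Hypothetical helper: launches a browser, hands it to `work`, and
// always closes it afterwards, even when `work` throws.
// `launcher` is any object with a launch() method, e.g. the puppeteer module.
async function withBrowser(launcher, work) {
  const browser = await launcher.launch()
  try {
    return await work(browser)
  } finally {
    // Runs on success and on failure alike.
    await browser.close()
  }
}

// Possible usage in index.js (assumes puppeteer is installed):
//   const puppeteer = require('puppeteer')
//   withBrowser(puppeteer, async (browser) => {
//     const page = await browser.newPage()
//     await page.goto('https://github.com/octocat')
//   })
```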

At the bottom, don't forget to call this function.

getFollowers()

And you should be good to go! Make sure to run node index.js followed by a username to test it out!

You can find the GitHub repository for this project right here.

Note: a far better way to do this is to use the GitHub API. This post was primarily a demonstration of how to select elements and read their contents. If you're looking to make an actual project with this, then the GitHub API is the way to go!
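For reference, here is a rough sketch of the API route. It assumes Node 18 or newer (for the built-in fetch) and uses the public endpoint https://api.github.com/users/<username>, whose JSON response includes a followers field. The name getFollowersFromApi and the User-Agent string are placeholders, the second parameter exists only so a different fetch implementation can be swapped in, and unauthenticated requests are rate limited.

```javascript
// Hypothetical helper: reads the follower count from the GitHub REST API.
// Resolves with a number, or with null when the user does not exist.
async function getFollowersFromApi(username, fetchImpl = fetch) {
  const url = `https://api.github.com/users/${encodeURIComponent(username)}`

  const response = await fetchImpl(url, {
    headers: {
      Accept: 'application/vnd.github+json',
      // Placeholder value; the API expects some User-Agent to be sent.
      'User-Agent': 'github-follower-count'
    }
  })

  // A 404 means there is no such user.
  if (response.status === 404) return null
  if (!response.ok) throw new Error(`GitHub API responded with ${response.status}`)

  const user = await response.json()
  return user.followers
}

// Possible usage:
//   getFollowersFromApi('octocat').then((count) => console.log(count))
```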

Thanks for reading, Happy Thanksgiving.
