Introduction to Web Scraping with Node.js

#javascript #node #programming

What is web scraping?

Web scraping is the process of extracting content and data from a website. Unlike screen scraping, which only copies the pixels displayed on screen, web scraping extracts the underlying HTML code and, with it, data stored in a database.

Note: Not all sites allow scraping. Before scraping a site, find out whether you are permitted to do so by checking its privacy policy and its terms and conditions.

Fetching the webpage

The site we will be scraping is Stack Overflow Jobs, a section of Stack Overflow where job vacancies are listed.

Getting started

Step 1: Setting up the working directory

If you don't have Node.js and npm installed, check the official Node.js Docs for instructions on how to do that.

Once Node.js and npm are installed, we can start the project. Open your preferred terminal and run these commands:

Create a directory
Move into the directory

mkdir web-scraper 
cd web-scraper

Now we have a directory for our web scraper, but we also need a package.json file, which gives npm information about our project. To create one, run this in the same terminal window:

npm init

This command tells npm to create a package.json file in our project directory. Just hit Enter at each prompt to accept the defaults; we can change those values later.

Step 2: Install necessary packages

For this project, we only need two npm packages: axios and cheerio. An npm package is essentially a piece of code in the npm registry that we can download with a simple command, npm install.

npm install axios
npm install cheerio

Step 3: Write some code!

const axios = require("axios");
const cheerio = require("cheerio");

const url = "https://stackoverflow.com/jobs";

(async () => {
    try {
        const res = await axios.get(url);
        const html = res.data;

        //loading response data into a Cheerio instance
        const $ = cheerio.load(html);

        const siteName = $(".-logo").text();

        // This would return the site Name
        console.log(siteName);

    } catch (error) {
        console.log(error);
    }
})();

Essentially, the code above does the following:

It includes the modules used in the project with the require function, which is built into Node.js.
It makes an HTTP GET request to the target web page with Axios.

Notice that when a request is sent to the web page, it returns a response. This Axios response object is made up of various components, including data that refers to the payload returned from the server.

So, once the GET request completes, we read the data property of the response, which holds the page's HTML.
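As a rough illustration of that response object, the sketch below reads a few of its fields. The field names status, headers, and data belong to the Axios response; summarizeResponse is a hypothetical helper, and the stubbed object is an assumption that stands in for a real response so the snippet runs without a network call.

```javascript
// Hypothetical helper: picks out the parts of an Axios response
// that matter most when scraping.
function summarizeResponse(res) {
    return {
        status: res.status,                        // HTTP status code, e.g. 200
        contentType: res.headers["content-type"],  // header names are lower-cased in Node.js
        htmlLength: res.data.length                // res.data holds the HTML payload as a string
    };
}

// Stubbed object with the same basic shape as an Axios response
// (made up for illustration; a real one comes from axios.get)
const fakeResponse = {
    status: 200,
    statusText: "OK",
    headers: { "content-type": "text/html; charset=utf-8" },
    data: "<html><body><h1>Hello</h1></body></html>"
};

const summary = summarizeResponse(fakeResponse);
console.log(summary);
```

In the scraper itself, the same helper could be called with the res returned by axios.get(url) before the HTML is handed to Cheerio.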

We loaded the response data into a Cheerio instance. This way, we can create a Cheerio object to help us in parsing through the HTML from the target web page and finding the DOM elements for the data we want—just like when using jQuery.

Following the well-known jQuery convention, we name the Cheerio object $.

We used Cheerio's selector syntax to find the element containing the data we want, which is the site name: $(".-logo") selects the element with the -logo class, and .text() returns its text.

Now, save the code in a file named app.js and run it with this command:

node app.js

You should see something like this:

static@Abdulfatais-MacBook web-scraper $ node app.js

Stack Overflow

Now let's write a script to get the job vacancies.

The code below finds the parent element of every job listing, loops through the matches, and reads each listing's properties, e.g. title, link, and date.
You can select more fields, such as the location and amount, by targeting the corresponding element's class name.

After that, it stores the values in an object and logs that object to the console.

const axios = require("axios");
const cheerio = require("cheerio");

const url = "https://stackoverflow.com/jobs";

(async () => {
    try {
        const res = await axios.get(url);
        const html = res.data;

        //loading response data into a Cheerio instance
        const $ = cheerio.load(html);

        $('.fl1').each((i, el) => {
            // trim() drops the newline and padding around the title text
            const title = $(el).find('.fs-body3').text().trim();
            const link = $(el).find('.s-link').attr('href');
            const date = $(el).find('.fc-orange-400').text();
            const data = {
                title,
                // href already starts with "/", so no extra slash is needed
                link: `https://stackoverflow.com${link}`,
                date
            };

            console.log(data);
        });

    } catch (error) {
        console.log(error);
    }
})();

If everything goes well, you should see output like this in your console:

static@Abdulfatais-MacBook web-scraper $ node app.js

{
  title: 'Full-Stack Software Engineer',
  link: 'https://stackoverflow.com/jobs/471179/full-stack-software-engineer-unhedged',
  date: '5d ago'
}
{
  title: 'Software Engineering',
  link: 'https://stackoverflow.com/jobs/473617/software-engineering-jpmorgan-chase-bank-na',
  date: '5h ago'
}
{
  title: 'Senior Software Engineer (Backend) (m/w/d)',
  link: 'https://stackoverflow.com/jobs/471126/senior-software-engineer-backend-m-w-d-gp-9000-gmbh',
  date: '7d ago'
}
{
  title: 'Senior Backend Engineer Who Loves Typescript',
  link: 'https://stackoverflow.com/jobs/470542/senior-backend-engineer-who-loves-typescript-well-health-inc',
  date: '6d ago'
}
{
  title: 'Java Developer - Software Engineering',
  link: 'https://stackoverflow.com/jobs/473621/java-developer-software-engineering-jpmorgan-chase-bank-na',
  date: '5h ago'
}
{
  title: 'Senior Software Engineer',
  link: 'https://stackoverflow.com/jobs/473494/senior-software-engineer-nori',
  date: '7h ago'
}
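Scraped text often carries stray newlines and padding, and href values are usually relative paths that need to be joined to the site origin. The sketch below shows one way to handle both using only built-in JavaScript; cleanText and toAbsoluteUrl are hypothetical helper names, not part of Cheerio or Axios.

```javascript
// Hypothetical helper: collapse runs of whitespace and strip both ends
function cleanText(raw) {
    return raw.replace(/\s+/g, " ").trim();
}

// Hypothetical helper: resolve a relative href against the site origin.
// The built-in URL class takes care of the slash between origin and path.
function toAbsoluteUrl(href, base = "https://stackoverflow.com") {
    return new URL(href, base).href;
}

const cleanedTitle = cleanText("\nFull-Stack Software Engineer            ");
const absoluteLink = toAbsoluteUrl("/jobs/471179/full-stack-software-engineer-unhedged");

console.log(cleanedTitle);
console.log(absoluteLink);
```

Inside the each callback, helpers like these could wrap the values read with text() and attr('href') before they are stored in the data object.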

Hopefully, this article was able to take you through the steps of scraping your first website.

In upcoming articles, I plan to cover more Node.js topics. Feel free to drop your requests in the comment section, and leave a like if this was useful.

You can also check out my previous article on Creating a Telegram Bot with Nodejs.

Conclusion

We saw what web scraping with Node.js makes possible and learned how to scrape a site with it. If you have any questions, don't hesitate to contact me on Twitter: @iamnotstatic

Top comments (8)

Adam Nathaniel Davis • Dec 22 '20

I don't normally point out things as pedantic as spelling errors, but since it's in your title, and it's used repeatedly throughout the article...

It's scraping. Not scrapping. "Scrapping" is a colloquial term for discarding something. Like, "We're going to be scrapping that old mainframe system in favor of a cloud-based application."