In this tutorial, we are going to create a web crawler that scrapes information from Wikipedia pages. The crawler will run from a command-line interface (e.g. a terminal or command prompt).
The code for this article is on GitHub.
An example of a command that runs the crawler looks like this:
$ node crawl.js -d 3 -x wiki
This command loads the config file named wiki (config/wiki.js), crawls to a depth of 3, and saves the crawled data to a MongoDB collection.
Web Crawling
Web crawlers are programs written to get information from a web page.
“A Web crawler, sometimes called a spider, is an Internet bot that systematically
browses the World Wide Web, typically for the purpose of Web indexing”
— Wikipedia
What we will be needing
For this project, we will need commander, web-crawljs, and mongoose.
Commander
Commander is an npm module that makes it easier to build command-line interfaces and handle command-line arguments. Check out its documentation.
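As a quick illustration, here is a minimal sketch of how commander parses a flag, in the same style we use later in crawl.js. The --greeting flag and the greet.js file name are made up for this example:
const program = require('commander');

program
    .option('-g --greeting <string>', 'greeting to print') //hypothetical flag, for illustration only
    .parse(process.argv);

//the commander version used in this article attaches parsed flags directly to `program`
console.log(program.greeting || 'no greeting passed');
Running node greet.js -g hello would print hello.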
web-crawljs
web-crawljs is an npm module that crawls web pages and extracts information from them. It makes crawling web pages with Nodejs easy.
The only thing web-crawljs needs to start crawling is a configuration object.
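To give a feel for the API before we build the full configuration, here is a minimal sketch of web-crawljs in use. The option names are the same ones we will use in config/wiki.js below, just trimmed down:
const crawler = require('web-crawljs');

const config = {
    fetchSelector: {title: "title"},   //CSS selectors for the data we want to extract
    fetchSelectBy: {title: "text"},    //how to read each selector (text, or ['attr', 'href'])
    fetchFn: (err, data, url) => {     //called with the extracted data for every crawled page
        if (err) return console.error(err.message);
        console.log(url, data.title);
    },
    depth: 1,
    urls: ['https://en.wikipedia.org/wiki/Web_crawler']
};

//crawler(config) returns a crawler instance; CrawlAllUrl starts the crawl
crawler(config).CrawlAllUrl();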
Why web-crawljs
One of the reasons I chose web-crawljs is how easy it is to crawl web pages with it. It is also a lightweight web crawler; that is, it uses far less CPU and RAM than a headless browser (e.g. PhantomJS) would.
Because of that lower CPU and RAM usage, it cannot render SPA (single-page application) pages. And also because I built it :).
All that is required to run it is Nodejs; there is no need to install PhantomJS on your machine. As long as you have Node installed, you are good to go.
mongoose
Mongoose is a MongoDB object modeling tool designed to work in an asynchronous environment. It is an Object Data Modeling (ODM) library for MongoDB that enforces a more structured data model.
Mongoose gives us the ability to create MongoDB data models and schemas.
We are going to use mongoose to save the information extracted from each page to the MongoDB database.
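As a small stand-alone sketch (using the same callback-style mongoose API the rest of this article uses), defining a schema and saving a document looks like this; the example values are made up:
const mongoose = require('mongoose');

//connect to a local MongoDB database named crawl
mongoose.connect('mongodb://localhost/crawl');

//schema -> model -> document
const wikiSchema = new mongoose.Schema({title: String, body: String, references: [String]});
const WikiModel = mongoose.model('Wiki', wikiSchema);

WikiModel.create({title: 'example', body: '...', references: []}, (err, doc) => {
    if (err) return console.error(err.message);
    console.log(`saved ${doc.title}`);
});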
Project Structure
The structure of this project looks like this:
├── config
│ ├── db.js
│ └── wiki.js
├── crawl.js
├── package.json
├── package-lock.json
└── readme.md
crawler/config
The main file in the crawler/config folder is db.js. It contains the configuration for our database. wiki.js is the JavaScript file that holds the configuration for web-crawljs.
Apart from db.js, all other files in this folder are configurations for web-crawljs.
What we will crawl
In this article, we are going to extract some information from Wikipedia and save it to a MongoDB database. The information we want to extract from each page is:
- the title of the wiki content
- the content of the wiki page
- all the reference links
Requirements
For this tutorial, Nodejs and MongoDB must be installed on your machine. I'll be using Node 7.8.0 and MongoDB version 2.6.10, and I'm also making use of ES6 syntax (arrow functions, destructuring, etc.). You can check what you have installed with the commands shown after the list below.
- node >=v7.0.0
- mongodb
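You can confirm your installed versions from the terminal:
$ node -v
$ mongod --version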
Let’s get started
Now let's go straight to business. We will start by creating a new folder called crawler:
$ mkdir crawler
$ cd crawler #move into the folder
Now that it is done, let’s create the config directory inside the crawler directory
$ mkdir config
#create the config files
$ touch config/wiki.js config/db.js
#create the crawl.js file
$ touch crawl.js
Time to create the package.json file. Use the npm init -y command to create it (we're using -y because it's easy).
$ npm init -y
Installing the dependencies
We are making use of only three dependencies in this project: the mongoose, commander, and web-crawljs modules. To install them we will use our good friend npm. Run the following command to install the dependencies:
$ npm install --save web-crawljs mongoose commander
Now that the dependencies are installed, let's move on to the next step.
config/db.js
This file holds the configuration details of our MongoDB database:
/**
 * Created by kayslay on 6/3/17.
 */
module.exports = {
    dbName: "crawl",
    dbHost: "localhost",
};
config/wiki.js
The config/wiki.js file holds the configuration we will use to crawl our Wikipedia page.
/**
 * Created by kayslay on 6/3/17.
 */
const mongoose = require('mongoose');
const dbConfig = require('../config/db');
//mongoose configs
const Schema = mongoose.Schema;
//creating a schema for the extracted data
const wikiSchema = new Schema({
    title: String,
    body: String,
    references: [String]
});
//connect to mongo db
mongoose.connect(`mongodb://${dbConfig.dbHost}/${dbConfig.dbName}`);
//create the model
const wikiModel = mongoose.model('Wiki', wikiSchema);

//crawl config
module.exports = {
    //the selectors on the page we want to select
    //here we are selecting the title, a div with an id of mw-content-text and links with a
    //class name of external and text
    fetchSelector: {title: "title", body: "div#mw-content-text", references: 'a.external.text'},
    //what we want to select from the selector
    //for the title and body we want the text
    //for the references we want to get the href of the links
    fetchSelectBy: {title: "text", body: "text", references: ['attr', 'href']},
    //the same rules apply to the nextSelector and nextSelectBy
    //but this is used to get the links of the pages to crawl next
    nextSelector: {links: 'a[href^="/wiki"]'},
    nextSelectBy: {links: ['attr', 'href']},
    //this changes the next selector when the url matches .svg
    dynamicSchemas: {
        nextSelector: [{url: /\.svg/, schema: {links: ""}}]
    },
    //formats the url
    formatUrl: function (url) {
        if ((/\.svg?/.test(url) || /[A-Z]\w+:\w+?/.test(url))) {
            //returning an already-visited url so that the crawler does not visit the link
            //when the url ends with `.svg` or looks like `Wikipedia:About`
            return 'https://en.wikipedia.org/wiki/Web_crawler/';
        }
        return url;
    },
    //what we want to do with the data extracted from the page
    //we want to save it to a mongodb database
    fetchFn: (err, data, url) => {
        if (err) {
            return console.error(err.message);
        }
        let {title, body, references} = data;
        let wikiData = {title: title[0], body: body[0], references};
        wikiModel.create(wikiData, function (err, wiki) {
            if (err) {
                return console.error(err.message);
            }
            console.log(`page with a title ${wiki.title}, has been saved to the database`);
        });
    },
    //called at the end of the whole crawl
    finalFn: function () {
        console.log('finished crawling wiki');
    },
    depth: 3, //how deep the crawl should go
    limitNextLinks: 10, //limit the number of links we get from wikipedia to 10. this helps when you don't want to get all the links
    urls: ['https://en.wikipedia.org/wiki/Web_crawler/'] //the default urls to crawl if one is not specified
};
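Before moving on to crawl.js, it helps to know roughly what the data argument passed to fetchFn looks like for a single page. The values below are shortened and purely illustrative:
//illustrative only: each fetchSelector key maps to an array of extracted values
const exampleData = {
    title: ['Web crawler - Wikipedia'],
    body: ['A Web crawler, sometimes called a spider, ...'],
    references: ['https://example.com/a', 'https://example.com/b'] //hrefs taken from the a.external.text links
};
That is why fetchFn reads title[0] and body[0], while references is saved as a whole array.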
crawl.js
#!/usr/bin/env node
/**
 * Created by kayslay on 5/31/17.
 */
const crawler = require('web-crawljs');
const program = require('commander');

//commander configuration
function list(val) {
    "use strict";
    return val.split(',');
}

program
    .option('-x --execute <string>', 'the configuration to execute')
    .option('-d --depth [number]', 'the depth of the crawl')
    .option('-u --urls [items]', 'change the urls', list)
    .parse(process.argv);

//throw an error if the execute flag is not used
if (!program.execute) {
    throw new Error('the configuration to use must be set; use the -x flag to define the configuration,' +
        ' or use --help for help');
}

//holds the additional configuration that will be added to crawlConfig
const additionalConfig = {};

//set the object that will override the default crawlConfig
(function (config) {
    //depth
    if (program.depth) config['depth'] = program.depth;
    if (!!program.urls) config['urls'] = program.urls;
})(additionalConfig);

//the action is the file name that holds the crawlConfig
let action = program.execute;

try {
    //set the crawlConfig
    //adds the additional config if needed
    let crawlConfig = Object.assign(require(`./config/${action}`), additionalConfig);
    const Crawler = crawler(crawlConfig);
    Crawler.CrawlAllUrl();
} catch (err) {
    console.error(`An Error occurred: ${err.message}`);
}
The crawl.js file is the main file of this project. It is the file we will run with the node command; it's our entry point.
It depends on two packages, web-crawljs and commander, which are required at the top of the file.
Next we set up the flags our CLI accepts: -x for the configuration to execute, -d for the crawl depth and -u for the urls. Thanks to commander this is very easy to achieve; check its documentation for more.
The values gotten from the CLI are then used to build an additionalConfig object that overrides the defaults in the chosen config file. The comments in the file explain what's going on.
The lines that follow load that config file and perform the web crawl.
Let’s test our crawler
Now that all the code has been written, it's time to test the crawler.
Type the following in your terminal:
$ node crawl.js -x wiki
When we check our MongoDB collection, we will see the title, body and references added to it.
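One quick way to check is with the mongo shell. Assuming mongoose's default collection naming, the Wiki model maps to a collection called wikis:
$ mongo crawl
> db.wikis.find().pretty()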
Instead of using the default Wikipedia URL, we are going to use our own wiki page URL.
$ node crawl -u https://en.wikipedia.org/wiki/Web_crawler -x wiki
This will not start crawling from the default URL set in config/wiki.js, but will start crawling from https://en.wikipedia.org/wiki/Web_crawler.
To add more URLs, separate the URLs by commas.
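For example, to crawl two pages in one run (the second URL here is just for illustration):
$ node crawl.js -x wiki -u https://en.wikipedia.org/wiki/Web_crawler,https://en.wikipedia.org/wiki/Internet_bot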
Conclusion
We now know how to create a web crawler using web-crawljs, commander and mongoose :).
And to those who didn't know how easy it is to create a command-line interface with Nodejs: now you know.
This is at least one more thing you know.
Thanks for reading and please recommend this post.