In this tutorial, we are going to create a web crawler that scrapes information from Wikipedia pages. The crawler will run from a command-line interface (e.g. a terminal or command prompt).
The code for this article is on GitHub.
An example of a command that runs the crawler looks like this:
$ node crawl.js -d 3 -x wiki
This command loads the config file named wiki (config/wiki.js), crawls to a depth of 3, and saves the crawled data to a MongoDB collection.
Web Crawling
Web crawlers are programs written to get information from a web page.
“A Web crawler, sometimes called a spider, is an Internet bot that systematically
browses the World Wide Web, typically for the purpose of Web indexing”
— Wikipedia
What we will be needing
For this project, we will need commander, web-crawljs, and mongoose.
Commander
Commander is an npm module that makes it easier to build command-line interfaces and handle command-line arguments. Check out its documentation.
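As a quick illustration, here is a minimal sketch of how commander parses a flag, in the same style we use later in crawl.js. The --greeting flag and the greet.js file name are made up for this example:
const program = require('commander');

program
    .option('-g --greeting <string>', 'greeting to print') //hypothetical flag, for illustration only
    .parse(process.argv);

//the commander version used in this article attaches parsed flags directly to `program`
console.log(program.greeting || 'no greeting passed');
Running node greet.js -g hello would print hello.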
web-crawljs
web-crawljs is an npm module that crawls web pages and extracts information from them. It makes crawling web pages with Nodejs easy.
The only thing web-crawljs needs to start crawling is a configuration object.
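To give a feel for the API before we build the full configuration, here is a minimal sketch of web-crawljs in use. The option names are the same ones we will use in config/wiki.js below, just trimmed down:
const crawler = require('web-crawljs');

const config = {
    fetchSelector: {title: "title"},   //CSS selectors for the data we want to extract
    fetchSelectBy: {title: "text"},    //how to read each selector (text, or ['attr', 'href'])
    fetchFn: (err, data, url) => {     //called with the extracted data for every crawled page
        if (err) return console.error(err.message);
        console.log(url, data.title);
    },
    depth: 1,
    urls: ['https://en.wikipedia.org/wiki/Web_crawler']
};

//crawler(config) returns a crawler instance; CrawlAllUrl starts the crawl
crawler(config).CrawlAllUrl();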
Why web-crawljs
One of the reasons I chose web-crawljs is how easy it is to crawl web pages with it. It is also a lightweight web crawler; that is, it uses far less CPU and RAM than a headless browser (e.g. PhantomJS) would.
Because of that lower CPU and RAM usage, it cannot render SPA (single-page application) pages. And also because I built it :).
All that is required to run it is Nodejs; there is no need to install PhantomJS on your machine. As long as you have Node installed, you are good to go.
mongoose
Mongoose is a MongoDB object modeling tool designed to work in an asynchronous environment. It is an Object Data Modeling (ODM) library for MongoDB that enforces a more structured data model.
Mongoose gives us the ability to create MongoDB data models and schemas.
We are going to use mongoose to save the information extracted from each page to the MongoDB database.
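As a small stand-alone sketch (using the same callback-style mongoose API the rest of this article uses), defining a schema and saving a document looks like this; the example values are made up:
const mongoose = require('mongoose');

//connect to a local MongoDB database named crawl
mongoose.connect('mongodb://localhost/crawl');

//schema -> model -> document
const wikiSchema = new mongoose.Schema({title: String, body: String, references: [String]});
const WikiModel = mongoose.model('Wiki', wikiSchema);

WikiModel.create({title: 'example', body: '...', references: []}, (err, doc) => {
    if (err) return console.error(err.message);
    console.log(`saved ${doc.title}`);
});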
Project Structure
The structure of this project looks like this:
├── config
│ ├── db.js
│ └── wiki.js
├── crawl.js
├── package.json
├── package-lock.json
└── readme.md
crawler/config
The main file in the crawler/config folder is db.js. It contains the configuration for our database. wiki.js is the JavaScript file that holds the configuration for web-crawljs.
Apart from db.js, all other files in this folder are configurations for web-crawljs.
What we will crawl
In this article, we are going to extract some information from Wikipedia and save it to a MongoDB database. The information we want to extract from each page is:
- the title of the wiki content
- the content of the wiki page
- all the reference links
Requirements
For this tutorial, Nodejs and MongoDB must be installed on your machine. I'll be using Node 7.8.0 and MongoDB version 2.6.10, and I'm also making use of ES6 syntax (arrow functions, destructuring, etc.). You can check what you have installed with the commands shown after the list below.
- node >=v7.0.0
- mongodb
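You can confirm your installed versions from the terminal:
$ node -v
$ mongod --version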
Let’s get started
Now let's go straight to business. We will start by creating a new folder called crawler:
$ mkdir crawler
$ cd crawler #move into the folder
Now that it is done, let’s create the config directory inside the crawler directory
$ mkdir config
#create the config files
$ touch config/wiki.js config/db.js
#create the crawl.js file
$ touch crawl.js
Time to create the package.json file. Use the npm init -y command to create it (we're using -y because it's easy).
$ npm init -y
Installing the dependencies
We are making use of only three dependencies in this project: the mongoose, commander, and web-crawljs modules. To install them we will use our good friend npm. Run the following command to install the dependencies:
$ npm install --save web-crawljs mongoose commander
Now that the dependencies are installed, let's move on to the next step.
config/db.js
This file holds the configuration details of our MongoDB database:
/**
 * Created by kayslay on 6/3/17.
 */
module.exports = {
    dbName: "crawl",
    dbHost: "localhost",
};
config/wiki.js
The config/wiki.js file holds the configuration we will use to crawl our Wikipedia page.
/**
 * Created by kayslay on 6/3/17.
 */
const mongoose = require('mongoose');
const dbConfig = require('../config/db');
//mongoose configs
const Schema = mongoose.Schema;
//creating a schema for the extracted data
const wikiSchema = new Schema({
    title: String,
    body: String,
    references: [String]
});
//connect to mongo db
mongoose.connect(`mongodb://${dbConfig.dbHost}/${dbConfig.dbName}`);
//create the model
const wikiModel = mongoose.model('Wiki', wikiSchema);

//crawl config
module.exports = {
    //the selectors on the page we want to select
    //here we are selecting the title, a div with an id of mw-content-text and links with a
    //class name of external and text
    fetchSelector: {title: "title", body: "div#mw-content-text", references: 'a.external.text'},
    //what we want to select from the selector
    //for the title and body we want the text
    //for the references we want to get the href of the links
    fetchSelectBy: {title: "text", body: "text", references: ['attr', 'href']},
    //the same rules apply to the nextSelector and nextSelectBy
    //but this is used to get the links of the pages to crawl next
    nextSelector: {links: 'a[href^="/wiki"]'},
    nextSelectBy: {links: ['attr', 'href']},
    //this changes the next selector when the url matches .svg
    dynamicSchemas: {
        nextSelector: [{url: /\.svg/, schema: {links: ""}}]
    },
    //formats the url
    formatUrl: function (url) {
        if ((/\.svg?/.test(url) || /[A-Z]\w+:\w+?/.test(url))) {
            //returning an already-visited url so that the crawler does not visit the link
            //when the url ends with `.svg` or looks like `Wikipedia:About`
            return 'https://en.wikipedia.org/wiki/Web_crawler/';
        }
        return url;
    },
    //what we want to do with the data extracted from the page
    //we want to save it to a mongodb database
    fetchFn: (err, data, url) => {
        if (err) {
            return console.error(err.message);
        }
        let {title, body, references} = data;
        let wikiData = {title: title[0], body: body[0], references};
        wikiModel.create(wikiData, function (err, wiki) {
            if (err) {
                return console.error(err.message);
            }
            console.log(`page with a title ${wiki.title}, has been saved to the database`);
        });
    },
    //called at the end of the whole crawl
    finalFn: function () {
        console.log('finished crawling wiki');
    },
    depth: 3, //how deep the crawl should go
    limitNextLinks: 10, //limit the number of links we get from wikipedia to 10. this helps when you don't want to get all the links
    urls: ['https://en.wikipedia.org/wiki/Web_crawler/'] //the default urls to crawl if one is not specified
};
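Before moving on to crawl.js, it helps to know roughly what the data argument passed to fetchFn looks like for a single page. The values below are shortened and purely illustrative:
//illustrative only: each fetchSelector key maps to an array of extracted values
const exampleData = {
    title: ['Web crawler - Wikipedia'],
    body: ['A Web crawler, sometimes called a spider, ...'],
    references: ['https://example.com/a', 'https://example.com/b'] //hrefs taken from the a.external.text links
};
That is why fetchFn reads title[0] and body[0], while references is saved as a whole array.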
crawl.js
#!/usr/bin/env node
/**
 * Created by kayslay on 5/31/17.
 */
const crawler = require('web-crawljs');
const program = require('commander');

//commander configuration
function list(val) {
    "use strict";
    return val.split(',');
}

program
    .option('-x --execute <string>', 'the configuration to execute')
    .option('-d --depth [number]', 'the depth of the crawl')
    .option('-u --urls [items]', 'change the urls', list)
    .parse(process.argv);

//throw an error if the execute flag is not used
if (!program.execute) {
    throw new Error('the configuration to use must be set; use the -x flag to define the configuration,' +
        ' or use --help for help');
}

//holds the additional configuration that will be added to crawlConfig
const additionalConfig = {};

//set the object that will override the default crawlConfig
(function (config) {
    //depth
    if (program.depth) config['depth'] = program.depth;
    if (!!program.urls) config['urls'] = program.urls;
})(additionalConfig);

//the action is the file name that holds the crawlConfig
let action = program.execute;

try {
    //set the crawlConfig
    //adds the additional config if needed
    let crawlConfig = Object.assign(require(`./config/${action}`), additionalConfig);
    const Crawler = crawler(crawlConfig);
    Crawler.CrawlAllUrl();
} catch (err) {
    console.error(`An Error occurred: ${err.message}`);
}
The crawl.js file is the main file of this project. It is the file we will run with the node command; it's our entry point.
It depends on two packages, web-crawljs and commander, which are required at the top of the file.
Next we set up the flags our CLI accepts: -x for the configuration to execute, -d for the crawl depth and -u for the urls. Thanks to commander this is very easy to achieve; check its documentation for more.
The values gotten from the CLI are then used to build an additionalConfig object that overrides the defaults in the chosen config file. The comments in the file explain what's going on.
The lines that follow load that config file and perform the web crawl.
Let’s test our crawler
Now that all the code has been written, it's time to test the crawler.
Type the following in your terminal:
$ node crawl.js -x wiki
When we check our MongoDB collection, we will see the title, body and references added to it.
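One quick way to check is with the mongo shell. Assuming mongoose's default collection naming, the Wiki model maps to a collection called wikis:
$ mongo crawl
> db.wikis.find().pretty()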
Instead of using the default Wikipedia URL, we are going to use our own wiki page URL.
$ node crawl -u https://en.wikipedia.org/wiki/Web_crawler -x wiki
This will not start crawling from the default URL set in config/wiki.js, but will start crawling from https://en.wikipedia.org/wiki/Web_crawler.
To add more URLs, separate the URLs by commas.
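For example, to crawl two pages in one run (the second URL here is just for illustration):
$ node crawl.js -x wiki -u https://en.wikipedia.org/wiki/Web_crawler,https://en.wikipedia.org/wiki/Internet_bot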
Conclusion
We now know how to create a web crawler using web-crawljs, commander and mongoose :).
And to those who didn't know how easy it is to create a command-line interface with Nodejs: now you know.
This is at least one more thing you know.
Thanks for reading and please recommend this post.