All of us use search engines almost daily. When most of us talk about search engines, we really mean World Wide Web search engines. At a very high level, a search engine works like this: a user enters a search term, and the search engine returns a list of relevant resources related to that term. But to provide that list, the search engine first has to know that those resources exist on the Internet.
Search engines do not magically know which websites exist on the Internet. This is exactly where web crawlers come into the picture.
What is a web crawler?
A web crawler is a bot that downloads and indexes content from all over the Internet. The aim of such a bot is to learn what every webpage on the Internet is about, so that the information can be retrieved easily when needed.
A web crawler is like a librarian who organizes the books in a library and makes a catalog of them, so that it becomes easier for readers to find a particular book. To categorize a book, the librarian reads its topic, its summary, and, if required, some of its content.
How does a web crawler work?
It is very difficult to know how many webpages exist on the Internet in total. A web crawler therefore starts with a certain number of known URLs (the seeds) and, as it crawls those pages, finds links to other webpages. A web crawler follows certain policies to decide what to crawl and how frequently to crawl it.
Which webpages to crawl first is also decided by considering certain parameters. For instance, webpages with a lot of visitors are a good starting point, since those are the pages a search engine most wants to have indexed.
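To make the crawling loop concrete, here is a minimal sketch of a crawl frontier in Node.js. It assumes the same request and cheerio modules used later in this post; the seed URL, the maxPages limit, and the crawlNext helper are illustrative assumptions, not how a real search engine schedules its crawls.
var request = require('request');
var cheerio = require('cheerio');

// Illustrative assumptions: one seed URL and a small page limit so the sketch terminates
var frontier = ["https://news.ycombinator.com/news"];
var visited = new Set();
var maxPages = 10;

function crawlNext() {
    if (frontier.length === 0 || visited.size >= maxPages) return;
    var url = frontier.shift();
    if (visited.has(url)) return crawlNext();
    visited.add(url);

    request(url, function(error, response, body) {
        if (!error && response.statusCode === 200) {
            var $ = cheerio.load(body);
            // Collect the links on this page and queue the ones we have not seen yet
            $('a[href]').each(function() {
                try {
                    var link = new URL($(this).attr('href'), url).href;
                    if (!visited.has(link)) frontier.push(link);
                } catch (e) {
                    // Ignore hrefs that are not valid URLs
                }
            });
        }
        crawlNext();
    });
}

crawlNext();
The frontier acts as the crawler's to-do list of URLs and the visited set prevents the same page from being fetched twice; the policies mentioned above decide which URL in the frontier to pick next and how often to revisit it.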
Building a simple web crawler with Node.js and JavaScript
We will be using the cheerio and request modules. Install these dependencies using the following commands:
npm install --save cheerio
npm install --save request
The following code imports the required modules and makes a request to Hacker News.
We log the status code of the response to the console to see if the request was successful.
// Import the required modules
var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');

// Request the Hacker News front page
request("https://news.ycombinator.com/news", (error, response, body) => {
    if (error) {
        console.log("Error: " + error);
        return; // stop here, there is no response to inspect
    }
    console.log("Status code: " + response.statusCode);
});
Note that the fs module, which is used to handle files, is a built-in Node.js module.
We observe the structure of the data using our browser's developer tools. We see that there are tr elements with the athing class.
We will go through all the tr.athing elements and get the title of each post by selecting the child td.title element, and the hyperlink by selecting the a element inside it. This is accomplished by adding the following code after the console.log of the previous code block:
// Load the HTML into cheerio so we can query it with CSS selectors
var $ = cheerio.load(body);

// Select only the story rows: tr.athing elements that contain a td.votelinks cell
$('tr.athing:has(td.votelinks)').each(function(index) {
    var title = $(this).find('td.title > a').text().trim();
    var link = $(this).find('td.title > a').attr('href');
    // Append each title and link to the output file
    fs.appendFileSync('hackernews.txt', title + '\n' + link + '\n');
});
We also skip the posts related to hiring (if we observe carefully, we see that the tr.athing element of such a post does not have a child td.votelinks element), which is why the selector above filters on :has(td.votelinks).
The complete code now looks like the following:
// Import the required modules
var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');

// Request the Hacker News front page
request("https://news.ycombinator.com/news", function(error, response, body) {
    if (error) {
        console.log("Error: " + error);
        return; // nothing to parse if the request failed
    }
    console.log("Status code: " + response.statusCode);

    // Load the HTML into cheerio so we can query it with CSS selectors
    var $ = cheerio.load(body);

    // Only story rows have a td.votelinks cell; hiring posts do not
    $('tr.athing:has(td.votelinks)').each(function(index) {
        var title = $(this).find('td.title > a').text().trim();
        var link = $(this).find('td.title > a').attr('href');
        // Append each title and link to hackernews.txt
        fs.appendFileSync('hackernews.txt', title + '\n' + link + '\n');
    });
});
The data is stored in a file named hackernews.txt.
Your simple web crawler is ready!
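As a quick check, you can read hackernews.txt back and count how many posts were collected. This is a hypothetical helper, not part of the crawler itself, and it assumes the script above has been run at least once:
var fs = require('fs');

// Read the file written by the crawler and count the collected posts
var lines = fs.readFileSync('hackernews.txt', 'utf8').trim().split('\n');
// Each post occupies two lines: the title, then the link
console.log("Posts collected: " + lines.length / 2);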