Rahul Pathak

What is a Web Crawler? | Building a simple web crawler with Node.js and JavaScript

All of us use search engines almost daily. When most of us talk about search engines, we really mean World Wide Web search engines. At a high level, a user enters a search term and the search engine returns a list of relevant resources related to that term. But to provide that list, the search engine must first know that those resources exist on the Internet.
Search engines do not magically know what websites exist on the Internet. This is exactly where web crawlers come into the picture.

What is a web crawler?

A web crawler is a bot that downloads and indexes content from all over the Internet. The aim of such a bot is to learn what every webpage on the Internet is about, so that the information can be retrieved easily when it is needed.
A web crawler is like a librarian who organizes the books in a library and builds a catalog of them, so that it becomes easier for a reader to find a particular book. To categorize a book, the librarian reads its title, its summary, and, if required, some of its content.

How does a web crawler work?

It is very difficult to know how many webpages exist on the Internet in total. A web crawler starts from a set of known URLs (the seeds) and, as it crawls those webpages, it finds links to other webpages. A web crawler follows certain policies to decide what to crawl and how frequently to re-crawl it.
Which webpages to crawl first is also decided by certain parameters. For instance, a webpage with a lot of visitors is a good place to start, since a search engine will want to have it indexed.
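
To make the idea concrete, here is a minimal sketch of such a crawl loop in JavaScript. Note that fetchPage and extractLinks are hypothetical placeholders, and a real crawler would also respect robots.txt and rate-limit its requests.

// A minimal crawl loop: start from seed URLs and follow discovered links.
// fetchPage and extractLinks are hypothetical placeholders.
async function crawl(seedUrls, maxPages) {
  var queue = seedUrls.slice(); // URLs waiting to be crawled
  var visited = new Set();      // URLs already crawled

  while (queue.length > 0 && visited.size < maxPages) {
    var url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    var body = await fetchPage(url);     // download the page
    var links = extractLinks(body, url); // find links to other webpages
    for (var link of links) {
      if (!visited.has(link)) queue.push(link);
    }
  }
  return visited; // everything we managed to crawl
}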

Building a simple web crawler with Node.js and JavaScript

We will be using the cheerio and request modules. (The request package has since been deprecated, but it still works for a small demo like this one.)
Install these dependencies using the following commands:

npm install --save cheerio
npm install --save request

The following code imports the required modules and makes a request to Hacker News.
We log the status code of the response to the console to see if the request was successful.

var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');

request("https://news.ycombinator.com/news", (error, response, body) => {
  if (error) {
    console.log("Error: " + error);
    return; // stop here: response is undefined if the request failed
  }
  console.log("Status code: " + response.statusCode);
});

Note that fs is a built-in module used to handle files; we will use it later to save the scraped data.

We observe the structure of the data using our browser's developer tools. We see that there are tr elements with the class athing.

[Screenshot: browser developer tools showing the tr.athing rows on the Hacker News front page]

We will go through all the tr.athing elements, get the title of each post by selecting the child td.title element, and get the hyperlink from the a element inside it. This is accomplished by adding the following code after the console.log of the previous code block:

  var $ = cheerio.load(body);

  // Only story rows have a td.votelinks child; hiring posts do not.
  $('tr.athing:has(td.votelinks)').each(function(index) {
    var title = $(this).find('td.title > a').text().trim(); // post title
    var link = $(this).find('td.title > a').attr('href');   // post URL
    fs.appendFileSync('hackernews.txt', title + '\n' + link + '\n');
  });

This selector also skips the posts related to hiring: if you look carefully, you will see that the tr.athing element of such a post does not have a child td.votelinks element, so :has(td.votelinks) filters it out.

[Screenshot: developer tools showing that a hiring post's tr.athing row has no td.votelinks child]
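
As a side note, if the :has pseudo-selector is not supported by your version of cheerio, the same filtering can be done manually; this is just an alternative sketch, not part of the final code:

  // Alternative: keep only the rows that contain a td.votelinks child
  var stories = $('tr.athing').filter(function() {
    return $(this).find('td.votelinks').length > 0;
  });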

The complete code now looks like the following:

var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');

request("https://news.ycombinator.com/news", function(error, response, body) {
  if (error) {
    console.log("Error: " + error);
    return; // stop here: response is undefined if the request failed
  }
  console.log("Status code: " + response.statusCode);

  var $ = cheerio.load(body);

  // Only story rows have a td.votelinks child; hiring posts do not.
  $('tr.athing:has(td.votelinks)').each(function(index) {
    var title = $(this).find('td.title > a').text().trim(); // post title
    var link = $(this).find('td.title > a').attr('href');   // post URL
    fs.appendFileSync('hackernews.txt', title + '\n' + link + '\n');
  });

});

The data is stored in a file named hackernews.txt.

[Screenshot: the contents of hackernews.txt, with each post's title followed by its link]
Your simple web crawler is ready!!
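
If you want to take this a step further and make it behave more like a real crawler, you can feed the links you just extracted back into the crawl. The sketch below is a hypothetical extension of the code above, not part of the original; a polite crawler would also respect robots.txt and throttle its requests.

var visited = new Set();

// Hypothetical extension: visit a discovered link and log its page title.
// Note: some hrefs on Hacker News are relative (e.g. "item?id=...") and
// would need to be resolved to absolute URLs before requesting them.
function crawlPage(url) {
  if (visited.has(url)) return; // avoid crawling the same URL twice
  visited.add(url);

  request(url, function(error, response, body) {
    if (error) {
      console.log("Error: " + error);
      return;
    }
    var $ = cheerio.load(body);
    console.log(url + " -> " + $('title').text().trim());
  });
}

For example, you could call crawlPage(link) inside the .each loop above to visit every post that the scraper finds.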
