<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rahul Pathak</title>
    <description>The latest articles on DEV Community by Rahul Pathak (@rahulpathakgithub).</description>
    <link>https://dev.to/rahulpathakgithub</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F514352%2Fc853f12b-8e74-4fd5-a488-c01b85ac4205.jpeg</url>
      <title>DEV Community: Rahul Pathak</title>
      <link>https://dev.to/rahulpathakgithub</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rahulpathakgithub"/>
    <language>en</language>
    <item>
      <title>What is a Web Crawler? | Building a simple web crawler with Node.js and JavaScript</title>
      <dc:creator>Rahul Pathak</dc:creator>
      <pubDate>Sun, 15 Nov 2020 16:56:31 +0000</pubDate>
      <link>https://dev.to/rahulpathakgithub/what-is-a-web-crawler-building-a-simple-web-crawler-with-node-js-and-javascript-4p56</link>
      <guid>https://dev.to/rahulpathakgithub/what-is-a-web-crawler-building-a-simple-web-crawler-with-node-js-and-javascript-4p56</guid>
      <description>&lt;p&gt;All of us use search engines almost daily. When most of us talk about search engines, we really mean the World Wide Web search engines. A very superficial overview of a search engine suggests that a user enters a &lt;em&gt;search term&lt;/em&gt; and the search engine gives a list of all relevant resources related to that &lt;em&gt;search term&lt;/em&gt;. But, to provide the user with a list of resources the search engine should know that they exist on the Internet.&lt;br&gt;
Search engines do not magically know what websites exist on the Internet. So this is exactly where the role of web crawlers comes into the picture.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is a &lt;em&gt;web crawler&lt;/em&gt;?
&lt;/h2&gt;

&lt;p&gt;A &lt;em&gt;web crawler&lt;/em&gt; is a bot that downloads and indexes content from all over the Internet. The aim of such a bot is to learn what every webpage on the Internet is about, so that the information can be retrieved easily when needed.&lt;br&gt;
A web crawler is like a librarian who organizes the books in a library and builds a catalog of them, so that readers can find a particular book easily. To categorize a book, the librarian reads its topic, its summary, and, if required, some of its content. &lt;/p&gt;
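&lt;p&gt;To make the catalog analogy concrete, here is a minimal sketch of an inverted index: a map from each word to the pages that contain it. The URLs and page texts below are made-up placeholders, not real crawled data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Minimal sketch of the "catalog" idea: an inverted index
// mapping each word to the list of pages containing it.
// The URLs and page texts are made-up placeholders.
var pages = {
  'https://example.com/a': 'node js web crawler tutorial',
  'https://example.com/b': 'cooking pasta tutorial'
};

var index = {};
Object.keys(pages).forEach(function (url) {
  pages[url].split(' ').forEach(function (word) {
    if (!index[word]) {
      index[word] = [];
    }
    if (index[word].indexOf(url) === -1) {
      index[word].push(url);
    }
  });
});

// Looking a word up now returns the relevant pages directly,
// without re-reading every page.
console.log(index['tutorial']); // both pages
console.log(index['crawler']);  // only the first page
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;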
&lt;h2&gt;
  
  
  How does a &lt;em&gt;web crawler&lt;/em&gt; work?
&lt;/h2&gt;

&lt;p&gt;It is very difficult to know how many webpages exist on the Internet in total. A web crawler starts with a certain number of known URLs, and as it crawls those webpages, it finds links to other webpages. A web crawler follows certain policies to decide what to crawl and how frequently to crawl it.&lt;br&gt;
Which webpages to crawl first is also decided by certain parameters. For instance, webpages with a lot of visitors are a good starting point, since a search engine will likely want them indexed.&lt;/p&gt;
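&lt;p&gt;The crawl loop described above can be sketched without any networking: start from a seed URL, keep a set of visited pages, and add newly discovered links to a queue. In this sketch, a hard-coded in-memory link graph stands in for real HTTP requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Sketch of the crawl loop. Each key is a page, each value is
// the list of links found on that page; a real crawler would
// fetch and parse the page here instead.
var web = {
  'https://seed.example': ['https://a.example', 'https://b.example'],
  'https://a.example': ['https://b.example', 'https://c.example'],
  'https://b.example': [],
  'https://c.example': ['https://seed.example'] // a cycle: must not loop forever
};

function crawl(seedUrl) {
  var queue = [seedUrl];
  var visited = {};
  var order = [];

  while (queue.length) {
    var url = queue.shift(); // breadth-first: take the oldest URL
    if (visited[url]) {
      continue; // already crawled, skip it
    }
    visited[url] = true;
    order.push(url);
    // "Fetch" the page and enqueue every link found on it.
    (web[url] || []).forEach(function (link) {
      if (!visited[link]) {
        queue.push(link);
      }
    });
  }
  return order;
}

console.log(crawl('https://seed.example'));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;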
&lt;h2&gt;
  
  
  Building a simple web crawler with Node.js and JavaScript
&lt;/h2&gt;

&lt;p&gt;We will be using the modules &lt;code&gt;cheerio&lt;/code&gt; and &lt;code&gt;request&lt;/code&gt; (note that &lt;code&gt;request&lt;/code&gt; is deprecated, but it still works for this example).&lt;br&gt;
Install these dependencies using the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install --save cheerio
npm install --save request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following code imports the required modules and makes a request to &lt;a href="https://news.ycombinator.com/news" rel="noopener noreferrer"&gt;Hacker News&lt;/a&gt;.&lt;br&gt;
We log the status code of the response to the console to see if the request was successful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');

request("https://news.ycombinator.com/news", (error, response, body) =&amp;gt; {
  if(error) {
    console.log("Error: " + error);
    return; // response is undefined when an error occurred
  }
  console.log("Status code: " + response.statusCode);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the &lt;code&gt;fs&lt;/code&gt; module, which is used to handle files, is built into Node.js, so it does not need to be installed.&lt;/p&gt;
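&lt;p&gt;As a quick illustration of why &lt;code&gt;appendFileSync&lt;/code&gt; is used later (each call adds to the file instead of overwriting it), here is a small standalone demo that writes to a throwaway file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// fs is built into Node.js, so no npm install is needed.
var fs = require('fs');

var file = 'demo.txt'; // throwaway file just for this demo
fs.writeFileSync(file, '');               // start with an empty file
fs.appendFileSync(file, 'first line\n');
fs.appendFileSync(file, 'second line\n'); // appends, does not overwrite

var contents = fs.readFileSync(file, 'utf8');
console.log(contents);
fs.unlinkSync(file); // clean up the demo file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;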

&lt;p&gt;We observe the structure of the page using our browser's developer tools. We see that there are &lt;code&gt;tr&lt;/code&gt; elements with the &lt;code&gt;athing&lt;/code&gt; class. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fd1q3mf4fnpw7vvbryg0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fd1q3mf4fnpw7vvbryg0g.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will go through all the &lt;code&gt;tr.athing&lt;/code&gt; elements and get the title of each post by selecting the child &lt;code&gt;td.title&lt;/code&gt; element, and its hyperlink by selecting the &lt;code&gt;a&lt;/code&gt; element. This is done by adding the following code after the &lt;code&gt;console.log&lt;/code&gt; of the previous code block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  var $ = cheerio.load(body);

  $('tr.athing:has(td.votelinks)').each(function( index ) {
    var title = $(this).find('td.title &amp;gt; a').text().trim();
    var link = $(this).find('td.title &amp;gt; a').attr('href');
    fs.appendFileSync('hackernews.txt', title + '\n' + link + '\n');
  });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also skip the posts related to hiring (if observed carefully, we see that the &lt;code&gt;tr.athing&lt;/code&gt; element of such a post does not have a child &lt;code&gt;td.votelinks&lt;/code&gt; element, which is why the selector includes &lt;code&gt;:has(td.votelinks)&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fir5gq5hqp4gegdd9zqu6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fir5gq5hqp4gegdd9zqu6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The complete code now looks like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;var request = require('request');
var cheerio = require('cheerio');
var fs = require('fs');

request("https://news.ycombinator.com/news", function(error, response, body) {
  if(error) {
    console.log("Error: " + error);
    return; // response is undefined when an error occurred
  }
  console.log("Status code: " + response.statusCode);

  var $ = cheerio.load(body);

  $('tr.athing:has(td.votelinks)').each(function( index ) {
    var title = $(this).find('td.title &amp;gt; a').text().trim();
    var link = $(this).find('td.title &amp;gt; a').attr('href');
    fs.appendFileSync('hackernews.txt', title + '\n' + link + '\n');
  });

});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data is stored in a file named &lt;code&gt;hackernews.txt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1uggj1wby5ydyhc3dhgn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1uggj1wby5ydyhc3dhgn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Your simple web crawler is ready!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.cloudflare.com/learning/bots/what-is-a-web-crawler/" rel="noopener noreferrer"&gt;https://www.cloudflare.com/learning/bots/what-is-a-web-crawler/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.netinstructions.com/simple-web-scraping-with-node-js-and-javascript/" rel="noopener noreferrer"&gt;http://www.netinstructions.com/simple-web-scraping-with-node-js-and-javascript/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://computer.howstuffworks.com/internet/basics/search-engine1.htm" rel="noopener noreferrer"&gt;https://computer.howstuffworks.com/internet/basics/search-engine1.htm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
  </channel>
</rss>
