<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pranav MM</title>
    <description>The latest articles on DEV Community by Pranav MM (@pranavmuttathil).</description>
    <link>https://dev.to/pranavmuttathil</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1599947%2F9dca90e8-d3c7-4938-af78-ae94dbbef01b.jpeg</url>
      <title>DEV Community: Pranav MM</title>
      <link>https://dev.to/pranavmuttathil</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pranavmuttathil"/>
    <language>en</language>
    <item>
      <title>Web Scraping Vs Web Crawling</title>
      <dc:creator>Pranav MM</dc:creator>
      <pubDate>Wed, 12 Jun 2024 17:19:48 +0000</pubDate>
      <link>https://dev.to/pranavmuttathil/web-scraping-vs-web-crawling-2me3</link>
      <guid>https://dev.to/pranavmuttathil/web-scraping-vs-web-crawling-2me3</guid>
      <description>&lt;h2&gt;
  
  
  Web Scraping or Web Crawling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Search and gather&lt;/strong&gt;, a.k.a. crawling and scraping, refers to the acquisition of important website data through the use of automated bots. Web scraping is commonly used to track and analyze data and compare it against its former self; examples include &lt;strong&gt;&lt;em&gt;market data, finance, e-commerce and retail&lt;/em&gt;&lt;/strong&gt;. Now you may ask: what exactly does it mean to crawl a website, and what does it mean to scrape one?&lt;/p&gt;

&lt;h3&gt;
  
  
  | How are they related to each other?
&lt;/h3&gt;

&lt;p&gt;Suppose you have a Gmail account with no storage left (which I hope you don't) and you need to retrieve one important file. What would you do? You would &lt;strong&gt;&lt;del&gt;give up&lt;/del&gt;&lt;/strong&gt; go through each file and &lt;em&gt;&lt;strong&gt;Stalin sort&lt;/strong&gt;&lt;/em&gt; them until you find the right one. This combined action of separating and acquiring the important data translates directly to webpages, where it is termed &lt;strong&gt;crawling and gathering&lt;/strong&gt;. &lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Good, the Bad and the Wayback machine&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Established in 1996 by Brewster Kahle and Bruce Gilliat, the Wayback Machine, a.k.a. &lt;strong&gt;the Internet Archive&lt;/strong&gt;, is the warehouse of digital content that has stood the test of time. It allows users to access archived versions of a website, even letting you navigate a site as it has looked ever since its establishment. It works by sending automated &lt;strong&gt;web crawlers&lt;/strong&gt; to various &lt;strong&gt;publicly available websites&lt;/strong&gt; and taking snapshots. It can be easily accessed and used by all, at &lt;a href="https://wayback-api.archive.org/"&gt;https://wayback-api.archive.org/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo526iumm1r3vrfikgfeb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo526iumm1r3vrfikgfeb.png" width="800" height="386"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
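&lt;p&gt;As a quick illustration, the Internet Archive exposes a public availability API that returns the archived snapshot closest to a given timestamp. Below is a minimal, hedged sketch in Python; the helper names are my own, not part of any official client:&lt;/p&gt;

```python
# Hedged sketch (not from the original article): querying the Wayback
# Machine's public availability API for the archived snapshot of a URL
# closest to a given timestamp.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"

def parse_snapshot(data):
    """Pull the closest snapshot URL out of an availability-API response."""
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

def closest_snapshot(url, timestamp=None):
    """Ask the API for the snapshot of `url` closest to `timestamp` (YYYYMMDDhhmmss)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    with urlopen(WAYBACK_API + "?" + urlencode(params)) as resp:
        return parse_snapshot(json.load(resp))
```

&lt;p&gt;For example, &lt;code&gt;closest_snapshot("example.com", "20100101")&lt;/code&gt; would return the URL of the archived copy nearest to January 2010, or &lt;code&gt;None&lt;/code&gt; if no snapshot exists.&lt;/p&gt;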
&lt;h1&gt;
  
  
  What it can't store
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;"&lt;em&gt;&lt;strong&gt;With large data comes big storage bills&lt;/strong&gt;&lt;/em&gt;", With a infinite pile of information coming up on its doorsteps, its storage capabilites have increased tenfolds. As of January 2024, It stores around &lt;strong&gt;99 Petabytes&lt;/strong&gt;, and is expected to increase about &lt;strong&gt;100 Terabytes per month&lt;/strong&gt;, such renders the Internet Archive unable to store the following &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic Pages &lt;/li&gt;
&lt;li&gt;Emails&lt;/li&gt;
&lt;li&gt;Chats&lt;/li&gt;
&lt;li&gt;Databases&lt;/li&gt;
&lt;li&gt;Classified Military Content &lt;em&gt;(Obviously)&lt;/em&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;



&lt;blockquote&gt;
&lt;h1&gt;
  
  
  &lt;em&gt;"Talk is Cheap. Show me the Code"&lt;/em&gt;
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;h2&gt;
  
  
  -Linus Torvalds
&lt;/h2&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;

&lt;p&gt;Creating your own time capsule is easy: set up a web crawler that visits the website and collects data at regular intervals. Creating your own scraping bot is easily achievable using libraries like &lt;strong&gt;BeautifulSoup&lt;/strong&gt; (for &lt;strong&gt;Python&lt;/strong&gt;) and &lt;strong&gt;Cheerio&lt;/strong&gt; (for &lt;strong&gt;JavaScript&lt;/strong&gt;)&lt;/p&gt;
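&lt;p&gt;The "regular intervals" idea can be sketched in a few lines of standard-library Python. The URL, interval and iteration count below are illustrative assumptions, not values from this article:&lt;/p&gt;

```python
# Hedged sketch of a personal "time capsule": fetch a page at a fixed
# interval and write each copy to a timestamped file.
import time
from datetime import datetime, timezone
from urllib.request import urlopen

def snapshot_name(stamp):
    """Build the output filename for a snapshot taken at `stamp`."""
    return f"snapshot-{stamp}.html"

def take_snapshots(url, interval_seconds=3600, iterations=3):
    """Fetch `url` `iterations` times, `interval_seconds` apart, saving each copy."""
    for i in range(iterations):
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
        with urlopen(url) as resp:
            body = resp.read()
        with open(snapshot_name(stamp), "wb") as f:
            f.write(body)
        if i < iterations - 1:
            time.sleep(interval_seconds)
```

&lt;p&gt;In a real deployment you would more likely schedule a single-shot script with cron or a task queue rather than sleep inside a loop, but the idea is the same.&lt;/p&gt;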
&lt;h2&gt;
  
  
  For Python Enthusiasts
&lt;/h2&gt;

&lt;p&gt;| You can install the library using the following &lt;strong&gt;pip&lt;/strong&gt; command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;beautifulsoup4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It utilises the &lt;strong&gt;requests&lt;/strong&gt; library to fetch pages and &lt;strong&gt;BeautifulSoup&lt;/strong&gt; to parse the returned HTML.&lt;/p&gt;

&lt;h3&gt;
  
  
  | Code:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;crawl_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;links&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a_tag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;href&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a_tag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;links&lt;/span&gt;

&lt;span class="n"&gt;seed_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://en.wikipedia.org/wiki/Ludic_fallacy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;visited_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;crawl_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;visited_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;
  &lt;span class="n"&gt;visited_urls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;links&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;crawl_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://en.wikipedia.org/wiki/Ludic_fallacy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Crawled URLs:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;visited_urls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="sb"&gt;``&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  For Javascript Enthusiasts
&lt;/h2&gt;

&lt;p&gt;| Prerequisites include libraries such as &lt;strong&gt;Axios&lt;/strong&gt; and &lt;strong&gt;Cheerio&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

npm install axios cheerio


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Axios handles making HTTP requests to the website, while Cheerio parses the incoming HTML and lets you extract valuable information using &lt;strong&gt;CSS-style selectors&lt;/strong&gt;; the extracted data can then be saved as JSON objects with named properties. &lt;/p&gt;

&lt;h3&gt;
  
  
  | Code:
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
const axios = require('axios');
const cheerio = require('cheerio');
const targetUrl = 'https://en.wikipedia.org/wiki/Ludic_fallacy';

async function scrapeData() {
  try {
    const response = await axios.get(targetUrl);
    const html = response.data;
    const $ = cheerio.load(html);
    const titles = $('h1').text().trim();
    const descriptions = $('p').text().trim();
    console.log('Titles:', titles);
    console.log('Descriptions:', descriptions);
  } catch (error) {
    console.error('Error scraping data:', error);
  }
}

scrapeData();


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure to be mindful of the website's &lt;strong&gt;terms and conditions&lt;/strong&gt; and abide by its &lt;strong&gt;robots.txt&lt;/strong&gt; to practice ethical scraping and keep yourself out of legal trouble, and have fun coding along the way.&lt;/p&gt;
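&lt;p&gt;Python's standard library can even check robots.txt rules for you. Here is a small hedged sketch using &lt;strong&gt;urllib.robotparser&lt;/strong&gt;; the rules and the "MyCrawler" user agent are made-up examples:&lt;/p&gt;

```python
# Hedged sketch: honouring robots.txt with Python's standard-library
# urllib.robotparser. In practice you would load the real file with
# rp.set_url("https://example.com/robots.txt"); rp.read() -- here we
# parse example rules directly to show the check itself.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Check a path before fetching it with your crawler.
print(rp.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
```

&lt;p&gt;Calling &lt;code&gt;can_fetch&lt;/code&gt; before every request is a cheap way to keep a crawler inside the lines the site owner has drawn.&lt;/p&gt;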

</description>
      <category>webscraping</category>
      <category>webcrawling</category>
      <category>javascript</category>
      <category>python</category>
    </item>
  </channel>
</rss>
