<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pranav MM</title>
    <description>The latest articles on DEV Community by Pranav MM (@pranavmuttathil).</description>
    <link>https://dev.to/pranavmuttathil</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1599947%2F9dca90e8-d3c7-4938-af78-ae94dbbef01b.jpeg</url>
      <title>DEV Community: Pranav MM</title>
      <link>https://dev.to/pranavmuttathil</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pranavmuttathil"/>
    <language>en</language>
    <item>
      <title>Web Scraping Vs Web Crawling</title>
      <dc:creator>Pranav MM</dc:creator>
      <pubDate>Wed, 12 Jun 2024 17:19:48 +0000</pubDate>
      <link>https://dev.to/pranavmuttathil/web-scraping-vs-web-crawling-2me3</link>
      <guid>https://dev.to/pranavmuttathil/web-scraping-vs-web-crawling-2me3</guid>
      <description>&lt;h2&gt;
  
  
  Web Scraping or Web Crawling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Search and gather&lt;/strong&gt;, a.k.a. crawling and scraping, refers to the acquisition of important website data through the use of automated bots. Web scraping is commonly used to track and analyze data and compare it against its former self; examples include &lt;strong&gt;&lt;em&gt;market data, finance, e-commerce and retail&lt;/em&gt;&lt;/strong&gt;. Now you may ask: what exactly does it mean to crawl a website, and what does it mean to scrape one?&lt;/p&gt;

&lt;h3&gt;
  
  
  | How are they related to each other?
&lt;/h3&gt;

&lt;p&gt;Suppose you have a Gmail account with no storage left (which I hope you don't) and you need to retrieve one important file. What would you do? You would &lt;strong&gt;&lt;del&gt;give up&lt;/del&gt;&lt;/strong&gt; go through each file and &lt;em&gt;&lt;strong&gt;Stalin sort&lt;/strong&gt;&lt;/em&gt; them until you find the right one. This combined action of separating and acquiring the important data translates directly to webpages, where it is termed &lt;strong&gt;crawling and gathering&lt;/strong&gt;. &lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Good, the Bad and the Wayback machine&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Established in 1996 by Brewster Kahle and Bruce Gilliat, the Wayback Machine, a.k.a. &lt;strong&gt;the Internet Archive&lt;/strong&gt;, is the warehouse of digital content that has stood the test of time. It allows users to access archived versions of a website, even letting you navigate a site as it has looked ever since its establishment. It works by sending automated &lt;strong&gt;web crawlers&lt;/strong&gt; to various &lt;strong&gt;publicly available websites&lt;/strong&gt; and taking snapshots. It can be easily accessed and used by all, at &lt;a href="https://wayback-api.archive.org/"&gt;https://wayback-api.archive.org/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo526iumm1r3vrfikgfeb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo526iumm1r3vrfikgfeb.png" width="800" height="386"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
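&lt;p&gt;As a quick illustration, the Internet Archive exposes a public availability API that returns the archived snapshot closest to a given timestamp. Below is a minimal, hedged sketch in Python; the helper names are my own, not part of any official client:&lt;/p&gt;

```python
# Hedged sketch (not from the original article): querying the Wayback
# Machine's public availability API for the archived snapshot of a URL
# closest to a given timestamp.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"

def parse_snapshot(data):
    """Pull the closest snapshot URL out of an availability-API response."""
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

def closest_snapshot(url, timestamp=None):
    """Ask the API for the snapshot of `url` closest to `timestamp` (YYYYMMDDhhmmss)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    with urlopen(WAYBACK_API + "?" + urlencode(params)) as resp:
        return parse_snapshot(json.load(resp))
```

&lt;p&gt;For example, &lt;code&gt;closest_snapshot("example.com", "20100101")&lt;/code&gt; would return the URL of the archived copy nearest to January 2010, or &lt;code&gt;None&lt;/code&gt; if no snapshot exists.&lt;/p&gt;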
&lt;h1&gt;
  
  
  What it can't store
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;"&lt;em&gt;&lt;strong&gt;With large data comes big storage bills&lt;/strong&gt;&lt;/em&gt;", With a infinite pile of information coming up on its doorsteps, its storage capabilites have increased tenfolds. As of January 2024, It stores around &lt;strong&gt;99 Petabytes&lt;/strong&gt;, and is expected to increase about &lt;strong&gt;100 Terabytes per month&lt;/strong&gt;, such renders the Internet Archive unable to store the following &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic Pages &lt;/li&gt;
&lt;li&gt;Emails&lt;/li&gt;
&lt;li&gt;Chats&lt;/li&gt;
&lt;li&gt;Databases&lt;/li&gt;
&lt;li&gt;Classified Military Content &lt;em&gt;(Obviously)&lt;/em&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;



&lt;blockquote&gt;
&lt;h1&gt;
  
  
  &lt;em&gt;"Talk is Cheap. Show me the Code"&lt;/em&gt;
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;h2&gt;
  
  
  -Linus Torvalds
&lt;/h2&gt;
&lt;/blockquote&gt;
&lt;/blockquote&gt;

&lt;p&gt;Creating your own time capsule is easy: set up a web crawler that visits the website and collects data at regular intervals. Creating your own scraping bot is easily achievable using libraries like &lt;strong&gt;BeautifulSoup&lt;/strong&gt; (for &lt;strong&gt;Python&lt;/strong&gt;) and &lt;strong&gt;Cheerio&lt;/strong&gt; (for &lt;strong&gt;JavaScript&lt;/strong&gt;)&lt;/p&gt;
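&lt;p&gt;The "regular intervals" idea can be sketched in a few lines of standard-library Python. The URL, interval and iteration count below are illustrative assumptions, not values from this article:&lt;/p&gt;

```python
# Hedged sketch of a personal "time capsule": fetch a page at a fixed
# interval and write each copy to a timestamped file.
import time
from datetime import datetime, timezone
from urllib.request import urlopen

def snapshot_name(stamp):
    """Build the output filename for a snapshot taken at `stamp`."""
    return f"snapshot-{stamp}.html"

def take_snapshots(url, interval_seconds=3600, iterations=3):
    """Fetch `url` `iterations` times, `interval_seconds` apart, saving each copy."""
    for i in range(iterations):
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
        with urlopen(url) as resp:
            body = resp.read()
        with open(snapshot_name(stamp), "wb") as f:
            f.write(body)
        if i < iterations - 1:
            time.sleep(interval_seconds)
```

&lt;p&gt;In a real deployment you would more likely schedule a single-shot script with cron or a task queue rather than sleep inside a loop, but the idea is the same.&lt;/p&gt;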
&lt;h2&gt;
  
  
  For Python Enthusiasts
&lt;/h2&gt;

&lt;p&gt;| You can install the library using the following &lt;strong&gt;pip&lt;/strong&gt; command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;beautifulsoup4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It utilises the &lt;strong&gt;requests&lt;/strong&gt; library to fetch pages and &lt;strong&gt;BeautifulSoup&lt;/strong&gt; to parse the returned HTML.&lt;/p&gt;

&lt;h3&gt;
  
  
  | Code:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;crawl_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;links&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a_tag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;href&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a_tag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;links&lt;/span&gt;

&lt;span class="n"&gt;seed_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://en.wikipedia.org/wiki/Ludic_fallacy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;visited_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="n"&gt;crawl_depth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;visited_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;
  &lt;span class="n"&gt;visited_urls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;links&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;crawl_page&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;crawl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://en.wikipedia.org/wiki/Ludic_fallacy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Crawled URLs:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;visited_urls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="sb"&gt;``&lt;/span&gt;&lt;span class="err"&gt;`&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  For Javascript Enthusiasts
&lt;/h2&gt;

&lt;p&gt;| Prerequisites include libraries such as &lt;strong&gt;Axios&lt;/strong&gt; and &lt;strong&gt;Cheerio&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

npm install axios cheerio


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Axios handles making HTTP requests to the website, while Cheerio parses the incoming HTML and lets you extract valuable information using &lt;strong&gt;CSS-style selectors&lt;/strong&gt;; the extracted data can then be saved as JSON objects with named properties. &lt;/p&gt;

&lt;h3&gt;
  
  
  | Code:
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
const axios = require('axios');
const cheerio = require('cheerio');
const targetUrl = 'https://en.wikipedia.org/wiki/Ludic_fallacy';

async function scrapeData() {
  try {
    const response = await axios.get(targetUrl);
    const html = response.data;
    const $ = cheerio.load(html);
    const titles = $('h1').text().trim();
    const descriptions = $('p').text().trim();
    console.log('Titles:', titles);
    console.log('Descriptions:', descriptions);
  } catch (error) {
    console.error('Error scraping data:', error);
  }
}

scrapeData();


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure to be mindful of the website's &lt;strong&gt;terms and conditions&lt;/strong&gt; and abide by its &lt;strong&gt;robots.txt&lt;/strong&gt; to practice ethical scraping and keep yourself out of legal trouble, and have fun coding along the way.&lt;/p&gt;
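&lt;p&gt;Python's standard library can even check robots.txt rules for you. Here is a small hedged sketch using &lt;strong&gt;urllib.robotparser&lt;/strong&gt;; the rules and the "MyCrawler" user agent are made-up examples:&lt;/p&gt;

```python
# Hedged sketch: honouring robots.txt with Python's standard-library
# urllib.robotparser. In practice you would load the real file with
# rp.set_url("https://example.com/robots.txt"); rp.read() -- here we
# parse example rules directly to show the check itself.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Check a path before fetching it with your crawler.
print(rp.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
```

&lt;p&gt;Calling &lt;code&gt;can_fetch&lt;/code&gt; before every request is a cheap way to keep a crawler inside the lines the site owner has drawn.&lt;/p&gt;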

</description>
      <category>webscraping</category>
      <category>webcrawling</category>
      <category>javascript</category>
      <category>python</category>
    </item>
  </channel>
</rss>
