WEB CRAWLING

INTRO

The World Wide Web generates roughly 10 exabytes of information per day, but have we ever wondered how search engines return such relevant results when we search, or how our content and blogs gain priority and appear on the first page of results?

On the internet there are bots, called spiders or crawlers, that read the information we upload. The spider then indexes that information, and the search engine ranks pages by relevance.
Curious? Let's find out how.

Search Engine Fundamentals

A search engine performs three processes when information is uploaded to the internet:
Crawling: discover the information.
Indexing: organize the information.
Ranking: decide which results appear on the first page of a search.

Crawling

A search engine navigates the web by downloading web pages and following the links on those pages to discover new ones. Crawlers also discover new pages by reading sitemaps.
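
To make that loop concrete, here is a minimal, hypothetical sketch of a link-following crawler in plain Python (standard library only). The seed URL and page limit are placeholders, and a real crawler would also respect robots.txt and crawl politely.

# Minimal crawl loop: download a page, collect its links, and repeat.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except OSError:
            continue  # skip pages that fail to download
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

# crawl("https://example.com")  # placeholder seed URL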

Indexing

Organizing things saves time and helps us find what we are looking for. Indexing is the way a search engine organizes information so that it can retrieve the relevant results we are looking for on the World Wide Web.
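
As a rough illustration of what indexing means, here is a toy inverted index in Python: each word maps to the documents that contain it, which is (very loosely) how a search engine can look up relevant pages quickly. The documents and lookups are made up for the example.

# Toy inverted index: map each word to the set of documents containing it.
docs = {
    1: "web crawlers discover pages",
    2: "search engines index pages",
    3: "indexing helps retrieve relevant pages",
}

index = {}
for doc_id, text in docs.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(doc_id)

# Looking up a word returns the documents to rank and show in the results.
print(index.get("pages"))     # {1, 2, 3}
print(index.get("indexing"))  # {3}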

Ranking:

The World Wide Web is huge, so providing relevant and up-to-date information is a challenge. Search engines use algorithms such as PageRank to judge the importance of a page: the value of each page is calculated from the number and quality of links pointing to it, and pages are ranked relative to one another accordingly.
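
As a sketch of the idea (not the search engine's actual implementation), the simplified PageRank iteration below scores three hypothetical pages by the links pointing to them:

# Simplified PageRank on a tiny, made-up link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
damping, n = 0.85, len(links)
rank = {page: 1.0 / n for page in links}

for _ in range(20):  # iterate until the scores settle
    new_rank = {}
    for page in links:
        incoming = sum(rank[p] / len(out) for p, out in links.items() if page in out)
        new_rank[page] = (1 - damping) / n + damping * incoming
    rank = new_rank

print(rank)  # pages with more (and better-ranked) incoming links score higher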

CRAWL PAGES AND EXTRACT INFORMATION

Web scraping is an effective way of extracting information from the web; the data can be stored in formats such as Excel or CSV and used for decision-making and analysis.
Let's learn to crawl the web using a shell.
To build a web crawler we can use Beautiful Soup or Scrapy.

INSTALLATION

Scrapy supports Python 2 and 3 (recent Scrapy releases require Python 3).

If you are using Anaconda, install the package from the conda-forge channel.

To install Scrapy using conda, run:

conda install -c conda-forge scrapy

On Linux (or any system with pip), you can instead run:

pip install scrapy
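
To check that the installation worked, you can print the installed version (assuming the scrapy command is now on your PATH):

scrapy version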

Scrapy provides an interactive web-crawling shell, known as the Scrapy shell, that can be used to learn and test spider functions before writing a full spider.

fetch('URL') – a command in the Scrapy shell that downloads the page's text and metadata into a response object.
To run the crawler in the shell, call fetch with a link between the brackets.

view(response) – a command which opens the downloaded webpage in the default browser.
print(response.text) – prints the raw HTML of the response in the Scrapy shell.
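
Putting these together, a typical Scrapy shell session looks roughly like this (the URL is just a placeholder; any public page works):

$ scrapy shell
>>> fetch('https://example.com')
>>> view(response)          # opens the downloaded page in the default browser
>>> print(response.text)    # prints the raw HTML of the response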


As shown above, the command opens the downloaded page in the default browser.

Extracting the title of the product using the browser's developer tools
To open the developer tools, right-click on the page and click Inspect.


Find the class that represents the title of the product.

response.css('span.a-size-extra-large::text').extract()
This command extracts the text of a particular section (here, the product's name).
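
For example, after fetching a product page in the shell (the class name a-size-extra-large is the one found in the screenshot above; it will differ on other sites):

>>> response.css('span.a-size-extra-large::text').extract()        # list of all matching text nodes
>>> response.css('span.a-size-extra-large::text').extract_first()  # first match only
>>> response.css('span.a-size-extra-large::text').getall()         # newer Scrapy equivalent of extract()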

CONCLUSION

We can use a crawler to extract information from websites and store it for analysis.
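
Beyond the shell, the same selectors can be dropped into a small Scrapy spider. The sketch below is a minimal, hypothetical example: the spider name, start URL, and CSS selector are placeholders you would replace with your own.

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"                       # hypothetical spider name
    start_urls = ["https://example.com"]    # placeholder start URL

    def parse(self, response):
        # Extract the text of each matching element, as done in the shell above.
        for title in response.css("span.a-size-extra-large::text").getall():
            yield {"product_name": title.strip()}

Save it to a file and run it with scrapy runspider spider.py -o products.csv to write the extracted data to a CSV file, as mentioned earlier.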

Thanks for Reading.
