Build a simple Python web crawler

pranay749254 ・3 min read

What is a Web Crawler?

A web crawler is an internet bot used for web indexing on the World Wide Web. Search engines rely on web crawlers to provide efficient results. A crawler collects all, or some specific, hyperlinks and HTML content from websites and presents them in a suitable form. When there is a huge number of links to crawl, even the largest crawler can fall short. This is one reason search engines in the early 2000s were bad at providing relevant results; the process has improved a great deal since then, and relevant results now come back almost instantly.
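To make the idea of "following hyperlinks" concrete, here is a minimal sketch of a breadth-first crawl. It is not how any particular search engine works. The PAGES dictionary, the example.test URLs, and the names LinkCollector and crawl are all made up for illustration, and the "web" is a small in-memory dictionary so the sketch runs without network access. It uses only the Python standard library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch these over HTTP.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.test/b": '<a href="/">Home</a>',
    "http://example.test/c": "<p>No links here</p>",
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, max_pages=10):
    """Visit pages breadth-first, skipping URLs already seen, up to max_pages."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:  # unknown page: nothing to parse
            continue
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


print(crawl("http://example.test/"))
```

The seen set keeps the crawler from looping between pages that link to each other, and max_pages is a simple cap for the case where there are far more links than can reasonably be visited.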

Python Web Crawler

The web crawler here is written in Python 3. Python is a high-level programming language that supports object-oriented, imperative, and functional programming and ships with a large standard library.
The crawler uses two third-party libraries, requests and BeautifulSoup4 (both installable with pip). requests provides an easy way to fetch pages over HTTP, and BeautifulSoup4 is used to parse the HTML and pull out specific elements.
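As a quick illustration of the parsing side, the sketch below runs BeautifulSoup on a small hand-written HTML snippet instead of a downloaded page. The snippet, the class name item, and the variable names are invented for this example.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet standing in for a downloaded page.
html = """
<ul>
  <li><a class="item" href="/p/1" title="First product">First</a></li>
  <li><a class="item" href="/p/2">Second</a></li>
  <li><a class="other" href="/about">About</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; class_ filters on the CSS class.
items = soup.find_all("a", class_="item")

# .get() returns None when an attribute is missing instead of raising.
rows = [(a.get("title"), a.get("href"), a.get_text(strip=True)) for a in items]
print(rows)
```

The second link has no title attribute, so its first field comes back as None; code that prints or stores these values should be ready for that.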

Example Code

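import requests
from bs4 import BeautifulSoup


def web(page, WebUrl):
    """Print the title and link of every product anchor on the page at WebUrl."""
    if page > 0:
        # Download the page; the timeout stops a stalled request from hanging indefinitely
        response = requests.get(WebUrl, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        # Product links on this listing page carry the class below
        for link in soup.find_all('a', {'class': 's-access-detail-page'}):
            print(link.get('title'))
            print(link.get('href'))

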
web(1,'http://www.amazon.in/s/ref=s9_acss_bw_cts_VodooFS_T4_w?rh=i%3Aelectronics%2Cn%3A976419031%2Cn%3A%21976420031%2Cn%3A1389401031%2Cn%3A1389432031%2Cn%3A1805560031%2Cp_98%3A10440597031%2Cp_36%3A1500000-99999999&bbn=1805560031&rw_html_to_wsrp=1&pf_rd_m=A1K21FY43GMZF8&pf_rd_s=merchandised-search-3&pf_rd_r=2EKZMFFDEXJ5HE8RVV6E&pf_rd_t=101&pf_rd_p=c92c2f88-469b-4b56-936e-0e65f92eebac&pf_rd_i=1389432031')

Output:

C:\Python34\python.exe C:/Users/Babuya/PycharmProjects/Youtube/web_cr.py
Apple iPhone 6 (Gold, 32GB)
http://www.amazon.in/Apple-iPhone-6-Gold-32GB/dp/B0725RBY9V
OnePlus 5 (Slate Gray 6GB RAM + 64GB memory)
http://www.amazon.in/OnePlus-Slate-Gray-64GB-memory/dp/B01NAKTR2H
OnePlus 5 (Midnight Black 8GB RAM + 128GB memory)
http://www.amazon.in/OnePlus-Midnight-Black-128GB-memory/dp/B01MXZW51M
Apple iPhone 6 (Space Grey, 32GB)
http://www.amazon.in/Apple-iPhone-Space-Grey-32GB/dp/B01NCN4ICO
OnePlus 5 (Soft Gold, 6GB RAM + 64GB memory)
http://www.amazon.in/OnePlus-Soft-Gold-64GB-memory/dp/B01N1TYZR2
Mi Max 2 (Black, 64 GB)
http://www.amazon.in/Mi-Max-Black-64-GB/dp/B073VLGL5Y
Moto G5 Plus (32GB, Fine Gold)
http://www.amazon.in/Moto-Plus-32GB-Fine-Gold/dp/B071ZZ8N5Y
Apple iPhone SE (Space Grey, 32GB)
http://www.amazon.in/Apple-iPhone-SE-Space-Grey/dp/B071DF166C
Honor 8 Pro (Blue, 6GB RAM + 128GB Memory)
http://www.amazon.in/Honor-Pro-Blue-128GB-Memory/dp/B01N4FMUFH
Apple iPhone 7 (Black, 32GB)
http://www.amazon.in/Apple-iPhone-7-Black-32GB/dp/B01LZKSVRB
BlackBerry KEYone (LIMITED EDITION BLACK)
http://www.amazon.in/BlackBerry-KEYone-LIMITED-EDITION-BLACK/dp/B073ZLLVQ9
Apple iPhone SE (Gold, 32GB)
http://www.amazon.in/Apple-iPhone-SE-Gold-32GB/dp/B071RC52N6
Apple iPhone SE (Rose Gold, 32GB)
http://www.amazon.in/Apple-iPhone-SE-Rose-Gold/dp/B06ZXWWD6R
Apple iPhone 6s (Space Grey, 32GB)
http://www.amazon.in/Apple-iPhone-Space-Grey-32GB/dp/B01LX3A7CC
Samsung Galaxy J7 Max (Gold, 32GB)
http://www.amazon.in/Samsung-Galaxy-J7-Max-Gold/dp/B073PWKTRS
Honor 8 Pro (Black, 6GB RAM + 128GB Memory)
http://www.amazon.in/Honor-Pro-Black-128GB-Memory/dp/B01MQXNY1L
Samsung Galaxy J7 Max (Black, 32GB)
http://www.amazon.in/Samsung-Galaxy-J7-Max-Black/dp/B073PWDMHD
OnePlus 3T (Soft Gold, 6GB RAM + 64GB memory)
http://www.amazon.in/OnePlus-3T-Soft-Gold-memory/dp/B01FM7J3NA
Apple iPhone 6s (Gold, 32GB)
http://www.amazon.in/Apple-iPhone-6s-Gold-32GB/dp/B01M0CJNVL
Apple iPhone 6s (Rose Gold, 32GB)
http://www.amazon.in/Apple-iPhone-Rose-Gold-32GB/dp/B01LXF3SP9
Samsung Galaxy C7 Pro (Navy Blue, 64GB)
http://www.amazon.in/Samsung-Galaxy-Navy-Blue-64GB/dp/B01LXMHNMQ
Samsung J7 Prime 32GB ( Gold ) 4G VoLTE
http://www.amazon.in/Samsung-J7-Prime-32GB-VoLTE/dp/B06Y3HFZBQ
Vivo V5s (Matte Black) with Offers
http://www.amazon.in/Vivo-V5s-Matte-Black-Offers/dp/B071P2FNF2
Vivo V5s (Crown Gold) with Offers
http://www.amazon.in/Vivo-V5s-Crown-Gold-Offers/dp/B071VT6RG2

This crawler collects the product titles and the corresponding product-page links from one page of amazon.in. The user just needs to specify what kind of data or links should be crawled. Although the main use of web crawlers is in search engines, they can also be used this way to collect useful information.
First, all the HTML of the page is fetched as plain text using requests. It is then converted into a BeautifulSoup object. From that object, the title and href of every a tag with the class s-access-detail-page are read. That is how this basic web crawler works.
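The same fetch, parse, and extract steps can also be split into two small functions, which makes the parsing part easy to try without hitting a live site. This is only a sketch under some assumptions: the function names fetch_html and extract_links, the 10-second timeout, and the example.test URLs are not from the original code, and resolving relative links with urljoin is an addition.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_links(html, base_url, css_class):
    """Return (title, absolute_url) pairs for <a> tags with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a", class_=css_class):
        href = link.get("href")
        if not href:
            continue  # skip anchors without a target
        results.append((link.get("title"), urljoin(base_url, href)))
    return results


# Offline check with a made-up snippet; no network access needed.
sample = '''
<a class="s-access-detail-page" title="Phone A" href="/Phone-A/dp/X1">Phone A</a>
<a class="s-access-detail-page" title="Phone B" href="http://example.test/Phone-B/dp/X2">Phone B</a>
<a class="nav" href="/help">Help</a>
'''
print(extract_links(sample, "http://example.test/s", "s-access-detail-page"))
```

For a live page, the two pieces would be chained as extract_links(fetch_html(url), url, "s-access-detail-page"). Keep in mind that sites change their markup over time, so a class name that matched once may stop matching later.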

Discussion


In addition to scraping, you do need a proxy in most of the cases. Smartproxy seems to have the best cost to quality ratio at the moment. Are you covering your proxies?

 
Any of the residential services does work alright for scraping. Have tried Smartproxy and Luminati - both are quality. However, Smartproxy is a lot cheaper, Luminati has a higher IP pool.

 

Great, scraping is so great with Python. Have you ever wondered about using something like Scrapy from here