DEV Community


Build a simple Python web crawler

What is a Web Crawler?

A web crawler is an internet bot used to index the World Wide Web. Search engines rely on web crawlers to return relevant results. A crawler collects all, or a selected subset, of the hyperlinks and HTML content from websites and presents them in a suitable form. When there is a huge number of links to crawl, even the largest crawler cannot cover them all. This is one reason search engines in the early 2000s were poor at returning relevant results; the process has since improved a great deal, and relevant results now arrive almost instantly.
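
To make the idea concrete, here is a minimal sketch of the link-following loop at the core of a crawler. It is not how any particular search engine works: the PAGES dictionary is a made-up, in-memory stand-in for the web, and LinkParser, extract_links and crawl are hypothetical names chosen for this illustration. It uses only the standard library, so it runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML.
# A real crawler would fetch each page over HTTP instead.
PAGES = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/">Home</a>',
    "/c": "<p>No links here</p>",
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return the hrefs of all <a> tags in an HTML string."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def crawl(start, max_pages=100):
    """Breadth-first crawl from `start`; returns URLs in visiting order."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in extract_links(PAGES.get(url, "")):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("/"))  # ['/', '/a', '/b', '/c']
```

The seen set is what keeps the crawler from looping forever when pages link back to each other, and max_pages is a simple cap for the case where there are more links than the crawler can cover.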

Python Web Crawler

The web crawler here is written in Python 3. Python is a high-level programming language that supports object-oriented, imperative, and functional programming and ships with a large standard library.
The crawler uses two third-party libraries, requests and BeautifulSoup4, which are not part of the standard library and must be installed separately (for example with pip). requests provides an easy way to fetch pages over HTTP, and BeautifulSoup4 parses the returned HTML so that specific tags and attributes can be extracted.

Example Code

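import requests
from bs4 import BeautifulSoup


def web(page, WebUrl):
    # `page` is kept from the original signature; this single-page version does not use it.
    code = requests.get(WebUrl)   # fetch the page
    plain = code.text             # raw HTML as text
    s = BeautifulSoup(plain, "html.parser")
    # Visit every <a> tag whose class is "s-access-detail-page".
    for link in s.find_all('a', {'class': 's-access-detail-page'}):
        tet = link.get('title')    # product heading
        tet_2 = link.get('href')   # link to the product page
        print(tet)                 # print tet_2 as well to list the links

# Call web(1, url) with the URL of the listing page you want to crawl.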


C:\Python34\python.exe C:/Users/Babuya/PycharmProjects/Youtube/
Apple iPhone 6 (Gold, 32GB)
OnePlus 5 (Slate Gray 6GB RAM + 64GB memory)
OnePlus 5 (Midnight Black 8GB RAM + 128GB memory)
Apple iPhone 6 (Space Grey, 32GB)
OnePlus 5 (Soft Gold, 6GB RAM + 64GB memory)
Mi Max 2 (Black, 64 GB)
Moto G5 Plus (32GB, Fine Gold)
Apple iPhone SE (Space Grey, 32GB)
Honor 8 Pro (Blue, 6GB RAM + 128GB Memory)
Apple iPhone 7 (Black, 32GB)
Apple iPhone SE (Gold, 32GB)
Apple iPhone SE (Rose Gold, 32GB)
Apple iPhone 6s (Space Grey, 32GB)
Samsung Galaxy J7 Max (Gold, 32GB)
Honor 8 Pro (Black, 6GB RAM + 128GB Memory)
Samsung Galaxy J7 Max (Black, 32GB)
OnePlus 3T (Soft Gold, 6GB RAM + 64GB memory)
Apple iPhone 6s (Gold, 32GB)
Apple iPhone 6s (Rose Gold, 32GB)
Samsung Galaxy C7 Pro (Navy Blue, 64GB)
Samsung J7 Prime 32GB ( Gold ) 4G VoLTE
Vivo V5s (Matte Black) with Offers
Vivo V5s (Crown Gold) with Offers

This crawler collects the product headings, and the links to the corresponding product pages, from a single product listing page. The user only needs to specify what kind of data or links should be crawled. Although web crawlers are mainly used by search engines, they can also be used this way to collect other useful information.
First, the HTML of the page is fetched as plain text using requests. It is then converted into a BeautifulSoup object. From that object, the title and href attributes of every a tag with the class s-access-detail-page are read. That is how this basic web crawler works.
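
The extraction step can be tried without a network connection. The sketch below parses a small hand-written HTML snippet instead of a fetched page; the snippet, the example.com base URL and the extract_products helper name are assumptions made for illustration and do not reflect the real page's markup. It also resolves relative links with urllib.parse.urljoin, which the code above does not do.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up HTML standing in for a fetched listing page (an assumption,
# not the markup of any real site).
SAMPLE_HTML = """
<ul>
  <li><a class="s-access-detail-page" title="Phone A (Gold, 32GB)" href="/dp/1">Phone A</a></li>
  <li><a class="other-link" title="Ignored" href="/help">Help</a></li>
  <li><a class="s-access-detail-page" title="Phone B (Black, 64GB)" href="https://example.com/dp/2">Phone B</a></li>
</ul>
"""


def extract_products(html, base_url):
    """Return (title, absolute_url) pairs for links with the target class."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for link in soup.find_all("a", class_="s-access-detail-page"):
        title = link.get("title")
        href = link.get("href")
        if title and href:  # skip links missing either attribute
            products.append((title, urljoin(base_url, href)))
    return products


for title, url in extract_products(SAMPLE_HTML, "https://example.com/"):
    print(title, url)
```

Returning a list of pairs rather than printing inside the loop makes the result easy to save to a file or pass to another function.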

Top comments (3)

adamant11

In addition to scraping, you do need a proxy in most of the cases. Smartproxy seems to have the best cost to quality ratio at the moment. Are you covering your proxies?

glutonion

Any of the residential services does work alright for scraping. Have tried Smartproxy and Luminati - both are quality. However, Smartproxy is a lot cheaper, Luminati has a higher IP pool.

lukaszkuczynski

Great, scraping is so great with Python. Have you ever thought about using something like Scrapy?
