
Mirzokhid Mukhsidov


Web Scraper with Python (Beautiful Soup) & Deployment of it into Heroku [Part1]

A while ago I decided to create a web crawling project using Beautiful Soup (a Python library for pulling data out of HTML and XML files). Here is how I did it, the hurdles I faced during development, and how I overcame them.


We will use a virtual environment throughout development; here are the instructions on how to install it on Windows, and here are the reasons to use one.

To deactivate your virtual environment, simply type deactivate.

Then we will create a requirements.txt file that lists all the dependencies of our Python project.
Your requirements might differ depending on your case.

requirements.txt is important! I was too lazy to do this step on my first attempt. However, sooner or later you have to do it, at the very least in order to push the project to Heroku.
pip install -r requirements.txt is the command that installs everything listed in the file.
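
Before pushing anywhere, it can help to confirm that everything listed in requirements.txt is really installed in the active virtual environment. The snippet below is only a rough sketch of one way to check that with the standard library; the function name missing_packages and the sample requirement lines are made up for illustration and are not part of the project.

```python
import re
from importlib import metadata


def missing_packages(requirement_lines):
    """Return the package names from requirement lines that are not installed."""
    missing = []
    for raw_line in requirement_lines:
        line = raw_line.strip()
        # Skip blank lines, comments, and pip options such as "-r other.txt".
        if not line or line.startswith(("#", "-")):
            continue
        # Keep only the package name: drop version specifiers, extras, markers.
        name = re.split(r"[<>=!~;\[\s]", line, maxsplit=1)[0]
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Hypothetical requirement lines; in practice you would read the real file:
#   with open("requirements.txt", encoding="utf-8") as f:
#       print(missing_packages(f))
sample = ["# scraper dependencies", "", "surely-not-a-real-package-123==1.0"]
print(missing_packages(sample))
```

An empty list means every listed package was found in the current environment.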


Now let me show you how to write the code that actually scrapes a given website. In a nutshell, web scraping is extracting data from websites into a format of your choice (I wrote the data from "https://www.scrapethissite.com/pages/" to a CSV file).

website => csv
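
To make the csv end of that pipeline concrete, here is a minimal sketch using Python's built-in csv module. It assumes the scraped data has already been collected into a list of dictionaries; the rows, the column names, and the helper name write_csv are invented for illustration.

```python
import csv
import io


def write_csv(rows, fieldnames, out_file):
    """Write a list of dictionaries to an open file as CSV with a header row."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Made-up rows standing in for scraped data.
rows = [
    {"title": "Page one", "description": "First description"},
    {"title": "Page two", "description": "Second description"},
]

# An in-memory buffer keeps the example self-contained. For a real file, use
# open("pages.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_csv(rows, ["title", "description"], buffer)
print(buffer.getvalue())
```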



First, we make a web request using the Python requests library.

As you can see below, we print the content of the response, and it is the same HTML that you can view in Chrome with Ctrl+U, or by right-clicking the page and choosing View page source.

import requests
from bs4 import BeautifulSoup

link = "https://www.scrapethissite.com/pages/"
request = requests.get(link)
print(request.content)
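
In a longer-running scraper it is usually worth checking that the request actually succeeded before parsing anything. The following is a hedged sketch rather than the code from my project: the helper name fetch_html, the optional session parameter, and the 10 second timeout are all assumptions chosen for illustration.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Download a page and return its HTML as text, raising on HTTP errors."""
    # Reuse a caller-supplied session if there is one, otherwise create one.
    http = session if session is not None else requests.Session()
    # Without a timeout, a stalled server could make the request hang forever.
    response = http.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://www.scrapethissite.com/pages/") would then return the page source as a string, ready to hand to Beautiful Soup.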

To extract the actual data from the HTML tags, we will use the Beautiful Soup library.

Get the .text of the <title> tag:

import requests
from bs4 import BeautifulSoup

link = "https://www.scrapethissite.com/pages/"
request = requests.get(link)

soup = BeautifulSoup(request.content, "html5lib")
print(soup.title.text)
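
One thing to keep in mind is that soup.title is None when a page has no <title> tag, so soup.title.text would raise an AttributeError. Here is a small sketch that guards against that, run on inline HTML so it works offline. The helper name page_title is hypothetical, and it uses Python's built-in "html.parser" instead of "html5lib" only so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup


def page_title(html):
    """Return the stripped <title> text, or None when the page has no title."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


print(page_title("<html><head><title> Demo page </title></head></html>"))  # Demo page
print(page_title("<p>No title here</p>"))  # None
```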

Get the first hyperlink (<a> tag) with .a:

import requests
from bs4 import BeautifulSoup

link = "https://www.scrapethissite.com/pages/"
request = requests.get(link)

soup = BeautifulSoup(request.content, "html5lib")
print(soup.a)
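
Note that soup.a only returns the first <a> tag on the page. If you want the URL of every link, something along these lines should work. It is a sketch on made-up HTML: the helper name extract_links and the sample markup are assumptions, and urljoin from the standard library turns relative links into absolute ones.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the absolute URL of every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips <a> tags that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Made-up markup for illustration.
sample_html = """
<a href="/pages/simple/">Simple page</a>
<a name="no-link">Anchor without href</a>
<a href="https://example.com/about">About</a>
"""
print(extract_links(sample_html, "https://www.scrapethissite.com/pages/"))
```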

Find every matching tag with .find_all():

import requests
from bs4 import BeautifulSoup

link = "https://www.scrapethissite.com/pages/"
request = requests.get(link)

soup = BeautifulSoup(request.content, "html5lib")

for i in soup.find_all('h3'):
    print(i.text)
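
Headings on real pages often wrap a link and carry extra whitespace, so .text can print blank lines around the words. As a small sketch on made-up HTML, get_text(strip=True) trims that whitespace, and the limit argument of find_all stops after a given number of matches.

```python
from bs4 import BeautifulSoup

# Made-up markup; the first heading mimics a title wrapped in a link.
sample_html = """
<h3>
    <a href="/pages/simple/">First heading</a>
</h3>
<h3>Second heading</h3>
<h3>Third heading</h3>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# strip=True removes the surrounding newlines and indentation.
headings = [h3.get_text(strip=True) for h3 in soup.find_all("h3")]
# limit=2 returns only the first two matches.
first_two = [h3.get_text(strip=True) for h3 in soup.find_all("h3", limit=2)]

print(headings)   # ['First heading', 'Second heading', 'Third heading']
print(first_two)  # ['First heading', 'Second heading']
```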

You can even search by CSS class with .find_all(class_="class_name"):

import requests
from bs4 import BeautifulSoup

link = "https://www.scrapethissite.com/pages/"
request = requests.get(link)

soup = BeautifulSoup(request.content, "html5lib")

for i in soup.find_all(class_='class_name'):
    print(i.text)
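
Since class_name above is just a placeholder, here is a sketch with concrete (but made-up) markup. It shows that class_ matches any tag carrying that class, including tags with several classes, and that you can combine it with a tag name to narrow the search.

```python
from bs4 import BeautifulSoup

# Made-up markup; the class names are invented for illustration.
sample_html = """
<div class="page featured"><h3>Page A</h3></div>
<div class="page"><h3>Page B</h3></div>
<p class="page">Paragraph C</p>
<div class="footer">Footer</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Any tag whose class list contains "page".
any_tag = [tag.get_text(strip=True) for tag in soup.find_all(class_="page")]
# Only <div> tags whose class list contains "page".
only_divs = [tag.get_text(strip=True) for tag in soup.find_all("div", class_="page")]

print(any_tag)    # ['Page A', 'Page B', 'Paragraph C']
print(only_divs)  # ['Page A', 'Page B']
```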



The rule of thumb is to find the piece of data you want in the page source (via Ctrl+F in Chrome) and then extract it using whatever tag it sits in.
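
Applied to a whole page, that rule of thumb might look like the sketch below: find the repeating block that holds each item, then pull the fields out of the tags inside it. The markup, the class name "page", and the helper name extract_records are assumptions for illustration, so check the real page source before relying on them.

```python
from bs4 import BeautifulSoup


def extract_records(html):
    """Return one {"title", "description"} dict per <div class="page"> block."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for block in soup.find_all("div", class_="page"):
        # Search inside the block only, so fields stay paired with their item.
        title = block.find("h3")
        description = block.find("p")
        records.append({
            "title": title.get_text(strip=True) if title is not None else "",
            "description": (
                description.get_text(strip=True) if description is not None else ""
            ),
        })
    return records


# Made-up markup standing in for the page source.
sample_html = """
<div class="page"><h3><a href="/a/">Item A</a></h3><p>About item A</p></div>
<div class="page"><h3>Item B</h3></div>
"""
print(extract_records(sample_html))
```

Each dictionary in the result maps directly onto one row of a CSV file.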


There are many tags and methods in Beautiful Soup, and in my experience tutorials and posts often do not quite fit your case. Reading that in a post sounds like I am shooting myself in the foot, doesn't it? 😅 Do not get me wrong: posts and videos are by all means useful for getting a general idea of the topic. Nonetheless, if your situation is different, it is better to skim the docs so that you can tackle your problem with more suitable methods. Besides, by the time you read a post or watch a video tutorial, things (versions) are very likely to have changed. So what I suggest is to take what initially seems the hard way and read the documentation, rather than trying to cut corners and ending up frustrated with wasted time.

Furthermore, if you need to insert the scraped data into a database on your local machine, I would recommend a Real Python article.
In the next part we will see how I pushed the scraper to a Heroku server and how to build a database there.

You can find the source code on my GitHub page: https://github.com/Muxsidov/Scraper_Blog
