<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mirzokhid Mukhsidov</title>
    <description>The latest articles on DEV Community by Mirzokhid Mukhsidov (@muxsidov).</description>
    <link>https://dev.to/muxsidov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F891922%2F3523d89f-245c-4b61-9e87-1833d0720adf.jpg</url>
      <title>DEV Community: Mirzokhid Mukhsidov</title>
      <link>https://dev.to/muxsidov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/muxsidov"/>
    <language>en</language>
    <item>
      <title>Web Scraper with Python (Beautiful Soup) &amp; Deployment of it into Heroku [Part2]</title>
      <dc:creator>Mirzokhid Mukhsidov</dc:creator>
      <pubDate>Mon, 10 Oct 2022 14:08:04 +0000</pubDate>
      <link>https://dev.to/muxsidov/web-scraper-with-python-beautiful-soup-deployment-of-it-into-heroku-part2-125p</link>
      <guid>https://dev.to/muxsidov/web-scraper-with-python-beautiful-soup-deployment-of-it-into-heroku-part2-125p</guid>
      <description>&lt;p&gt;After writing the &lt;a href="https://dev.to/mirzokhid/a-web-scraper-with-python-beautiful-soup-deployment-of-it-into-heroku-part1-22kb"&gt;code portion&lt;/a&gt; of my project and testing it, I pushed it into the &lt;a href="https://www.heroku.com/home" rel="noopener noreferrer"&gt;Heroku&lt;/a&gt; server. Since running the program regularly manually might get tedious over time I scheduled it (a.k.a cron job) so it runs automatically at a given time (every day in my case). Turns out Heroku does not allow unverified users (&lt;a href="https://devcenter.heroku.com/articles/account-verification#when-is-verification-required?c=&amp;amp;utm_campaign=freedynolimits&amp;amp;utm_medium=telex&amp;amp;utm_source=nurture&amp;amp;utm_content=devcenter&amp;amp;utm_term=when-verify" rel="noopener noreferrer"&gt;here is&lt;/a&gt; how to verify your account) to use &lt;a href="https://devcenter.heroku.com/articles/add-ons?c=&amp;amp;utm_campaign=freedynolimits&amp;amp;utm_medium=telex&amp;amp;utm_source=nurture&amp;amp;utm_content=devcenter&amp;amp;utm_term=add-ons" rel="noopener noreferrer"&gt;add-ons&lt;/a&gt; so I scheduled it manually with the &lt;a href="https://schedule.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;python schedule&lt;/a&gt; module. Later on, after I verified my account with a credit card I was able to use the &lt;a href="https://devcenter.heroku.com/articles/scheduler" rel="noopener noreferrer"&gt;Heroku Scheduler&lt;/a&gt;. In this post we will go through both of the ways. However, first we have to connect PostgreSQL to your database in Python.&lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Connect Python to PostgreSQL&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://devcenter.heroku.com/articles/connecting-heroku-postgres#connecting-in-python" rel="noopener noreferrer"&gt;Connecting in Python&lt;/a&gt; describes how to connect to your Heroku Postgres database from Python. First, install the psycopg2 package:&lt;br&gt;
&lt;code&gt;pip install psycopg2-binary&lt;/code&gt;&lt;br&gt;
then connect using the DATABASE_URL config var:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import psycopg2

DATABASE_URL = os.environ['DATABASE_URL']

conn = psycopg2.connect(DATABASE_URL, sslmode='require')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
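On Heroku, DATABASE_URL is a postgres:// connection string. If you ever need its individual pieces (say, for a client that wants host, user, and password separately), the standard library can split it apart. A minimal sketch with a made-up example URL, not your real credentials:

```python
from urllib.parse import urlparse

# Made-up example of the postgres:// URL Heroku stores in DATABASE_URL
url = urlparse("postgres://user:secret@host.example.com:5432/mydb")

print(url.username)          # user
print(url.password)          # secret
print(url.hostname)          # host.example.com
print(url.port)              # 5432
print(url.path.lstrip("/"))  # mydb (the database name)
```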



&lt;p&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Scheduling with Python Schedule&lt;/strong&gt;&lt;br&gt;
The Python schedule module, as the name suggests, runs Python functions (or any other callable) periodically, using a friendly, human-readable syntax.&lt;/p&gt;

&lt;p&gt;We install it with the command:&lt;br&gt;
&lt;code&gt;$ pip install schedule&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Import schedule and time module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import schedule
import time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Define a function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def function_name():
    # ToDo

schedule.every(10).minutes.do(function_name)
schedule.every().hour.do(function_name)
schedule.every().day.at("10:30").do(function_name)
schedule.every().monday.do(function_name)
schedule.every().wednesday.at("13:15").do(function_name)
schedule.every().minute.at(":17").do(function_name)

while True:
    schedule.run_pending()
    time.sleep(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
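To make it clearer what run_pending() actually does: schedule keeps a list of jobs, each with a next-run timestamp, and runs every job whose timestamp has passed. A stripped-down, stdlib-only illustration of that idea (not the real library's code; the names here are invented):

```python
import time

class Job:
    """A toy job: runs a function every `interval` seconds."""
    def __init__(self, interval, func):
        self.interval = interval
        self.func = func
        self.next_run = time.time() + interval

    def should_run(self):
        return time.time() >= self.next_run

    def run(self):
        self.func()
        # push the next run one interval into the future
        self.next_run = time.time() + self.interval

jobs = []

def every(seconds, func):
    jobs.append(Job(seconds, func))

def run_pending():
    for job in jobs:
        if job.should_run():
            job.run()

# Demo: an interval of 0 makes the job due immediately
calls = []
every(0, lambda: calls.append("ran"))
run_pending()
print(calls)  # ['ran']
```

The real library adds the fluent every().day.at(...) syntax on top, but the pending-check loop is the same idea, which is why the while True / time.sleep(1) loop above is needed to keep it ticking.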



&lt;p&gt;Source: &lt;a href="https://schedule.readthedocs.io/en/stable/" rel="noopener noreferrer"&gt;https://schedule.readthedocs.io/en/stable/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=qquCAgwvL8Q" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=qquCAgwvL8Q&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Pushing the code into the Heroku Server&lt;/strong&gt;&lt;br&gt;
Heroku is quite a popular cloud platform. &lt;a href="https://devcenter.heroku.com/articles/getting-started-with-python?singlepage=true" rel="noopener noreferrer"&gt;Getting Started on Heroku with Python&lt;/a&gt; shows in detail how to install the Heroku CLI on your machine and push your project to the server using Git.&lt;br&gt;
Keep in mind that, unlike the tutorial above, our &lt;a href="https://devcenter.heroku.com/articles/procfile" rel="noopener noreferrer"&gt;Procfile&lt;/a&gt; must use the worker process type!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi802e2rkc5uh821ojmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdi802e2rkc5uh821ojmx.png" alt="Procfile" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;
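For reference, a Procfile for a scheduled scraper declares a worker process rather than a web one. Assuming the entry point is a file named scraper.py (adjust the name to your project), it can be a single line:

```
worker: python scraper.py
```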

&lt;p&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Scheduling with Heroku Scheduler&lt;/strong&gt;&lt;br&gt;
For a free dyno, Heroku gives you 550 hours per month (&lt;a href="https://www.heroku.com/dynos" rel="noopener noreferrer"&gt;read more about dynos&lt;/a&gt;), plus another 450 hours if you verify your account. &lt;br&gt;
Running the Python Schedule loop on Heroku keeps a worker dyno awake around the clock, which eats through those free dyno hours quickly. &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzaf1ke76z3b2zoq80zcl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzaf1ke76z3b2zoq80zcl.png" alt="heroku ps" width="800" height="166"&gt;&lt;/a&gt;&lt;br&gt;
This is why we will take advantage of the &lt;a href="https://devcenter.heroku.com/articles/scheduler" rel="noopener noreferrer"&gt;Heroku Scheduler&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Go to the "Resources" section of your app&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F322z6x2ocz0obb09zy2r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F322z6x2ocz0obb09zy2r.png" alt="Recources" width="800" height="374"&gt;&lt;/a&gt;&lt;br&gt;
Find Heroku Scheduler and add it&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftm30al2n73muq9at5b38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftm30al2n73muq9at5b38.png" alt="Search Heroku Scheduler" width="800" height="440"&gt;&lt;/a&gt;&lt;br&gt;
Click on the Heroku Scheduler add-on&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgj2sbzcxsze2nfk2ddr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgj2sbzcxsze2nfk2ddr.png" alt="Click into Heroku" width="800" height="449"&gt;&lt;/a&gt;&lt;br&gt;
Create a job with a suitable schedule and save it&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xsg28scfz975pmgqwjp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xsg28scfz975pmgqwjp.png" alt="Create a job" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, you can check your work with&lt;br&gt;
&lt;code&gt;heroku logs --tail&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Disclaimer!&lt;/strong&gt;&lt;br&gt;
Starting November 28th, 2022, free Heroku Dynos, free Heroku Postgres, and free Heroku Data for Redis will no longer be available.&lt;br&gt;
More information&lt;br&gt;
&lt;a href="https://blog.heroku.com/next-chapter" rel="noopener noreferrer"&gt;https://blog.heroku.com/next-chapter&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>python</category>
      <category>heroku</category>
    </item>
    <item>
      <title>Web Scraper with Python (Beautiful Soup) &amp; Deployment of it into Heroku [Part1]</title>
      <dc:creator>Mirzokhid Mukhsidov</dc:creator>
      <pubDate>Thu, 01 Sep 2022 18:38:17 +0000</pubDate>
      <link>https://dev.to/muxsidov/a-web-scraper-with-python-beautiful-soup-deployment-of-it-into-heroku-part1-22kb</link>
      <guid>https://dev.to/muxsidov/a-web-scraper-with-python-beautiful-soup-deployment-of-it-into-heroku-part1-22kb</guid>
      <description>&lt;p&gt;A while ago I decided to create a web crawling project using &lt;a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="noopener noreferrer"&gt;Beautiful Soup&lt;/a&gt; (a Python library for pulling data out of HTML and XML files). Here is how I did it, hurdles I faced during development and how I overcame them. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrxmto0ixmnbuoje90j0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrxmto0ixmnbuoje90j0.jpg" alt="Meme" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
We will use a &lt;a href="https://docs.python.org/3/library/venv.html#:~:text=A%20virtual%20environment%20is%20a,part%20of%20your%20operating%20system." rel="noopener noreferrer"&gt;virtual environment&lt;/a&gt; throughout development; &lt;a href="https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/" rel="noopener noreferrer"&gt;here are&lt;/a&gt; the instructions for setting one up on Windows, and &lt;a href="https://realpython.com/python-virtual-environments-a-primer/#why-do-you-need-virtual-environments" rel="noopener noreferrer"&gt;here are&lt;/a&gt; the reasons why.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwk41wgqskzkh22q26r5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmwk41wgqskzkh22q26r5.png" alt="Install virual environment" width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To deactivate your virtual environment, simply type &lt;code&gt;deactivate&lt;/code&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dhii4555cc2t5txo8xb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5dhii4555cc2t5txo8xb.png" alt="To deactivate your virtual Environment simply type deactivate" width="800" height="15"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then we will create a &lt;code&gt;requirements.txt&lt;/code&gt; file for listing all the dependencies for our Python project.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F248i80vi00647u4ib178.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F248i80vi00647u4ib178.png" alt="requests.txt" width="800" height="185"&gt;&lt;/a&gt;Your requirements might differ depending on your case.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;requirements.txt&lt;/code&gt; is &lt;strong&gt;important!&lt;/strong&gt; I was too lazy to do this step on my first attempt... However, sooner or later you have to do it, if only to push the project to Heroku.&lt;br&gt;
&lt;code&gt;pip install -r requirements.txt&lt;/code&gt; is the command that installs the listed requirements.&lt;br&gt;
&lt;br&gt;&lt;br&gt;
Now let me show you how to write code that actually scrapes the given website. In a nutshell, web scraping is extracting data from websites into the form of your choice (I scraped "&lt;a href="https://www.scrapethissite.com/pages/" rel="noopener noreferrer"&gt;https://www.scrapethissite.com/pages/&lt;/a&gt;" into a CSV file).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w3jvakc61owq84hskg9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8w3jvakc61owq84hskg9.png" alt="website =&amp;gt; csv" width="800" height="286"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdam9pconadel2s30eyo3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdam9pconadel2s30eyo3.png" alt="web scrapping" width="530" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
First we make web requests using the Python &lt;a href="https://pypi.org/project/requests/" rel="noopener noreferrer"&gt;requests&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;As you can see below, printing the content of the response gives the same HTML page you can view in Chrome with Ctrl+U, or by right-clicking and choosing View page source.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzcotj0y3x0dv6zfkuy8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqzcotj0y3x0dv6zfkuy8.png" alt="Way to source code" width="800" height="447"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup

link = "https://www.scrapethissite.com/pages/"
request = requests.get(link)
print(request.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid0stvjs5fv67r9g26qe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid0stvjs5fv67r9g26qe.png" alt="Source Code in Terminal" width="800" height="410"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql2wahw2glvxuem1cvcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql2wahw2glvxuem1cvcb.png" alt="Source Code Chrome" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To extract the actual data from the HTML tags, we turn to the Beautiful Soup library.&lt;/p&gt;

&lt;p&gt;Get &lt;code&gt;.text&lt;/code&gt; from &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; tag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup

link = "https://www.scrapethissite.com/pages/"
request = requests.get(link)

soup = BeautifulSoup(request.content, "html5lib")
print(soup.title.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fca20j4089pnbsxnbct6r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fca20j4089pnbsxnbct6r.png" alt="Out put of soup.title.text" width="800" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Extract the first hyperlink with the &lt;code&gt;.a&lt;/code&gt; attribute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup

link = "https://www.scrapethissite.com/pages/"
request = requests.get(link)

soup = BeautifulSoup(request.content, "html5lib")
print(soup.a)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Find every occurrence of a tag with &lt;code&gt;.find_all()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup

link = "https://www.scrapethissite.com/pages/"
request = requests.get(link)

soup = BeautifulSoup(request.content, "html5lib")

for i in soup.find_all('h3'):
    print(i.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97k07j5atxos0gyj6rm2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F97k07j5atxos0gyj6rm2.png" alt=".find_all" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can even search by CSS class with &lt;code&gt;.find_all(class_="class_name")&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup

link = "https://www.scrapethissite.com/pages/"
request = requests.get(link)

soup = BeautifulSoup(request.content, "html5lib")

for i in soup.find_all(class_='class_name'):
    print(i.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0vp2vk9m9guo3cfs9sv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0vp2vk9m9guo3cfs9sv.png" alt="class_=''" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;
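Once the tags are extracted, getting them into a CSV file is just the standard library's csv module. A minimal sketch, with made-up column names and rows standing in for whatever your .find_all() calls returned:

```python
import csv

# Made-up rows standing in for the scraped data
rows = [
    {"title": "Countries of the World", "lesson": "A simple example"},
    {"title": "Hockey Teams", "lesson": "Forms and searching"},
]

# newline="" prevents blank lines between rows on Windows
with open("scraped.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "lesson"])
    writer.writeheader()
    writer.writerows(rows)

print(open("scraped.csv").read().splitlines()[0])  # title,lesson
```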

&lt;p&gt;&lt;br&gt;&lt;br&gt;
The rule of thumb here is to locate the piece of data in the source code of the website (via Ctrl+F in Chrome) and extract it using whatever tag it sits in.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdegun0g03nofqkha4xg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdegun0g03nofqkha4xg.png" alt="Ctrl + F" width="800" height="251"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;br&gt;
There are many tags on &lt;a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/#" rel="noopener noreferrer"&gt;Beautiful Soup&lt;/a&gt;, and in my experience tutorials and posts are often not a perfect fit for your case. Reading that in a post sounds like I am shooting myself in the foot, doesn't it? 😅 Do not get me wrong, posts and videos are by all means useful for getting a general idea of the topic. Nonetheless, if you are working on a different situation, it is better to skim the docs so you can tackle your problem with more suitable methods. Besides, by the time you watch or read a tutorial, things (versions) are very likely to have changed. So I would suggest taking what initially seems the hard way and reading the documentation, rather than trying to cut corners and ending up frustrated with wasted time.&lt;br&gt;&lt;br&gt;
Furthermore, if you need to insert the scraped data into a database on your local machine, I would recommend this &lt;a href="https://realpython.com/python-sql-libraries/#understanding-the-database-schema" rel="noopener noreferrer"&gt;Real Python&lt;/a&gt; article. &lt;br&gt;
In the next part we will see how I pushed the scraper to the Heroku server and how to build a database there.&lt;/p&gt;

&lt;p&gt;You can find the source code on my GitHub page: &lt;a href="https://github.com/Muxsidov/Scraper_Blog" rel="noopener noreferrer"&gt;https://github.com/Muxsidov/Scraper_Blog&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webscrapping</category>
      <category>python</category>
      <category>beautifulsoup</category>
    </item>
  </channel>
</rss>
