Basics of Scraping with Python 🐍

#python #webdev

Prologue

Hello! In this post I'm going to walk through writing a scraper script in Python with the help of the Beautiful Soup library.

Installing the dependencies

First of all, since Beautiful Soup is a third-party community project, you have to install it from PyPI with pip.

pip install beautifulsoup4

Philosophy of Beautiful Soup

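Beautiful Soup (BS) is a library that sits on top of an HTML/XML parser (in our case, the former). The parser turns the raw markup into a tree, and Beautiful Soup gives us a convenient way to navigate and search that tree.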
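To make that idea concrete, here is a minimal sketch that needs no network access. The html_doc string is a made-up document used only for illustration; 'html.parser' is the parser that ships with Python.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document (illustration only)
html_doc = "<html><body><h1>Hello</h1><p>First paragraph</p></body></html>"

# Beautiful Soup hands the markup to the parser and wraps the resulting tree
soup = BeautifulSoup(html_doc, "html.parser")

# Tags in the tree can be reached by name; the first match is returned
print(soup.h1.get_text())  # Hello
print(soup.p.name)         # p
```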

Basic Script

Now that we know how it works, let's write a tiny script:

from urllib.request import urlopen
from bs4 import BeautifulSoup


WEBSITE = "https://google.com"


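# Download the page; urlopen() returns a response object
html = urlopen(WEBSITE)
# Read the raw bytes and parse them with Python's built-in HTML parser
bs = BeautifulSoup(html.read(), 'html.parser')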

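In this example, we also make use of urllib.request from the standard library, which simply downloads the HTML for us.
urlopen() returns a response object, which we store in the html variable; calling html.read() gives us the google.com document, and we pass it to BeautifulSoup together with the name of the parser.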
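If you end up downloading more than one page, the two steps can be wrapped in a small helper. This is only a sketch: load_soup and its timeout default are names and values I made up for illustration, not something Beautiful Soup provides.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup


def load_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object.

    load_soup is a hypothetical helper; the 10 second timeout is an
    arbitrary default chosen for this sketch.
    """
    # The with-block closes the connection once the body has been read
    with urlopen(url, timeout=timeout) as response:
        raw_html = response.read()  # the document as bytes
    return BeautifulSoup(raw_html, "html.parser")
```

With this in place, bs = load_soup(WEBSITE) would replace the two lines from the script above.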

Parsing data

Sometimes, we want to get specific parts of a document, such as a paragraph or an image.

You can search for a specific HTML tag in BeautifulSoup with the find() method.

Let's scrape the Google logo tag from their homepage!
Add the following lines of code to the already existing file:

google_logo = bs.find('img', { 'id': 'hplogo' })
print(google_logo)

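These two lines of code will hopefully produce output like this (Google changes its markup from time to time; if no matching tag is found, find() returns None):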

<img 
alt="Google" 
height="92" 
id="hplogo" 
src="/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png"
style="padding:28px 0 14px" 
width="272"/>

So, how does this work?
Well, we are using the find() method and passing it some arguments.
To be exact, we are telling it to search for an <img> tag whose id attribute is 'hplogo'.
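Since the live Google page can change at any time, here is the same lookup as a small offline sketch. The sample_html string and its banner image are made up for illustration; only the find() calls mirror the script above.

```python
from bs4 import BeautifulSoup

# Made-up markup with two images, so the id filter has something to do
sample_html = """
<div>
  <img id="banner" src="/banner.png" alt="Banner">
  <img id="hplogo" src="/logo.png" alt="Google">
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Tag name plus a dictionary of attributes, as in the script above
logo = soup.find("img", {"id": "hplogo"})

# The same filter written as a keyword argument
same_logo = soup.find("img", id="hplogo")

# When nothing matches, find() returns None instead of raising an error
missing = soup.find("img", {"id": "does-not-exist"})

print(logo["src"])        # /logo.png
print(logo is same_logo)  # True
print(missing)            # None
```

Checking the result for None before using it is a cheap way to keep a scraper from crashing when a site changes its markup.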