DEV Community

Edward Pasenidis


Basics of Scraping with Python 🐍


Prologue

Hello, in this post I am going to describe the process of writing a scraper script in Python, with the help of the Beautiful Soup library.

Installing the dependencies

First of all, since Beautiful Soup is a 3rd-party community project, you have to install it via the PyPI registry.

pip install beautifulsoup4

Philosophy of Beautiful Soup

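Beautiful Soup is a library that sits atop an HTML or XML parser (in our case, the former). The parser reads the markup, and Beautiful Soup turns the result into a tree of Python objects that you can navigate and search.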
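
To make that idea concrete, here is a minimal sketch that parses a small HTML string instead of a live page. The SAMPLE_HTML document and the variable names are made up for illustration; only BeautifulSoup and the 'html.parser' backend come from this post.

```python
from bs4 import BeautifulSoup

# A tiny hypothetical document, used only for illustration
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p class="intro">Hello, <b>world</b>!</p>
  </body>
</html>
"""

# Hand the markup to the HTML parser that ships with Python
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Tags can be reached as attributes of the tree,
# and get_text() collects the text inside a tag
page_title = soup.title.string
intro_text = soup.p.get_text()

print(page_title)  # Sample page
print(intro_text)  # Hello, world!
```

Nothing is downloaded here, so it is a handy way to experiment before pointing the script at a real website.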

Basic Script

Now that we know how it works, let's write a tiny script:

from urllib.request import urlopen

from bs4 import BeautifulSoup

# The page we want to download
WEBSITE = "https://google.com"

# Download the page, then parse its HTML with Python's built-in parser
html = urlopen(WEBSITE)
bs = BeautifulSoup(html.read(), 'html.parser')

In this example, we also make use of urllib.request, a module from Python's standard library, which just downloads the HTML for us.
urlopen() returns a response object, which we store in the html variable. Calling html.read() gives us the raw google.com document, and BeautifulSoup parses it.
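
Network requests can fail, so a slightly more defensive version may be useful. The following is only a sketch: the helper names parse_document and fetch_soup, and the 10 second timeout, are my own assumptions and not part of Beautiful Soup or urllib.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def parse_document(raw_html):
    """Turn raw HTML (bytes or str) into a BeautifulSoup tree."""
    return BeautifulSoup(raw_html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download url and return the parsed tree, or None on a network error."""
    try:
        # The context manager closes the connection once we have the bytes
        with urlopen(url, timeout=timeout) as response:
            return parse_document(response.read())
    except OSError as error:
        # URLError, HTTPError and socket timeouts all derive from OSError
        print(f"Could not download {url}: {error}")
        return None
```

With these helpers, the script above could be reduced to bs = fetch_soup(WEBSITE), followed by a check that bs is not None before searching it.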

Parsing data

Sometimes, we want to get specific parts of a document, such as a paragraph or an image.

You can search for a specific HTML tag in BeautifulSoup with the find() method.

Let's scrape the Google logo tag from their homepage!
Add the following lines of code to the already existing file:

google_logo = bs.find('img', { 'id': 'hplogo' })
print(google_logo)

These two lines of code will hopefully produce this output:

<img 
alt="Google" 
height="92" 
id="hplogo" 
src="/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png"
style="padding:28px 0 14px" 
width="272"/>

So, how does this work?
Well, we are calling the find() method and passing it some arguments.
To be exact, we are telling it that we are searching for an <img> tag whose id attribute is 'hplogo'. find() returns the first matching tag, or None if the page contains no such tag.
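
Here is a small offline sketch of the same find() call, so you can see both outcomes without depending on Google's current markup. The SAMPLE_HTML snippet, its image paths and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
SAMPLE_HTML = """
<div>
  <img id="banner" src="/static/banner.png" alt="Banner">
  <img id="hplogo" src="/static/logo.png" alt="Logo">
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first tag that matches the name and the attribute filter
logo = soup.find("img", {"id": "hplogo"})

# find() returns None when nothing matches, so check before using the result
missing = soup.find("img", {"id": "does-not-exist"})

if logo is not None:
    # Attributes of a tag are read like dictionary keys
    print(logo["src"])  # /static/logo.png

print(missing)  # None
```

Checking for None matters on real websites, because their markup can change at any time and an id such as 'hplogo' may disappear.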

Epilogue

That's all.
To learn more about Beautiful Soup, read the official documentation.
