Sachin

Posted on Apr 10, 2023 • Originally published at geekpython.in

Scraping Webpage Using BeautifulSoup In Python

The Internet is filled with digital data that we might need for research or personal interest. To get this data, we are going to need some web scraping skills.

Python has powerful tools that make web scraping easy and effective, even on large amounts of data.

In this tutorial, we are going to use the requests and beautifulsoup4 libraries, two third-party packages for Python.

What is web scraping?

Web scraping, or web data extraction, is the process of gathering information from the Internet. It can be as simple as copying and pasting data from specific websites, or as advanced as automated data collection from websites that have real-time data.

Some websites don't mind having their data extracted, while others strictly prohibit it.

If you are scraping websites for educational purposes, you are unlikely to have any problems, but if you are starting a large-scale project, be sure to check the website's Terms of Service.

Why do we need it?

Not all websites have APIs to fetch content, so sometimes the only option left is to scrape the content.

Steps for web scraping

1. Inspecting the source of data
2. Getting the HTML content
3. Parsing the HTML with Beautiful Soup

Now let's move ahead and install the dependencies we'll need for this tutorial.

Installing the dependencies

We are going to install the requests library, which helps us get the HTML content of the website, and beautifulsoup4, which parses the HTML.

pip install requests beautifulsoup4

Scraping the website

We are going to scrape the Wikipedia article on Python Programming Language. This webpage contains almost all types of HTML tags which will be good for us to test all aspects of BeautifulSoup.

1. Inspecting the source of data

Before writing any Python code, take a good look at the website you are going to scrape.

You need to understand the structure of the website to extract the information relevant to your project.

Go through the website thoroughly, perform basic actions, understand how the website works, and check the URLs, routes, query parameters, etc.

Inspecting the webpage using Developer Tools

Now, it's time to inspect the DOM (Document Object Model) of the website using Developer Tools.

Developer Tools help us understand the structure of a website. They can do a range of things, from inspecting the loaded HTML, CSS, and JavaScript to showing which assets the page has requested and how long they took to load. All modern browsers come with Developer Tools installed.

To open Developer Tools in Chrome on Windows, right-click on the webpage and click the Inspect option, or use the following keyboard shortcut -

Ctrl + Shift + I

On macOS, the shortcut is -

⌘ + ⌥ + I

Now it's time to look at the DOM of the webpage that we are going to scrape.

In Developer Tools, the HTML shown on the right represents the structure of the page that we see on the left.

2. Get the HTML content

We need the requests library, which we have already installed, to fetch the HTML content of the website.

Next, open up your favorite IDE or Code Editor and retrieve the site's HTML in just a few lines of Python code.

import requests

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Step 1: Get the HTML
r = requests.get(url)
htmlContent = r.content

# Getting the content as raw bytes
print(htmlContent)

# Getting the content decoded to text (str)
print(r.text)

If we print the r.text we'll get the same output as the HTML we inspected earlier with the browser's developer tools. Now we have access to the site's HTML in our Python script.
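The request above assumes that the server answers quickly and successfully. As a minimal sketch, a slightly more defensive version could set a timeout, send a descriptive User-Agent, and raise an error on a failed response. The helper names build_session and fetch_html, the 10-second timeout, and the User-Agent string are illustrative assumptions, not part of the original code.

```python
import requests


def build_session(user_agent):
    """Create a session that sends a descriptive User-Agent header."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Return the decoded HTML of url, raising on HTTP errors."""
    response = session.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text


# Hypothetical User-Agent value; replace it with your own project details
session = build_session("my-scraper/0.1 (learning project)")

# Uncomment to perform the actual request:
# html = fetch_html(session, "https://en.wikipedia.org/wiki/Python_(programming_language)")
```

Reusing one session for several requests also keeps the same headers and connection for every page you fetch.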

Now let's parse the HTML using Beautiful Soup.

3. Parse the HTML with Beautifulsoup

We have successfully fetched the HTML of the website, but there is a problem: the response is long, with HTML elements, attributes, and tags scattered all over it. We need to parse that lengthy response with Python code to make it more readable and accessible.

Beautiful Soup helps us parse this structured data. It is a Python library for pulling data out of HTML and XML files.

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Step 1: Get the HTML
r = requests.get(url)
content = r.content

# Step 2: Parse the HTML
soup = BeautifulSoup(content, 'html.parser')
print(soup)

Here we added some lines to our previous code. We added an import statement for Beautiful Soup and then created a BeautifulSoup object from content, which holds the value of r.content.

The second argument we passed to BeautifulSoup is html.parser, the parser that comes with Python's standard library. Beautiful Soup also supports third-party parsers such as lxml and html5lib, so choose the parser that suits your HTML content.
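To see what the parser does without touching the network, here is a small sketch that parses a hand-written snippet. The snippet itself is made up for illustration, and the example only uses html.parser, which needs no extra installation.

```python
from bs4 import BeautifulSoup

# A made-up snippet with unclosed <p> and <b> tags
broken_html = "<p>Hello <b>world"

soup = BeautifulSoup(broken_html, "html.parser")

# The parser builds a tree, so nested tags can be reached as attributes
bold_text = soup.p.b.get_text()
print(bold_text)

# BeautifulSoup accepts bytes as well as str, which is why r.content works
soup_from_bytes = BeautifulSoup(broken_html.encode("utf-8"), "html.parser")
paragraph_text = soup_from_bytes.p.get_text()
print(paragraph_text)
```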

Find elements by ID

Elements in an HTML webpage can have an id attribute assigned to them, which makes an element uniquely identifiable on the page.

Beautiful Soup allows us to find a specific HTML element by its ID.

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

r = requests.get(url)
content = r.content

soup = BeautifulSoup(content, 'html.parser')

id_content = soup.find(id="firstHeading")

We can call .prettify() on any Beautiful Soup object to format the HTML for easier viewing. Here we call .prettify() on the id_content variable from above.

print(id_content.prettify())

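Note: We cannot call .prettify() directly on the result of .find_all(), because it returns a list of elements. Call .prettify() on each element in the list instead.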
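The following sketch shows both points on a small hand-written page, so it runs without a network request. The HTML and the id does-not-exist are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 id="firstHeading">Python</h1>
  <p>First paragraph</p>
  <p>Second paragraph</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

heading = soup.find(id="firstHeading")
print(heading.get_text())

# .find() returns None when nothing matches, so check before using the result
missing = soup.find(id="does-not-exist")
if missing is None:
    print("No element with that id")

# .find_all() returns a list-like ResultSet; prettify each element separately
pretty_paragraphs = [p.prettify() for p in soup.find_all("p")]
for pretty in pretty_paragraphs:
    print(pretty)
```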

Find elements by Tag

An HTML webpage contains lots of tags, and we might want the data that resides in those tags. For example, we might want the hyperlinks in the "a" (anchor) tags or the description from the "p" (paragraph) tags.

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

r = requests.get(url)
content = r.content

soup = BeautifulSoup(content, 'html.parser')

# Getting the first <code> tag
find_tag = soup.find("code")
print(find_tag.prettify())

# Getting all the <pre> tags
all_pre_tag = soup.find_all("pre")

for pre_tag in all_pre_tag:
    print(pre_tag)
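As a small sketch on made-up HTML, here are a few variations of searching by tag name: the first match only, a limited number of matches, and several tag names at once.

```python
from bs4 import BeautifulSoup

html = "<div><code>a = 1</code><pre>x</pre><pre>y</pre><pre>z</pre><code>b = 2</code></div>"
soup = BeautifulSoup(html, "html.parser")

# .find() returns only the first matching tag
first_code = soup.find("code")

# limit stops the search after the given number of matches
two_pre = soup.find_all("pre", limit=2)

# A list of names matches any tag whose name is in the list
code_or_pre = soup.find_all(["code", "pre"])

print(first_code.get_text())
print([tag.get_text() for tag in two_pre])
print([tag.name for tag in code_or_pre])
```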

Find elements by HTML Class Name

We can see hundreds of elements like <div> or <a> with classes in an HTML webpage, and through these classes we can access the content inside a specific element.

Beautiful Soup provides a class_ argument to find the content present inside an element with a specified class name.

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

r = requests.get(url)
content = r.content

soup = BeautifulSoup(content, 'html.parser')

# Getting the "div" element with class name "mw-highlight"
class_elem = soup.find("div", class_="mw-highlight")
print(class_elem.prettify())

The first argument we passed to the .find() method is the element name and the second argument is the class name.
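Elements often carry more than one class. The sketch below uses made-up HTML (the second class name mw-content-ltr is only an example) to show that class_ matches a single class among several, and that the same search can be written as a CSS selector with .select().

```python
from bs4 import BeautifulSoup

html = """
<div class="mw-highlight mw-content-ltr"><pre>print("hi")</pre></div>
<div class="note">plain note</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches one class even when the element carries several
highlight = soup.find("div", class_="mw-highlight")
print(highlight["class"])

# The same search written as a CSS selector
selected = soup.select("div.mw-highlight")
print(len(selected))
```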

Find elements by Text Content and Class name

Beautiful Soup provides a string argument that allows us to search for a string instead of a tag. We can pass in a string, a regular expression, a list, a function, or the value True.

# Getting all the strings whose value is "Python"
find_str = soup.find_all(string="Python")
print(find_str)

.........
['Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python', 'Python']

We can also find the tags whose value matches the specified value for the string argument.

find_str_tag = soup.find_all("p", string="Python")

Here we are looking for <p> tags whose value is "Python". But if we move ahead and try to print the result, we'll get an empty result.

print(find_str_tag)
.........
[]

This is because when we use string=, Beautiful Soup looks for exactly the value we provide. Any extra text, whitespace, difference in spelling, or capitalization will prevent the element from matching. When combined with a tag name, the value is compared against the tag's .string, which is None when the tag contains more than one child.

If we provide the exact value, the search succeeds.

find_str_tag = soup.find_all("span", string="Typing")
print(find_str_tag)

.........
[<span class="toctext">Typing</span>, <span class="mw-headline" id="Typing">Typing</span>]
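When an exact match is too strict, a regular expression can be passed to string instead. This is a minimal sketch on made-up HTML that compares an exact search with a pattern search.

```python
import re
from bs4 import BeautifulSoup

html = "<p>Python is great</p><p><b>Python</b> rocks</p><span>Typing</span>"
soup = BeautifulSoup(html, "html.parser")

# Exact match: only strings that are exactly "Python"
exact = soup.find_all(string="Python")

# Exact match limited to <p> tags: no <p> consists of exactly "Python"
exact_in_p = soup.find_all("p", string="Python")

# Pattern match: any string that contains "Python"
partial = soup.find_all(string=re.compile("Python"))

# Pattern match that ignores capitalization
ignore_case = soup.find_all(string=re.compile("python", re.IGNORECASE))

print(exact)
print(exact_in_p)
print(partial)
print(ignore_case)
```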

Passing a Function

In the section above, searching for <p> tags containing the string "Python" returned an empty list.

But Beautiful Soup also allows us to pass a function as a filter. The function receives each tag and returns True if the tag should be kept, so we can check the full text of a tag instead of requiring an exact match.

# Creating a function that is called once for every tag
def has_python(tag):
    return tag.name == "p" and "Python" in tag.get_text()

find_str_tag = soup.find_all(has_python)
print(len(find_str_tag))

Here we created a function called has_python that takes a tag as an argument and returns True when the tag is a <p> tag whose text contains "Python".

Next, we passed the function itself, without calling it, to .find_all(). Then we printed the number of <p> tags that mention "Python".
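A function can be used in two places: as the value of string, or as the only filter. The sketch below, on made-up HTML with hypothetical function names, shows the difference. With string=, the check is made against the tag's .string, which can be None, so the function guards against that.

```python
from bs4 import BeautifulSoup

html = "<p>Python is great</p><p><b>Python</b> rocks</p><p>No match here</p>"
soup = BeautifulSoup(html, "html.parser")


def string_has_python(text):
    # Used with string=: guard against None, since a tag's .string can be None
    return text is not None and "Python" in text


def tag_has_python(tag):
    # Used as the only filter: the function receives each tag itself
    return tag.name == "p" and "Python" in tag.get_text()


by_string = soup.find_all("p", string=string_has_python)
by_tag = soup.find_all(tag_has_python)

print(len(by_string))
print(len(by_tag))
```

The tag-based function finds the second paragraph as well, because .get_text() includes the text of nested tags such as <b>.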

Extract Text from HTML elements

What if we do not want the content with the HTML tags attached? What if we want clean, simple text data from the elements and tags?

We can use .text or .get_text() to return only the text content of the HTML elements we found with Beautiful Soup.

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

r = requests.get(url)
content = r.content

soup = BeautifulSoup(content, 'html.parser')

table_elements = soup.find_all("table", class_="wikitable")

for table_data in table_elements:
    table_body = table_data.find("tbody")

    print(table_body.text) # or

    print(table_body.get_text())

We'll get the whole table as output in text format. But there will be a lot of whitespace around the text, so we need to strip it by using the .strip() method.

print(table_body.text.strip())

There are other ways to remove whitespace as well, such as passing strip=True to .get_text().
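As a sketch of one such way, the example below extracts text from a small made-up table. It uses the separator and strip arguments of .get_text(), and then collects the cells row by row so the table structure is not lost.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable">
  <tr><th> Version </th><th> Year </th></tr>
  <tr><td> 3.0 </td><td> 2008 </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

# strip=True trims every text fragment; the separator keeps cells apart
flat_text = table.get_text(" ", strip=True)
print(flat_text)

# One list per row keeps the table structure
rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in table.find_all("tr")
]
print(rows)
```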

Extract Attributes from HTML elements

An HTML page has numerous attributes like href, src, style, title, and more. Since an HTML webpage contains a large number of <a> tags with href attributes, we are going to scrape all the href attributes present on our page.

We cannot scrape attributes the way we extracted text in the examples above. Instead, we read them from the tag with .get().

# Accessing href in the main content of the HTML page
anchor_in_body_content = soup.find(id="bodyContent")

# Finding all the anchor tags
anchors = anchor_in_body_content.find_all("a")

# Looping over all the anchor tags to get the href attribute
for link in anchors:
    links = link.get('href')
    print(links)

We simply looped over all the <a> tags in the main content of the HTML page and then used .get('href') to get each href attribute.

You can do the same for src attributes.

# Accessing src in body of the HTML page
img_in_body_content = soup.find(id="bodyContent")

# Finding all the img tags
media = img_in_body_content.find_all("img")

# Looping over all the img tags to get the src attribute
for img in media:
    images = img.get('src')
    print(images)
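Many of the scraped href and src values are relative, and some tags have no such attribute at all. As a minimal sketch on made-up HTML, the example below skips tags without the attribute and turns relative values into absolute URLs with urljoin from the standard library.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
html = """
<div id="bodyContent">
  <a href="/wiki/Guido_van_Rossum">Guido van Rossum</a>
  <a name="top">anchor without href</a>
  <img src="//upload.wikimedia.org/example.png" alt="example">
</div>
"""
soup = BeautifulSoup(html, "html.parser")
body = soup.find(id="bodyContent")

# .get() returns None for a missing attribute; tag["href"] would raise KeyError
hrefs = [a.get("href") for a in body.find_all("a")]
print(hrefs)

# href=True and src=True keep only the tags that actually have the attribute
absolute_links = [urljoin(base_url, a["href"]) for a in body.find_all("a", href=True)]
absolute_images = [urljoin(base_url, img["src"]) for img in body.find_all("img", src=True)]
print(absolute_links)
print(absolute_images)
```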

Access Parent and Sibling elements

Beautiful Soup allows us to access an element's parent through the .parent attribute.

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

r = requests.get(url)
content = r.content

soup = BeautifulSoup(content, 'html.parser')

id_content = soup.find(id="cite_ref-123")

parent_elem = id_content.parent
print(parent_elem)

We can also find the grandparent or great-grandparent of a specific element by chaining .parent.

id_content = soup.find(id="cite_ref-123")

grandparent_elem = id_content.parent.parent
print(grandparent_elem)

Beautiful Soup also provides the .parents attribute, which lets us iterate over all of an element's parents.

id_content = soup.find(id="cite_ref-123")

for elem in id_content.parents:
    print(elem) # to print the elements

    print(elem.name) # to print only the names of elements

Note: This program might take a little time to complete so wait until the program is finished.

Output for elem.name would be

p
div
div
div
div
body
html
[document]

Similarly we can access an element's next and previous siblings by using .next_sibling and .previous_sibling respectively.

id_content = soup.find(id="cite_ref-123")

# To print the next sibling of an element
next_sibling_elem = id_content.next_sibling

print(next_sibling_elem)

id_content = soup.find(id="cite_ref-123")

# To print the previous sibling of an element
previous_sibling_elem = id_content.previous_sibling

print(previous_sibling_elem)

We can also iterate over a tag's siblings using .next_siblings or .previous_siblings.

Iterating over all the next siblings

id_content = soup.find(id="cite_ref-123")

for next_elem in id_content.next_siblings:
    print(next_elem)

Iterating over all the previous siblings

id_content = soup.find(id="cite_ref-123")

for previous_elem in id_content.previous_siblings:
    print(previous_elem)
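One thing to watch for when navigating siblings is that text between tags, including a plain newline, counts as a sibling too. The sketch below uses made-up HTML to show this, along with .find_next_sibling(), which skips over text and returns the next matching tag.

```python
from bs4 import BeautifulSoup, Tag

html = '<div id="box"><p id="a">one</p>\n<p id="b">two</p>\n<p id="c">three</p></div>'
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="a")

# .parents walks up to the BeautifulSoup object, whose name is "[document]"
parent_names = [parent.name for parent in first.parents]
print(parent_names)

# The direct next sibling is the newline between the tags, not the next <p>
raw_sibling = first.next_sibling
print(repr(raw_sibling))

# find_next_sibling() skips text nodes and returns the next matching tag
next_p = first.find_next_sibling("p")
print(next_p)

# Filtering .next_siblings by type keeps only the tags
sibling_ids = [sib["id"] for sib in first.next_siblings if isinstance(sib, Tag)]
print(sibling_ids)
```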

Using Regular Expression

Last but not least, we can use regular expressions to search for elements, tags, text, etc., in the HTML tree.

This code will find all the tags whose names start with p inside the HTML element having id=bodyContent.

import requests
from bs4 import BeautifulSoup
import re

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

r = requests.get(url)
content = r.content

soup = BeautifulSoup(content, 'html.parser')

id_content = soup.find(id="bodyContent")

for tag in id_content.find_all(re.compile("^p")):
    print(tag.name)

The \w pattern matches any alphanumeric character, which means a-z, A-Z, and 0-9, as well as the underscore, _. Beautiful Soup looks for the pattern anywhere in the tag name, and every tag name contains at least one such character, so this code returns all the tags inside the element we searched.

id_content = soup.find(id="bodyContent")

for tag in id_content.find_all(re.compile(r"\w")):
    print(tag.name)
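To make the difference between these patterns concrete, here is a small sketch on made-up HTML. It compares an anchored pattern, an unanchored pattern, the \w pattern, and passing True, which also matches every tag.

```python
import re
from bs4 import BeautifulSoup

html = "<div><p>a</p><pre>b</pre><span>c</span></div>"
soup = BeautifulSoup(html, "html.parser")

# "^p" only matches names that start with p
starts_with_p = [tag.name for tag in soup.find_all(re.compile("^p"))]

# "p" matches names that contain p anywhere, so span is included
contains_p = [tag.name for tag in soup.find_all(re.compile("p"))]

# r"\w" matches every tag name; True does the same without a pattern
any_word_char = [tag.name for tag in soup.find_all(re.compile(r"\w"))]
all_tags = [tag.name for tag in soup.find_all(True)]

print(starts_with_p)
print(contains_p)
print(any_word_char)
print(all_tags)
```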

Conclusion

Well, we learned how to scrape a static website. Things can be different for dynamic websites, which return different data on different requests, or for websites hidden behind authentication. There are more powerful scraping tools available for these types of websites, like Selenium, Scrapy, etc.

The requests library gives us access to the site's HTML, and Beautiful Soup then helps us pull the data out of that HTML.

There are many more methods and functions that we haven't seen, but we discussed the key ones that are used most commonly.

That's all for now

Keep Coding✌✌