Using Beautiful Soup For Web Scraping

#python #webdev

Beautiful Soup is a Python library for parsing and navigating HTML and XML content. It does not fetch pages over HTTP (use requests or urllib3 for that); you load the HTML into Beautiful Soup yourself before you can process it.

If you have ever used lxml or html5lib to parse HTML, Beautiful Soup can use either of them as its underlying parser (or Python's built-in html.parser) and puts a friendlier interface on top, similar to how requests is built on urllib3.

The package is imported as bs4, and it provides a class called BeautifulSoup that represents a parsed HTML document. Its constructor takes the HTML as a string (or an open file handle), followed by the name of the parser to use.

# html_doc is a string of HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

To print the HTML on screen, it helps to format it with indentation first by calling the soup's prettify() method.

>>> print(soup.prettify())
<html>
 <head>
  <title>
# ...
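Here is a minimal, self-contained sketch of the steps so far. The html_doc string below is a tiny made-up document used only for illustration (it is not the sample from the Beautiful Soup docs), and it assumes the package is installed, for example with pip install beautifulsoup4.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in document, kept tiny on purpose.
html_doc = "<html><head><title>Hi</title></head><body><p>Text</p></body></html>"

# 'html.parser' is Python's built-in parser, so no extra dependency is needed.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns a string with one tag per line, indented by nesting depth.
pretty = soup.prettify()
print(pretty)
```

Since prettify() only returns a string, the soup object itself is left unchanged and can still be searched afterwards.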

Check out the sample HTML file listed at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. It is simple enough to demonstrate extracting elements in an easy-to-understand way.

The BeautifulSoup object lets you reach tags as attributes, such as soup.title, soup.p, soup.a and soup.b, and even soup.head and soup.body; each of these gives you the first tag with that name. To collect every tag with a given name, call the find_all() method and pass the tag name as a string without angle brackets; it returns all matches across the HTML in a list. You can also instruct find_all() to search by CSS class by passing the class_ keyword argument. (The word "class" is a reserved keyword in Python, so the argument had to be named class_.)
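As a rough sketch of tag access and find_all(), the snippet below uses a shortened HTML string in the spirit of the documentation's sample. The exact markup, the example.com URLs and the variable names are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample document; the markup is assumed for illustration.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access returns only the first matching tag.
first_paragraph = soup.p

# find_all() returns every matching tag in document order.
all_paragraphs = soup.find_all("p")

# Combine a tag name with class_ to filter by CSS class.
sister_links = soup.find_all("a", class_="sister")
link_names = [a.get_text() for a in sister_links]

print(first_paragraph)
print(len(all_paragraphs))
print(link_names)
```

Passing class_ on its own, as in soup.find_all(class_="title"), should match any tag carrying that class regardless of its name.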

The contents attribute holds the element's direct children as a list.

soup = BeautifulSoup(...)
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents  # For an iterator, try .children
# [<title>The Dormouse's story</title>]
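To try this without the full sample file, here is a small sketch comparing .contents with .children. The one-line HTML string is an assumption made for this example; it is written without whitespace between tags so that the only children are tags.

```python
from bs4 import BeautifulSoup

# Assumed minimal document with no whitespace between tags.
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>One</p><p>Two</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

head_tag = soup.head

# .contents is a plain list, so it supports indexing and len().
head_children = head_tag.contents
title_text = head_children[0].string

# .children is an iterator over the same direct children.
body_child_names = [child.name for child in soup.body.children]

print(head_children)
print(title_text)
print(body_child_names)
```

Both attributes cover direct children only; the text inside each p tag is a child of that p, not of body.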

You also get the familiar parent, next_sibling and previous_sibling attributes, which return the element's parent, next sibling node and previous sibling node respectively. Note that instead of following next_sibling or previous_sibling repeatedly, you can use next_siblings and previous_siblings, which yield all of them as an iterator.
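The navigation attributes can be sketched on a small list. The markup below is made up for this example and deliberately contains no whitespace between tags, because whitespace text nodes also count as siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; no whitespace between tags, so every sibling is an <li>.
soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")

first_item = soup.li
second_item = first_item.next_sibling

# parent climbs one level up the tree.
parent_name = second_item.parent.name

# previous_sibling leads back to the node we started from.
back_to_first = second_item.previous_sibling

# next_siblings walks forward; previous_siblings walks backward, nearest first.
following = [li.get_text() for li in first_item.next_siblings]
last_item = soup.find_all("li")[-1]
preceding = [li.get_text() for li in last_item.previous_siblings]

print(parent_name)
print(following)
print(preceding)
```

With pretty-printed HTML, next_sibling will often be a newline string rather than a tag, so it is worth checking what kind of node you received before using it.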

And we're done

If you see anything incorrect in this post, let me know so I can fix it.

Image by Goumbik from Pixabay
