Beautiful Soup is a Python library for parsing and inspecting HTML and XML content. It does not fetch pages over HTTP (use requests or urllib3 for that), so you need to load the HTML into Beautiful Soup yourself before you can process it.
If you have ever used lxml or html5lib to parse HTML, Beautiful Soup can use either of them as its underlying parser (or Python's built-in html.parser). It wraps them in an API that makes HTML easier to work with, similar to how requests is built on urllib3.
You import it with import bs4, which provides a class called BeautifulSoup that represents a parsed HTML document. It takes the HTML as a string (or an open file handle) as its first argument and, optionally, the name of the parser to use as its second.
# html_doc is a string of HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
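Because Beautiful Soup only parses, fetching stays a separate step. The sketch below keeps the two apart. The helper name page_title and the sample string are made up for illustration, and the commented-out lines assume requests is installed, with https://example.com as a placeholder URL.

```python
from bs4 import BeautifulSoup


def page_title(html_text):
    """Return the text of the <title> tag, or None if there is none."""
    soup = BeautifulSoup(html_text, "html.parser")
    if soup.title is None:
        return None
    return soup.title.string


# Beautiful Soup works on any HTML string, however it was obtained.
sample = "<html><head><title>Hello</title></head><body></body></html>"
print(page_title(sample))

# Fetching is a separate step, for example with requests:
# import requests
# response = requests.get("https://example.com")
# print(page_title(response.text))
```

Keeping the parsing in its own function also means it can be tried on a saved HTML file without touching the network.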
To print the HTML on the screen, it helps to format it with indentation first using BeautifulSoup.prettify().
>>> print(soup.prettify())
<html>
 <head>
  <title>
# ...
Check out the sample HTML document in the Beautiful Soup documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. It is simple enough to demonstrate extracting elements in an easy-to-understand way.
The BeautifulSoup object exposes HTML tags as attributes named after them, such as soup.title, soup.p, soup.a and soup.b, and even soup.head and soup.body. Each of these returns only the first tag with that name. To collect every matching tag, call the find_all() method with the tag name as a string without angle brackets; it searches the whole document and returns the matches in a list-like object. You can also make find_all() search by CSS class by passing the class_ keyword argument. (The word "class" is a reserved keyword in Python, so the argument had to be named class_.)
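As a minimal sketch of the difference between attribute access and find_all(), consider the snippet below. The HTML string is a small made-up document, not the sample from the Beautiful Soup documentation.

```python
from bs4 import BeautifulSoup

# A small made-up document, used only for this illustration.
html = """
<html><body>
<p class="intro">First paragraph</p>
<p class="note">Second paragraph</p>
<a class="note" href="/one">Link one</a>
<a href="/two">Link two</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Attribute access returns only the first matching tag.
first_p = soup.p

# find_all() returns every matching tag in document order.
paragraphs = soup.find_all("p")
links = soup.find_all("a")

# class_ filters by CSS class, with or without a tag name.
notes = soup.find_all(class_="note")
note_links = soup.find_all("a", class_="note")

print(first_p.get_text())
print([p.get_text() for p in paragraphs])
print([a["href"] for a in links])
print([tag.name for tag in notes])
```

Passing only class_ matches any tag carrying that class, which is why notes here contains both a p and an a element.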
The contents attribute holds a list of the element's direct children.
soup = BeautifulSoup(...)
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents # For an iterator, try .children
# [<title>The Dormouse's story</title>]
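To make the difference between contents and children concrete, here is a short sketch on a made-up fragment. It assumes nothing beyond the two attributes mentioned above.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration; no whitespace between tags.
html = "<div><h1>Title</h1><p>Text</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents is a plain list, so it supports len() and indexing.
print(len(div.contents))
print(div.contents[0].name)

# .children yields the same direct children one at a time.
child_names = [child.name for child in div.children]
print(child_names)
```

Both cover direct children only; the text inside h1 and p belongs to those tags, not to the div.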
You also get the familiar parent, next_sibling and previous_sibling attributes, which return the element's parent, next sibling node and previous sibling node respectively. Note that instead of reading next_sibling or previous_sibling repeatedly, you can use next_siblings and previous_siblings, which yield all of them as an iterator. Keep in mind that whitespace between tags counts as a text node, so a sibling is often a string rather than a tag.
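The navigation attributes can be tried out on a tiny made-up list, as in the sketch below. The markup deliberately has no whitespace between tags, so every sibling is a tag; the isinstance check is there for real pages, where strings show up between tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup with no whitespace between tags, so that every
# sibling is a tag rather than a whitespace string.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

second = soup.find_all("li")[1]

# Move up to the parent and sideways to the neighbours.
print(second.parent.name)
print(second.previous_sibling.get_text())
print(second.next_sibling.get_text())

# next_siblings walks every following sibling in one pass.
first = soup.li
following = [sib.get_text() for sib in first.next_siblings if isinstance(sib, Tag)]
print(following)
```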
And we're done
If you see anything incorrect in this post, let me know so I can fix it.