Using Beautiful Soup For Web Scraping

#webdev #python

Beautiful Soup is a Python library for parsing and inspecting HTML and XML content. It does not fetch pages over HTTP itself (use requests or urllib3 for that), so you first obtain the HTML and then load it into Beautiful Soup to process it.

If you have ever used lxml or html5lib to parse HTML directly, Beautiful Soup can use either of them as its underlying parser (or Python's built-in html.parser) and wraps them in a friendlier API, similar to how requests is built on urllib3.

The package is imported as bs4, and it provides a class called BeautifulSoup that represents a parsed HTML document. Its constructor takes the HTML as a string (or an open file handle), followed by the name of the parser to use.

# html_doc is a string of HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

To print the HTML to the screen, it helps to format it with indentation first using the prettify() method.

>>> print(soup.prettify())
<html>
 <head>
  <title>
# ...

Check out the sample HTML file listed at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. It is simple enough to demonstrate extracting elements in an easy-to-understand way.

The BeautifulSoup object lets you reach tags by name as attributes, such as soup.title, soup.p, soup.a and soup.b, and even soup.head and soup.body; each returns the first tag with that name. To collect every tag with a given name, call the find_all() method with the tag name as a string, without angle brackets, and it returns all matches across the document in a list. You can also tell find_all() to search by CSS class by passing the class_ keyword argument. (The word "class" is a reserved keyword in Python, so the argument had to be named class_.)
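As a minimal sketch of the difference between attribute access and find_all(), the snippet below uses a small stand-in document loosely modeled on the sample in the Beautiful Soup documentation. The html_doc contents and the variable names are illustrative assumptions, not part of the library.

```python
from bs4 import BeautifulSoup

# Stand-in document for illustration (loosely modeled on the docs sample).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access returns only the first tag with that name.
first_link = soup.a

# find_all() returns every matching tag in a list-like result.
links = soup.find_all("a")
link_ids = [a["id"] for a in links]

# class_ restricts the search to tags carrying that CSS class.
story_paragraphs = soup.find_all("p", class_="story")

print(soup.title.string)
print(link_ids)
print(len(story_paragraphs))
```

Tag attributes such as id and href are read with dictionary-style indexing, as in a["id"] above.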

The contents attribute holds a tag's direct children as a list.

soup = BeautifulSoup(...)
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents  # For an iterator, try .children
# [<title>The Dormouse's story</title>]
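For comparison, here is a small self-contained sketch of .contents next to .children. The one-line HTML string is an assumption made up for this example. Note that text inside a tag counts as a child too.

```python
from bs4 import BeautifulSoup

# Illustrative one-line document; not the full sample from the docs.
html_doc = "<html><head><title>The Dormouse's story</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

head_tag = soup.head

# .contents is a plain list, so it supports len() and indexing.
title_tag = head_tag.contents[0]

# .children yields the same direct children lazily, one at a time.
child_names = [child.name for child in head_tag.children]

# A tag's text is itself a child node (a string, not a tag).
title_children = title_tag.contents

print(len(head_tag.contents), child_names, title_children)
```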

You also get the familiar parent, next_sibling and previous_sibling attributes, which return the element's parent, next sibling node and previous sibling node respectively. Note that instead of following next_sibling or previous_sibling repeatedly, you can use next_siblings and previous_siblings, which return all of them as an iterator.
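The navigation attributes can be sketched on a tiny list like this. The markup is a made-up example, written without whitespace between tags so that every sibling is a tag; in real pages the whitespace between tags shows up as text-node siblings.

```python
from bs4 import BeautifulSoup

# Made-up markup with no whitespace between the <li> tags.
soup = BeautifulSoup("<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser")
first, second, third = soup.find_all("li")

parent_name = second.parent.name            # the enclosing <ul>
before = second.previous_sibling.string     # text of the <li> just before
after = second.next_sibling.string          # text of the <li> just after

# The plural forms iterate over all remaining siblings in that direction.
following = [li.string for li in first.next_siblings]
preceding = [li.string for li in third.previous_siblings]  # nearest sibling first

# At either end of the list there is no sibling, so the attribute is None.
no_sibling = first.previous_sibling

print(parent_name, before, after, following, preceding, no_sibling)
```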