DEV Community

Apify for Apify

Posted on • Originally published at blog.apify.com on

How to parse HTML in Python with PyQuery or Beautiful Soup

The internet is full of information, but most of its data is not structured for further processing. Web scraping is the best method of retrieving web data to put it into a structured format. However, web pages are written in HTML (Hypertext Markup Language). So to read the data from a web page, you first have to parse the HTML.

To "parse", I mean divide the HTML code into tags, attributes, etc., and organize these elements into a tree-like structure called a DOM. With the DOM, you can access and read the data you need and store it any way you want.

How to parse HTML in Python. Diagram of HTML file DOM structure.

The HTML file will converted into a tree-like structure (DOM)

Parsing HTML is a method you can use for lots of things, from web scraping to data mining. Python is one of the most popular languages for web scraping and working with data and Python libraries like Beautiful Soup and PyQuery are great for parsing HTML.

πŸ’‘Related: What are the best Python web scraping libraries?

In this blog post, we'll look at how to parse HTML using Python with PyQuery and Beautiful Soup. We'll also discuss which library is best and why. Finally, we'll take a look at advanced HTML parsing techniques and troubleshooting tips.

πŸ’‘Related: Learn how to parse JSON with Python

Overview of Beautiful Soup and PyQuery

PyQuery and Beautiful Soup offer powerful features for parsing HTML documents quickly and efficiently. These HTML parsing libraries are actively maintained, lightweight, high-performance, and have strong community support.

Beautiful Soup

Beautiful Soup is a popular library for parsing HTML files. It's widely used for web scraping and is easy to learn and apply. Beautiful Soup is a Python library, so you need to install it before using it.

Once youve installed Beautiful Soup, you can use its various functions to scrape data from HTML files. To scrape data from a web page, youll also need to use the requests module to perform HTTP requests and fetch the page's HTML.

Scraping data from HTML files with Beautiful Soup

Ways to extract data using Beautiful Soup

When you parse an HTML file using Beautiful Soup, a tree-like structure called soup is created. This tree retains the hierarchical structure of the HTML document, making it easy to navigate through the tree to extract data using methods such as find_all(), find(), and select().

Beautiful Soup has some common use cases, like web scraping (extracting stock prices, articles, and product info), data extraction (parsing an HTML document, extracting a specific piece of content, link, or title), data cleaning (removing unnecessary HTML tags and attributes), and more.

PyQuery

PyQuery is a powerful Python library that makes parsing HTML documents easy. It has a syntax and API similar to jQuery for parsing, manipulating, and querying HTML documents. It allows you to manipulate the document in various ways, such as adding elements, changing text or content, and modifying attributes.

With PyQuery, you can select elements from an HTML document with CSS selectors and XPath expressions, and it also offers some custom functions like filter(), eq(), and slice() to refine sections.

To work with PyQuery, you'll need to create a PyQuery object from an HTML file. Then you can use PyQuery to navigate the HTML structure, manipulate the text, and get data from it.

PyQuery is a popular choice for developers who are familiar with jQuery and prefer a similar syntax for HTML parsing in Python. Common use cases are web scraping, data extraction, and HTML manipulation.

Setup environment

Let's begin by installing Beautiful Soup and PyQuery on your device. To install Beautiful Soup on your system, make sure youve got Python 3 and the Requests module (for making HTTP requests to get data from the web page) installed on your system.

Install Beautiful Soup using the following command:

pip install beautifulsoup4
Enter fullscreen mode Exit fullscreen mode

After the installation is complete, check if Beautiful Soup has been successfully installed by using this:

pip show beautifulsoup4
Enter fullscreen mode Exit fullscreen mode

To install PyQuery on your system, use the following command:

pip install pyquery
Enter fullscreen mode Exit fullscreen mode

After the installation is complete, check if PyQuery has been successfully installed by using this:

pip show pyquery
Enter fullscreen mode Exit fullscreen mode

If you encounter any problems while installing Beautiful Soup or PyQuery, here are a few things you can check:

  1. Make sure Python 3 is installed on your system because Beautiful Soup support for Python 2 has been discontinued.

  2. Make sure pip is installed. If it is installed, make sure to upgrade it (use the command python.exe -m pip install --upgrade pip).

  3. After pip is upgraded, try installing both libraries again. First, uninstall them (pip uninstall PyQuery or pip uninstall beautifulsoup4), and then reinstall them.

Now that you know what Beautiful Soup is and how to install it, let's get right into HTML parsing.

First, I'll show you how to perform basic HTML parsing from an HTML file. Then well move on to parsing actual web pages.

Basic HTML parsing with Beautiful Soup

Create a new HTML file in your code editor, and copy and paste the following HTML code. Well use this code to understand some fundamental parsing techniques.

test.html:

<html> <head> <title>Apify Blog</title> </head> <body> <header> <div class="nav"> <ul id="navlist" style="height:100px"> <li class="apify"><a href="https://apify.com">Apify.com</a></li> <li class="store"><a href="https://apify.com/store">Apify Store</a></li> <li class="documentation"><a href="https://apify.com/docs">Documentation</a></li> <li class="help"><a href="https://apify.com/help">Help</a></li> </ul> </div> </header> <div class="post"> <h2 class="title">First Blog</h2> <p class="content"> Python and APIs - Part 1: How to use Python to connect and interact with APIs <a href="https://blog.apify.com/python-and-apis/">Click here</a> In the first part of this Python and APIs series, you'll learn how to use Python to connect and interact with APIs. </p> </div> <div class="post"> <h2 class="title">Second Blog</h2> <p class="content"> 6 things you should know before buying or building a web scraper: <a href="https://blog.apify.com/6-things-to-know-about-web-scraping/">Click here</a> From understanding the complexity of target websites to legal considerations, discover what we wish we'd been telling our customers from day one. </p> </div> </body></html>
Enter fullscreen mode Exit fullscreen mode

Import the Beautiful Soup library, then open the HTML file using file handling techniques. Create a soup object by passing the HTML file to the BeautifulSoup constructor.

from bs4 import BeautifulSoupwith open("test.html") as file: soup = BeautifulSoup(file, "html.parser")
Enter fullscreen mode Exit fullscreen mode

Now that the soup object is created, let's retrieve the title of the "test.html" file. The title can be found at the location "head title".

print(soup.head.title) # Output: <title>Apify Blog</title>
Enter fullscreen mode Exit fullscreen mode

To get the text of the title, you can use the .text function.

print(soup.head.title.text) # Output: Apify Blog
Enter fullscreen mode Exit fullscreen mode

Suppose you want to retrieve the text of the first <li> tag, located at "body div ui".

print(soup.body.div.ul.li.text) # Output: Apify.com
Enter fullscreen mode Exit fullscreen mode

Have you noticed anything strange about the code snippet above?

No? Well, the strange thing is the multiple dots. Youre at the sixth level of the HTML tree, and navigating through all of those levels can be tedious.

Thankfully, Beautiful Soup provides two special methods called find() and find_all(), which you can use to retrieve HTML elements without having to navigate through the tree.

Let's start with the find_all() method. The find_all() method takes a string as an argument and returns a list of elements that match the provided string. For example, if you want all of the div elements with the class post, use the below code:

div_post = soup.find_all("div", class_="post")print(div_post)
Enter fullscreen mode Exit fullscreen mode

Heres the output:

[<div class="post"><h2 class="title">First Blog</h2><p class="content"> Python and APIs - Part 1: How to use Python to connect and interact with APIs <a href="https://blog.apify.com/python-and-apis/">Click here</a> In the first part of this Python and APIs series, you'll learn how to use Python to connect and interact with APIs. </p></div>, <div class="post"><h2 class="title">Second Blog</h2><p class="content"> 6 things you should know before buying or building a web scraper: <a href="https://blog.apify.com/6-things-to-know-about-web-scraping/">Click here</a> From understanding the complexity of target websites to legal considerations, discover what we wish we'd been telling our customers from day one. </p></div>]
Enter fullscreen mode Exit fullscreen mode

The find_all() method returns a list of all the div tags that have the class="post" attribute. Now let's extract the content and links inside each div tag. To do this, you can use the find() method. The find() method works the same as find_all(), but it returns the first matching element instead of a list.

for post in div_post: # Find all the <h2> tags inside the <div> with the class post that also has the class title, and then print the text using the (.text) method. title = post.find("h2", class_="title").text # Find all the <p> tags inside the <div> with the class post that also has a content class, and then print the text using the (.text) method. content = post.find("p", class_="content").text # Find all the href attributes of <a> tags inside the <div> with the class post. link = post.find("a")["href"] print("Title: ", title) print("Content: ", content) print("Link: ", link)
Enter fullscreen mode Exit fullscreen mode

Heres the code output:

Title: First BlogContent: Python and APIs - Part 1: How to use Python to connect and interact with APIs Click here In the first part of this Python and APIs series, you'll learn how to use Python to connect and interact with APIs.Link: https://blog.apify.com/python-and-apis/Title: Second BlogContent: 6 things you should know before buying or building a web scraper: Click here From understanding the complexity of target websites to legal considerations, discover what we wish we'd been telling our customers from day one.Link: https://blog.apify.com/6-things-to-know-about-web-scraping/
Enter fullscreen mode Exit fullscreen mode

Now, what if you want the whole text of the HTML file? You can use the get_text() method.

print(soup.get_text())
Enter fullscreen mode Exit fullscreen mode

This produces the following output:

Apify Blog Apify.com Apify Store DocumentationHelpFirst Blog Python and APIs - Part 1: How to use Python to connect and interact with APIs Click here In the first part of this Python and APIs series, you'll learn how to use Python to connect and interact with APIs.Second Blog 6 things you should know before buying or building a web scraper: Click here From understanding the complexity of target websites to legal considerations, discover what we wish we'd been telling our customers from day one.
Enter fullscreen mode Exit fullscreen mode

Scraping a website Beautiful Soup in action

Now that youve got a basic understanding of Beautiful Soup, let's scrape a website to better understand HTML parsing.

For this purpose, well use a website thats specifically designed for scraping: https://crawler-test.com/

Making an HTTP request

First, you need to make an HTTP request to the above URL using the Python Requests library to extract the content from a web page. This is why we included installing the Requests library as a prerequisite above.

import requests# Make an HTTP Requesttarget_url = '<https://crawler-test.com/>'response_data = requests.get(target_url)print('Response status: ', response_data.status_code)print('Response content: ', response_data.text[:15])
Enter fullscreen mode Exit fullscreen mode

Heres the code output:

Response status: 200Response content: <!DOCTYPE html>
Enter fullscreen mode Exit fullscreen mode

We made an HTTP request to the URL, and it returned a status code of 200, which means that the request was successful. Then we printed the text up to 15 characters because otherwise, the returned text would have been very large.

Parsing the response

When you make an HTTP request, you receive some data as a response. This data is the HTML content of the web page. The HTML content is hidden inside the response_data.text, so we simply pass it to the BeautifulSoup constructor for parsing.

We parse the HTML content because after parsing, it becomes a tree-like structure. This makes it easy for us to navigate through the tree using built-in methods.

from bs4 import BeautifulSoupimport requests# Make an HTTP Requesttarget_url = 'https://crawler-test.com/'response_data = requests.get(target_url)# Parse the HTML contentsoup = BeautifulSoup(response_data.text, 'html.parser')
Enter fullscreen mode Exit fullscreen mode

Getting started with scraping HTML

Use the find_all() and find() methods to extract HTML tags from the parsed HTML. For example, if you want to extract the title, headings, and links, you can use the following code:

title = soup.find('title')link = soup.find_all('a', href=True) # List of 'a' tagsheading = soup.find('h1')print(title)print(link)print(heading)
Enter fullscreen mode Exit fullscreen mode

Youll get the title, heading1, and a list of <a> tags as the output.

To find an HTML element by class name , pass the class name as a dictionary argument to the find_all() method. For example, if you want to extract all the h3 headings of the class panel-header, you can use the following code:

container = soup.find_all('div', {'class': 'panel'})heading_text = []for box in container: title = box.find('h3') heading_text.append(title.text)print(heading_text)
Enter fullscreen mode Exit fullscreen mode

Now let's see what this code does. Every panel class has a panel-header class, and every panel-header class has an h3heading, as you can see in the visual below.

How to parse HTML in Python. Expanding the panel class and scraping the headings

Expand the panel class and then scrape the headings

Well first create a container by using the find_all() method. Then well loop through each of the boxes present inside the container.

How to parse HTML in Python. Container.

Many boxes are present inside a single container

Heres the output:

['Mobile', 'Description Tags', 'Encoding', 'Titles', 'Robots Protocol', 'Redirects', 'Links', 'URLs', 'Canonical Tags', 'Status Codes', 'Social Tags', 'Content', 'Other']
Enter fullscreen mode Exit fullscreen mode

To find an HTML element by its ID name , pass the ID name to the id parameter of the find() method. In the image below, we want to extract the div with the ID header. This HTML div part focuses on the header of this website.

How to parse HTML in Python. Showing the HTML text of the header

Showing the HTML text of the header

print(soup.find(id='header'))
Enter fullscreen mode Exit fullscreen mode

This produces the following output:

<div id="header"><a href="/" id="logo">Crawler Test <span class="neon-effect">two point oh!</span></a><div style="position:absolute;right:520px;top:-4px;"></div></div>
Enter fullscreen mode Exit fullscreen mode

Another approach to selecting elements is to use the select() method and pass the CSS selector as a string to the method.

# Display all <div> tags that have the panel-header class.print(soup.select('div .panel-header'))# Display all the <h3> headings that are inside the panel-header class.print(soup.select('div .panel-header > h3')) # Display all the <a> tags that are inside the panel class.print(soup.select('div .panel > a'))
Enter fullscreen mode Exit fullscreen mode

Great! Weve explored some useful methods of Beautiful Soup to parse HTML in Python.

There are hundreds of other Beautiful Soup methods (some of which are quite advanced) that can be used for various purposes. Well discuss some of those advanced HTML parsing techniques later.

Basic HTML parsing with PyQuery

Let's dive into HTML parsing with PyQuery. If youre already familiar with jQuery (a JavaScript library), then the syntax will be easy for you to understand. If youre not experienced with jQuery, don't worry, youll still be able to understand PyQuery easily.

As we did with Beautiful Soup, well first look at HTML parsing with PyQuery using an HTML file. Then well move on to parsing real-world web pages. I hope youve got the HTML file (test.html) that we created in the previous section ready!

Import the PyQuery library. Then, open the HTML file using file handling techniques. Create a PyQuery object by passing the HTML data to the pq constructor.

from pyquery import PyQuery as pqwith open("test.html", "r") as file: html_content = file.read() pq_object = pq(html_content)
Enter fullscreen mode Exit fullscreen mode

Cool! The PyQuery object has been created. Now, suppose you want the text of all the <h2> headings. To do this, you can pass the string "h2" to the PyQuery object and use the .text() method to extract the text.

    text = pq_object("h2").text() print(text) # Output: First Blog Second Blog
Enter fullscreen mode Exit fullscreen mode

What if you want to extract the HTML of the particular tag instead of text? To do this, you can use the .html() method.

text = pq_object("ul").html()print(text)
Enter fullscreen mode Exit fullscreen mode

Heres the code output:

<li class="apify"><a href="https://apify.com">Apify.com</a></li><li class="store"><a href="https://apify.com/store">Apify Store</a></li><li class="documentation"><a href="https://apify.com/docs">Documentation</a></li><li class="help"><a href="https://apify.com/help">Help</a></li>
Enter fullscreen mode Exit fullscreen mode

You can retrieve the attributes of a particular tag using the .attr() method. For example, in the code below, we retrieve the id and style attributes of the ul tag.

tag = pq_object("ul")print(tag.attr("id"))print(tag.attr("style"))
Enter fullscreen mode Exit fullscreen mode

Now let's play with <a> tags. You can print all <a> tags with the proper HTML, print only the text of the <a> tags, and print specific <a> tags.

To retrieve all <a> tags with their HTML, simply pass the <a> tag as a string to the PyQuery object. This will create a list of all the <a> tags, and then you can loop through the tags and print them one by one.

Additionally, were passing the loop variable to the PyQuery object. This is because if we print the variable directly (print(i)), itll print the location at which the tag is present, such as <Element a at 0x230d97b3500>.

for i in pq_object("a"): print(pq_object(i))
Enter fullscreen mode Exit fullscreen mode

Heres the output:

<a href="https://apify.com">Apify.com</a><a href="https://apify.com/store">Apify Store</a><a href="https://apify.com/docs">Documentation</a><a href="https://apify.com/help">Help</a><a href="https://blog.apify.com/python-and-apis/">Click here</a> In the first part of this Python and APIs series, you'll learn how to use Python to connect and interact with APIs.<a href="https://blog.apify.com/6-things-to-know-about-web-scraping/">Click here</a> From understanding the complexity of target websites to legal considerations, discover what we wish we'd been telling our customers from day one.
Enter fullscreen mode Exit fullscreen mode

To retrieve the text content of only the <a> tag, use the .text() method.

for i in pq_object("a"): print(pq_object(i).text())
Enter fullscreen mode Exit fullscreen mode

This produces the following output:

Apify.comApify Store DocumentationHelpClick here Click here
Enter fullscreen mode Exit fullscreen mode

To retrieve a specific tag, you can use the eq() method. Simply pass the index of the tag as an argument. For example, to retrieve the first <a> tag, pass 0 as an argument to the eq() method. To retrieve the second <a> tag, pass 1, and so on.

print(pq_object("a").eq(0)) # Output: <a href="https://apify.com">Apify.com</a>print(pq_object("a").eq(0).text()) # Output: Apify.com
Enter fullscreen mode Exit fullscreen mode

Scraping a website - PyQuery in action

Now that youve got a basic understanding of PyQuery, let's scrape a website to better understand HTML parsing with PyQuery. For this purpose, well use the same website as before: https://crawler-test.com/

Making an HTTP request

First, you need to make an HTTP request to the above URL using the Python Requests library to extract the content from a web page.

import requests# Make an HTTP Requesturl = 'https://crawler-test.com/'response = requests.get(target_url)print('Response status: ', response.status_code)print('Response content: ', response.text[:15])
Enter fullscreen mode Exit fullscreen mode

Heres the code output:

Response status: 200Response content: <!DOCTYPE html>
Enter fullscreen mode Exit fullscreen mode

Again, we made an HTTP request to the URL, and it returned a status code of 200, which means that our request was successful.

Parsing the response

When you make an HTTP request, you receive some data as a response. This data is the HTML content of the web page. The HTML content is hidden inside the response.text, so we simply pass it to the PyQuery (pq) constructor for parsing.

from pyquery import PyQuery as pqimport requestsurl = "https://crawler-test.com"# Send a GET request to the URL and store the responseresponse = requests.get(url)# Create a PyQuery document from the response textdocument = pq(response.text)
Enter fullscreen mode Exit fullscreen mode

A parsed PyQuery object has been created and named document. You can now use the built-in methods to scrape the information from the HTML of the web page.

Getting started with scraping HTML

First, let's retrieve the title of the URL using the .text() method.

print(document('title').text()) # Output: Crawler Test Site
Enter fullscreen mode Exit fullscreen mode

To retrieve the text of all heading 3 elements within the panel-header class of a div element, you can pass the class name as a string to the constructor and then use the find() method to extract the text of all the heading 3 elements.

target_div = document('.panel-header') headings = target_div.find('h3')for heading in headings: print(heading.text)
Enter fullscreen mode Exit fullscreen mode

Heres the output:

MobileDescription TagsEncoding TitlesRobots Protocol Redirects LinksURLsCanonical Tags Status Codes Social Tags ContentOther
Enter fullscreen mode Exit fullscreen mode

Let's retrieve all the div elements that have id = header. To do this, you can pass the selector div#header to the document constructor. The #header selector indicates that this is an ID selector.

target_div = document('div#header')print(target_div)
Enter fullscreen mode Exit fullscreen mode

Heres the code output:

<div id="header"> <a href="/" id="logo">Crawler Test <span class="neon-effect">two point oh!</span></a> <div style="position:absolute;right:520px;top:-4px;"/> </div>
Enter fullscreen mode Exit fullscreen mode

To retrieve the text of all <a> elements within all elements that have id=header, you can first find all elements that have the ID of header and then pass the <a> element.

header = document("#header")a_tag = header("a")print(a_tag.text())
Enter fullscreen mode Exit fullscreen mode

This creates the following output:

Crawler Test two point oh!
Enter fullscreen mode Exit fullscreen mode

Let's retrieve the link inside the specified tag. First, well extract all the <a> tags with the ID of logo and then use the .attr() method to retrieve the link by passing the href attribute as a string. Unfortunately, this <a> tag only has / in the href attribute.

logo_links = document('a#logo')print(logo_links.attr['href']) # Output: /
Enter fullscreen mode Exit fullscreen mode

You can retrieve all the links on a web page, but it would be better if you could also add some filters. For example, you might want to retrieve all the links that have a particular keyword in the link or in the text of the <a> tag.

In the following code, were trying to extract all the links that have HTTP keywords in their text.

all_links = document('a')# Use the filter() function to select only links with certain criteriafiltered_links = all_links.filter(lambda i, elem: 'Status' in pq(elem).text())print(filtered_links)
Enter fullscreen mode Exit fullscreen mode

Heres the output:

<a href="/description_tags/description_http_equiv">HTTP-equiv description</a><a href="/status_codes/status_100">100 HTTP Status</a><a href="/status_codes/status_101">101 HTTP Status</a><a href="/status_codes/status_102">102 HTTP Status</a>......
Enter fullscreen mode Exit fullscreen mode

Now, what if you want to filter all the tags that have a particular keyword in their links? To do this, you need to use the .attr('href') function so that you can directly search for the keyword in the link.

filtered_links = all_links.filter(lambda i, elem: 'status_codes' in pq(elem).attr['href'])print(filtered_links)
Enter fullscreen mode Exit fullscreen mode

This produces the following output:

<a href="/status_codes/status_100">100 HTTP Status</a><a href="/status_codes/status_101">101 HTTP Status</a><a href="/status_codes/status_102">102 HTTP Status</a><a href="/status_codes/status_200">200 HTTP Status</a>......
Enter fullscreen mode Exit fullscreen mode

Great! Now weve also explored some useful PyQuery methods to parse HTML in Python. There are also some advanced PyQuery methods that can be used for various purposes. Well look at these below.

Advanced HTML parsing techniques

We've covered the basics of HTML parsing with PyQuery and Beautiful Soup. Now let's look at some advanced HTML parsing techniques. You're gonna learn how to parse HTML using regular expressions, how to fix broken HTML, how to remove unnecessary tags, and the concept of parents, children, and siblings.

Parsing HTML with RegEx

One method of parsing HTML is through the use of regular expressions (RegEx). Let's say you want to scrape all the p tags from the HTML content. For this, use the following regular expression pattern:

pattern = r'<p.*?>(.*?)</p>'
Enter fullscreen mode Exit fullscreen mode

Let's break down that pattern:

  • .*? matches any character zero or more times until it reaches the first > character.

  • (.*?) captures the text between the <p> and </p> tags.

Beautiful Soup:

import refrom bs4 import BeautifulSoupwith open('test.html', 'r') as file: html_content = file.read()soup = BeautifulSoup(html_content, 'html.parser')p_tags = re.findall(r'<p .*?>(.*?)</p>', str(soup), re.DOTALL)for p in p_tags: print(f"Found Paragraph: {p}")
Enter fullscreen mode Exit fullscreen mode

PyQuery:

import refrom pyquery import PyQuery as pqwith open('test.html', 'r') as file: html_content = file.read()doc = pq(html_content)p_tags = re.findall(r'<p .*?>(.*?)</p>', str(doc), re.DOTALL)for p in p_tags: print(f"Found Paragraph: {p}")
Enter fullscreen mode Exit fullscreen mode

There are various methods in regular expressions (RegEx) for forming different patterns according to your needs. But well not delve any deeper into RegEx here. For more information, you can refer to the official documentation.

Find parent, child(ren), and sibling(s)

To find the parent, use the find_parent() method. You can print the name of the parent using the .name attribute.

from bs4 import BeautifulSoupwith open('test.html', 'r') as file: html_content = file.read()soup = BeautifulSoup(html_content, 'html.parser')target_element = soup.find('li')parent = target_element.find_parent()print("Parent:", parent.name)
Enter fullscreen mode Exit fullscreen mode

Heres the code output:

Parent: ul
Enter fullscreen mode Exit fullscreen mode

To find the first child of an HTML element, use the find_child() method. To find all the children of an HTML element, use the find_children() method.

target_element = soup.find('ul')print("Single Child:")print(target_element.findChild())print("Children:")print(target_element.findChild())
Enter fullscreen mode Exit fullscreen mode

This produces the following output:

Single Child: <li class="apify"><a href="<https://apify.com>">Apify.com</a></li>Children:[<li class="apify"><a href="https://apify.com">Apify.com</a></li>, <a href="<https://apify.com>">Apify.com</a>, <li class="store"><a href="https://apify.com/store">Apify Store</a></li>, <a href="https://apify.com/store">Apify Store</a>, <li class="documentation"><a href="https://apify.com/docs">Documentation</a></li>, <a href="https://apify.com/docs">Documentation</a>, <li class="help"><a href="https://apify.com/help">Help</a></li>, <a href="https://apify.com/help">Help</a>]
Enter fullscreen mode Exit fullscreen mode

To find the next sibling after an HTML element, use the find_next_sibling() method.

target_element = soup.find('h2')sibling = target_element.find_next_sibling()print("Sibling", sibling)print("Sibling name", sibling.name)
Enter fullscreen mode Exit fullscreen mode

Heres the code output:

Sibling:<p class="content"> Python and APIs - Part 1: How to use Python to connect and interact with APIs <a href="https://blog.apify.com/python-and-apis/">Click here</a> In the first part of this Python and APIs series, you'll learn how to use Python to connect and interact with APIs.</p>Sibling name: p
Enter fullscreen mode Exit fullscreen mode

To find all the next siblings after an HTML element, use the find_next_siblings() method.

target_element = soup.find('li')sibling = target_element.find_next_siblings()print(sibling)
Enter fullscreen mode Exit fullscreen mode

This produces the following output:

[<li class="store"><a href="https://apify.com/store">Apify Store</a></li>, <li class="documentation"><a href="https://apify.com/docs">Documentation</a></li>, <li class="help"><a href="https://apify.com/help">Help</a></li>]
Enter fullscreen mode Exit fullscreen mode

You can also find the previous sibling before an HTML element using the find_previous_sibling() method. Also, to find all the previous siblings, use the find_previous_siblings() method.

target_element = soup.find('p')sibling = target_element.find_previous_sibling()print(sibling)
Enter fullscreen mode Exit fullscreen mode

Heres the output:

<h2 class="title">First Blog</h2>
Enter fullscreen mode Exit fullscreen mode

Unknown tags

You may have heard of the <description> tag and <canonical> tag. These tags are found in the head of every HTML file, but not many people talk about them. So, let's retrieve the <description> tag and <canonical> tag using Beautiful Soup and PyQuery.

Below is an image of the HTML from the URL https://blog.apify.com/. Well try to retrieve these tags from this URL. You can see the <description> and <canonical> tags in the image.

How to parse HTML in Python. Description and canonical tags

See the description and canonical tags in the above HTML code

Beautiful Soup:

import requestsfrom bs4 import BeautifulSoupurl = "https://blog.apify.com/"response = requests.get(url)html_text = response.textsoup = BeautifulSoup(html_text, 'html.parser')# Extract description tagdescription_tag = soup.find('meta', attrs={'name': 'description'})description = description_tag['content'] if description_tag else ''# Extract canonical tagcanonical_tag = soup.find('link', attrs={'rel': 'canonical'})canonical = canonical_tag['href'] if canonical_tag else ''print("Description:", description)print("Canonical:", canonical)
Enter fullscreen mode Exit fullscreen mode

PyQuery:

from pyquery import PyQuery as pqimport requestsurl = "https://blog.apify.com/"response = requests.get(url)html_text = response.textdoc = pq(html_text)# Extract description tagdescription = doc('meta[name="description"]').attr('content') or ''# Extract canonical tagcanonical = doc('link[rel="canonical"]').attr('href') or ''print("Description:", description)print("Canonical:", canonical)
Enter fullscreen mode Exit fullscreen mode

Heres the code output:

Description: Product updates, tips, and stories from the world of web scraping, data extraction, and web automation. Apify lets you turn any website into an API.Canonical: https://blog.apify.com/
Enter fullscreen mode Exit fullscreen mode

Removing specific HTML Tags

To remove specific HTML tags, you can use the decompose() method. In the code below, were trying to remove the litag that has a help class.

Beautiful Soup:

from bs4 import BeautifulSoupwith open("test.html") as file: soup = BeautifulSoup(file, "html.parser") for li_tag in soup.find_all('li', class_='help'): li_tag.decompose()# Get the modified HTMLmodified_html = str(soup)print(modified_html)
Enter fullscreen mode Exit fullscreen mode

PyQuery:

from pyquery import PyQuery as pqwith open('test.html', 'r') as file: html_content = file.read()# Create a PyQuery objectdoc = pq(html_content)# Remove all <li> tags with class="help"doc('li.help').remove()# Get the modified HTMLmodified_html = doc.html()print(modified_html)
Enter fullscreen mode Exit fullscreen mode

Sometimes youll get HTML content that is very messy with empty tags. To remove these, you can use the .extract() method.

from bs4 import BeautifulSoup# Read the HTML filewith open("test.html") as file: soup = BeautifulSoup(file, "html.parser")# Find and remove empty tagsfor tag in soup.find_all(): if len(tag.get_text(strip=True)) == 0: tag.extract()# Print the cleaned HTMLprint(soup)
Enter fullscreen mode Exit fullscreen mode

Fix broken HTML

Sometimes we might forget to close tags. Other times, open tags are deleted by mistake. In that case, you can fill in the missing HTML elements using the prettify() method.

What exactly does this method do? Suppose youve got the following HTML file that does not have the closing tags </ul> and </div>. The prettify() method will add these closing tags.

broken1.html:

<div class="nav"> <ul id="navlist" style="height: 100px"> <li class="apify"><a href="https://apify.com">Apify.com</a></li> <li class="help"><a href="https://apify.com/help">Help</a></li>
Enter fullscreen mode Exit fullscreen mode

If you have an HTML file in which the opening tags are missing, see the following file. The prettify() method will remove the closing tags in this case as well.

broken2.html:

<li class="apify"><a href="https://apify.com">Apify.com</a></li> <li class="help"><a href="https://apify.com/help">Help</a></li> </ul></div>
Enter fullscreen mode Exit fullscreen mode

Here's the code:

from bs4 import BeautifulSoupwith open('file_name.html', 'r') as file: html_content = file.read()soup = BeautifulSoup(html_content, 'html.parser')print(soup.prettify())
Enter fullscreen mode Exit fullscreen mode

This is the output for broken1.html:

<div class="nav"> <ul id="navlist" style="height: 100px"> <li class="apify"> <a href="https://apify.com"> Apify.com </a> </li> <li class="help"> <a href="https://apify.com/help"> Help </a> </li> </ul></div>
Enter fullscreen mode Exit fullscreen mode

And heres the output for broken2.html:

<li class="apify"> <a href="https://apify.com"> Apify.com </a></li><li class="help"> <a href="https://apify.com/help"> Help </a></li>
Enter fullscreen mode Exit fullscreen mode

Other Methods - first/last and odd/even

Look at both boxes in the image below. The first <a> tag occurs in the left box, and the last <a> tag occurs in the right box. We can retrieve the first and last elements using the first() and last() methods of pyquery. To do this, we simply pass the tag for which we want to find the first and last occurrence, followed by a colon, and then the first() or last()function.

You can also retrieve the particular tag in the odd or even occurrence. For example, if you want to retrieve all the text of the <a> tag in an odd or even manner, you would use the odd() or even() function, similar to the first() or last()functions.

How to parse HTML in Python. First and last functions

From the first text to the last text of the web page, you can print using the first() and last() functions

The code below should help give you a better understanding. Were only retrieving the first and last <a> tags.

first_li = document("a:first")print(first_li.text()) # Output: Crawler Test two point oh!last_li = document("a:last")print(last_li.text()) # Output: Basic Auth
Enter fullscreen mode Exit fullscreen mode

Now let's retrieve a tag in odd and even forms. Youll get a list of elements, so to print line by line, use the each() function as shown below:

from pyquery import PyQuery as pqimport requestsdef print_item(self, item): print(item.text)url = "https://crawler-test.com//"response = requests.get(url)document = pq(response.text)odd_lis = document("a:odd") odd_lis.each(print_item)
Enter fullscreen mode Exit fullscreen mode

Heres the output:

Separate Desktop page with separate mobile and/or AMPSeparate Desktop with different H1Separate Desktop with different wordcountSeparate Desktop with different links outSeparate Desktop with mobile and AMP which self canonicalisesA dynamically served pageNo mobile configurationAn AMP page which also has a dedicated mobile page............
Enter fullscreen mode Exit fullscreen mode

Troubleshooting HTML parsing issues

When working with HTML parsers, you may need to troubleshoot some issues. Here are some suggestions for fixing an HTML parser in Python:

  • Make sure there are no syntax errors in your code.

  • Be sure to import the parser correctly or try a different parser, such as lxml or html5lib.

  • Make sure youre using the latest version of Python.

  • Check the HTML source code for errors.

If you don't understand what Beautiful Soup is doing to your HTML, pass it to the diagnose() function (new in Beautiful Soup 4.2.0). You'll get a report showing how the different parsers handle your document. See the code below.

from bs4.diagnose import diagnosewith open("test.html") as fp: data = fp.read()print(diagnose(data))
Enter fullscreen mode Exit fullscreen mode

Suppose this is our html file - closing tags are missing.

<div class="nav"> <ul id="navlist" style="height: 100px"> <li class="apify"><a href="https://apify.com">Apify.com</a></li> <li class="help"><a href="https://apify.com/help">Help</a></li>
Enter fullscreen mode Exit fullscreen mode

When we pass the file to the diagnose function, it will automatically add the closing tags. This is one use case. If you look at the output of diagnose(), you may be able to figure out the problem. If not, you can paste the output of diagnose()when you ask for help in any community platform like GitHub or Stack Overflow.

How to parse HTML in Python. Diagnose function

Since HTML tags and attributes are not case-sensitive, all three HTML parsers convert them to lowercase. That is, the markup is converted to . To preserve mixed-case or uppercase tags and attributes, you'll need to parse it as XML.

soup = BeautifulSoup(html_content, "xml")
Enter fullscreen mode Exit fullscreen mode

A common error that can occur is SyntaxError: Invalid syntax. It's caused by running the Python 2 version of Beautiful Soup under Python 3 without converting the code.

You can also get ImportError: No module named html.parser. This error is caused by running the Python 3 version of Beautiful Soup under Python 2.

You may also encounter this error: ImportError: No module named BeautifulSoup. This can arise by writing Beautiful Soup 4 code without knowing that the package name has changed to bs4.

PyQuery vs. Beautiful Soup

PyQuery and Beautiful Soup are both great Python libraries for working with HTML and XML documents. Weve looked at these libraries in depth, so let's compare them.

PyQuery uses LXML internally for fast XML and HTML manipulation. LXML is written in C, which makes it fast and lightweight. However, the speed difference is negligible unless you are working with very large documents.

This GitHub gist code will help you test the response times of Beautiful Soup and PyQuery, as well as other similar libraries. I ran this code, and the output was shocking:

==== Total trials: 100000 =====bs4 total time: 18.1pq total time: 2.2lxml (cssselect) total time: 2.2lxml (xpath) total time: 1.7
Enter fullscreen mode Exit fullscreen mode

As you can see from the above result, there's a huge difference in speed between Beautiful Soup and PyQuery.

Beautiful Soup is great for parsing malformed HTML. It can handle common mistakes and inconsistencies in HTML markup. Beautiful Soup has more built-in functions than PyQuery. Thats why web scraping developers love it. PyQuery is fast and efficient if you have perfectly formatted HTML documents, but it doesn't work well with poorly formatted documents.

Both PyQuery and Beautiful Soup have active community support. However, Beautiful Soup has a more active community and more learners. There are around 32,276 questions tagged with Beautiful Soup on Stack Overflow, while there are very few questions tagged with PyQuery.

Have a look at the interest over time of Beautiful Soup and PyQuery:

Beautiful Soup and PyQuery trends

Beautiful Soup and PyQuery trends

Here's a summary of the key differences between the PyQuery and Beautiful Soup libraries:

Feature Beautiful Soup PyQuery
Syntax Similar to ElementTree Similar to jQuery
Performance Slower Faster
Featueres More Less
Multiple Parser Support Yes (lxml, html5lib, html.parser) No (only lxml by default)
Community Support More Active Less Active
When to choose When you need a library with a wide range of features When you need a fast library for parsing HTML and XML

Finishing up

You've gained some valuable insights into how to parse HTML in Python. Youve also received some troubleshooting advice and a helpful comparison between both libraries. Either can be a good choice for working with HTML and XML files in Python.

Now that youve read this blog post, here are a few things you might like to do next:

  • Learn more about PyQuery and Beautiful Soup from their official documentation.

  • Use both libraries to build real-world applications, such as a web scraper that can analyze and extract data from websites.

  • Create a scraper on the Apify platform using a Python code template.

Top comments (0)