## Introduction
When working with HTML data, you often need to clean up the tags and retain only the plain text. Whether it's for data analysis, automation, or simply making content readable, this task is common for developers.
In this article, I'll show you how to create a simple Python class to extract plain text from HTML using `HTMLParser`, a class from Python's built-in `html.parser` module.
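## Why Use HTMLParser?
`HTMLParser` is a class in Python's standard-library `html.parser` module. It reads HTML and calls handler methods as it encounters start tags, end tags, and text, so you decide what to keep by overriding those methods. Unlike external libraries like BeautifulSoup, it needs no installation, which makes it a good fit for simple tasks like HTML tag cleaning.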
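To make the handler-method idea concrete, here is a small sketch that only records which callbacks fire and in what order. The `EventLogger` class and its `events` list are names I made up for illustration; `handle_starttag`, `handle_endtag`, and `handle_data` are the real `HTMLParser` hooks.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records each parser callback as it fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<p>Hello <b>world</b></p>")
logger.close()

for event in logger.events:
    print(event)
# ('start', 'p')
# ('data', 'Hello ')
# ('start', 'b')
# ('data', 'world')
# ('end', 'b')
# ('end', 'p')
```

Extracting plain text then comes down to keeping the `data` events and ignoring the rest.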
## The Solution: A Simple Python Class
### Step 1: Create the HTMLTextExtractor Class
```python
from html.parser import HTMLParser


class HTMLTextExtractor(HTMLParser):
    """Class for extracting plain text from HTML content."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        # Skip whitespace-only fragments such as the newlines between tags.
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)

    def get_text(self):
        # Join with a space so text from adjacent tags does not run together.
        return ' '.join(self.text)
```
This class does three main things:
- Initializes a list, `self.text`, to store extracted text.
- Uses the `handle_data` method to capture the text found between HTML tags, skipping fragments that contain only whitespace.
- Combines the text fragments, separated by single spaces, with the `get_text` method.
### Step 2: Use the Class to Extract Text
Here's how you can use the class to clean up HTML:
```python
raw_description = """
<div>
<h1>Welcome to our website!</h1>
<p>We offer <strong>exceptional services</strong> for our customers.</p>
<p>Contact us at: <a href="mailto:contact@example.com">contact@example.com</a></p>
</div>
"""
extractor = HTMLTextExtractor()
extractor.feed(raw_description)
description = extractor.get_text()
print(description)
```
Output:
```
Welcome to our website! We offer exceptional services for our customers. Contact us at: contact@example.com
```
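One detail worth knowing: a parser instance keeps accumulating text across calls to `feed()`, so reusing one extractor for several documents mixes their output. Below is a minimal sketch of a wrapper that creates a fresh parser per document. The names `_TextExtractor` and `html_to_text` are hypothetical, and the compact class is a stand-in in the spirit of the one above. It also shows that character references such as `&amp;` arrive already decoded, because `HTMLParser` defaults to `convert_charrefs=True`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Compact extractor in the spirit of the class above."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)


def html_to_text(html):
    """Hypothetical helper: parse one document with a fresh parser."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still held in the parser's buffer
    return " ".join(parser.text)


# Each call starts from an empty list, and entities are decoded for us.
print(html_to_text("<p>Fish &amp; chips</p>"))  # Fish & chips
print(html_to_text("<p>Tea &gt; coffee</p>"))   # Tea > coffee

# Reusing a single instance accumulates text from every feed() call.
shared = _TextExtractor()
shared.feed("<p>first</p>")
shared.feed("<p>second</p>")
print(shared.text)  # ['first', 'second']
```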
## Adding Support for Attributes
If you want to capture additional information, such as the link targets in `<a>` tags, here's an enhanced version of the class:
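```python
class HTMLTextExtractor(HTMLParser):
    """Class for extracting plain text and links from HTML content."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.href = None  # target of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        # Remember the link target; it is emitted once the link text is seen.
        if tag == 'a':
            self.href = dict(attrs).get('href')

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text.append(stripped)

    def handle_endtag(self, tag):
        # Appending here places the URL after the link text, not before it.
        if tag == 'a' and self.href:
            self.text.append(f"(link: {self.href})")
            self.href = None

    def get_text(self):
        return ' '.join(self.text)
```
Enhanced Output:
```
Welcome to our website! We offer exceptional services for our customers. Contact us at: contact@example.com (link: mailto:contact@example.com)
```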
## Use Cases
- **SEO**: Clean HTML tags to analyze the plain text content of a webpage.
- **Emails**: Transform HTML emails into plain text for basic email clients.
- **Scraping**: Extract important data from web pages for analysis or storage.
- **Automated Reports**: Simplify API responses containing HTML into readable text.
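For the email use case in particular, you usually want paragraphs to stay on separate lines instead of collapsing into one. Here is a rough sketch of one way to do that, assuming a hypothetical `BlockAwareExtractor` class and a hand-picked `BLOCK_TAGS` set that you would adjust to your own content:

```python
from html.parser import HTMLParser


class BlockAwareExtractor(HTMLParser):
    """Hypothetical variant that starts a new line at block-level tags."""

    # Assumption: a small hand-picked set, not an exhaustive list.
    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.lines = []    # finished lines
        self.current = []  # fragments of the line being built

    def _flush(self):
        if self.current:
            self.lines.append(" ".join(self.current))
            self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.current.append(stripped)

    def get_text(self):
        self._flush()
        return "\n".join(self.lines)


email_html = (
    "<h1>Invoice</h1>"
    "<p>Thanks for your <b>order</b></p>"
    "<p>Ships Monday<br>Arrives Friday</p>"
)
email_parser = BlockAwareExtractor()
email_parser.feed(email_html)
email_parser.close()
email_text = email_parser.get_text()
print(email_text)
# Invoice
# Thanks for your order
# Ships Monday
# Arrives Friday
```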
## Advantages of This Approach
- **Lightweight**: No need for external libraries; it's built on Python's native `HTMLParser`.
- **Ease of Use**: Encapsulates the logic in a simple and reusable class.
- **Customizable**: Easily extend the functionality to capture specific information like attributes or additional tag data.
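As one example of the customization point, `handle_data` also receives the contents of `<script>` and `<style>` elements, which you rarely want in plain text. The sketch below skips them with a simple counter. `VisibleTextExtractor`, `SKIPPED_TAGS`, and `_skip_depth` are names I chose for illustration.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Hypothetical variant that ignores <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if stripped and self._skip_depth == 0:
            self.text.append(stripped)

    def get_text(self):
        return " ".join(self.text)


page = (
    "<style>p { color: red; }</style>"
    "<p>Visible text</p>"
    "<script>alert('hi');</script>"
)
visible_parser = VisibleTextExtractor()
visible_parser.feed(page)
visible_parser.close()
visible_text = visible_parser.get_text()
print(visible_text)  # Visible text
```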
## Limitations and Alternatives
While `HTMLParser` is simple and efficient, it has some limitations:
- **Complex HTML**: It reports tags and text in document order without building a tree or repairing markup, so very complex or poorly formatted HTML documents can be awkward to handle.
- **Limited Features**: It doesn't provide advanced parsing features like CSS selectors or DOM tree manipulation.
### Alternatives
If you need more robust features, consider using these libraries:
- **BeautifulSoup**: Excellent for complex HTML parsing and manipulation.
- **lxml**: Known for its speed and support for both XML and HTML parsing.
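For comparison, here is roughly what the same tag stripping looks like with BeautifulSoup. This is a sketch that assumes the third-party `beautifulsoup4` package is installed (`pip install beautifulsoup4`). `bs4_text` is a hypothetical helper name, and the import guard is only there so the snippet still loads when the package is missing.

```python
try:
    from bs4 import BeautifulSoup  # third-party, not in the standard library
except ImportError:
    BeautifulSoup = None  # lets the sketch load without the package


def bs4_text(html):
    """Return plain text via BeautifulSoup, or None if it is not installed."""
    if BeautifulSoup is None:
        return None
    # "html.parser" tells BeautifulSoup to use Python's built-in parser.
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator=" ", strip=True)


bs4_result = bs4_text("<p>We offer <strong>exceptional services</strong></p>")
print(bs4_result)
```

The trade-off is an extra dependency in exchange for a document tree you can search and modify.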
## Conclusion
With this solution, you can easily extract plain text from HTML in just a few lines of code. Whether you're working on a personal project or a professional task, this approach is perfect for lightweight HTML cleaning and analysis.
If your use case involves more complex or malformed HTML, consider using libraries like BeautifulSoup or lxml for enhanced functionality.
Feel free to try this code in your projects and share your experiences. Happy coding! 🚀