DEV Community

A0mineTV

Extracting Text from HTML Content in Python: A Simple Solution with `HTMLParser`

Introduction

When working with HTML data, you often need to clean up the tags and retain only the plain text. Whether it's for data analysis, automation, or simply making content readable, this task is common for developers.

In this article, I'll show you how to create a simple Python class that extracts plain text from HTML using HTMLParser, a class from Python's built-in html.parser module.


Why Use HTMLParser?

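HTMLParser is a class in the html.parser module of Python's standard library. It is an event-driven parser: you subclass it and override handler methods, which the parser calls as it encounters start tags, end tags, and text. Unlike external libraries such as BeautifulSoup, it requires no installation, which makes it a good fit for simple tasks like stripping HTML tags.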
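To make the event-driven model concrete, here is a minimal sketch that records each callback as the parser fires it. The EventLogger class and its events list are names made up for this illustration; only HTMLParser and its handler methods come from the standard library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each parser callback in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed("<p>Hello <b>world</b></p>")
logger.close()  # signal end of input so any buffered text is flushed

for kind, value in logger.events:
    print(kind, repr(value))
```

The text extractor in this article only needs the handle_data events; the tag events become useful once you want to react to specific elements.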


The Solution: A Simple Python Class

Step 1: Create the HTMLTextExtractor Class

from html.parser import HTMLParser


class HTMLTextExtractor(HTMLParser):
    """Extract plain text from HTML content."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        # Called for each run of text between tags. Skip fragments that
        # contain only whitespace (indentation and newlines between tags).
        fragment = data.strip()
        if fragment:
            self.text.append(fragment)

    def get_text(self):
        # Join with single spaces so text from adjacent tags does not run together.
        return ' '.join(self.text)

This class does three main things:

  1. Initializes a list, self.text, to store the extracted text.
  2. Overrides handle_data, which the parser calls for every run of text between tags, and keeps each non-empty fragment with its surrounding whitespace stripped.
  3. Joins the fragments with single spaces in the get_text method.

Joining with spaces is a simplification: when text follows an inline tag directly, for example a period right after </strong>, a space is inserted before it.

Step 2: Use the Class to Extract Text

Here's how you can use the class to clean up HTML:

raw_description = """
<div>
    <h1>Welcome to our website!</h1>
    <p>We offer <strong>exceptional services</strong> for our customers.</p>
    <p>Contact us at: <a href="mailto:contact@example.com">contact@example.com</a></p>
</div>
"""

extractor = HTMLTextExtractor()
extractor.feed(raw_description)
extractor.close()  # flush any text the parser is still buffering
description = extractor.get_text()

print(description)

Output:

Welcome to our website! We offer exceptional services for our customers. Contact us at: contact@example.com

Adding Support for Attributes

If you want to capture additional information, such as the link targets of <a> tags, here's an enhanced version of the class. It remembers the href when the tag opens and appends it when the tag closes, so the link appears after the link text:

class HTMLTextExtractor(HTMLParser):
    """Extract plain text and link targets from HTML content."""

    def __init__(self):
        super().__init__()
        self.text = []
        self._href = None  # href of the <a> tag currently open, if any

    def handle_data(self, data):
        fragment = data.strip()
        if fragment:
            self.text.append(fragment)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == 'a':
            self._href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        # Emit the link target once the link text has been collected.
        if tag == 'a' and self._href:
            self.text.append(f"(link: {self._href})")
            self._href = None

    def get_text(self):
        return ' '.join(self.text)

Enhanced Output:

Welcome to our website! We offer exceptional services for our customers. Contact us at: contact@example.com (link: mailto:contact@example.com)

## Use Cases

- **SEO**: Clean HTML tags to analyze the plain text content of a webpage.
- **Emails**: Transform HTML emails into plain text for basic email clients.
- **Scraping**: Extract important data from web pages for analysis or storage.
- **Automated Reports**: Simplify API responses containing HTML into readable text.
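
For the email use case, readers usually expect paragraphs and line breaks to survive the conversion. The following is a rough sketch of one way to do that, not a complete HTML-to-text converter: it starts a new output line at block-level tags. The class name LineAwareExtractor, the helper html_to_text, and the BLOCK_TAGS set are all assumptions chosen for this example.

```python
from html.parser import HTMLParser

# Assumption: an illustrative subset of block-level tags, not a complete list.
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "li", "tr"}


class LineAwareExtractor(HTMLParser):
    """Hypothetical extractor that keeps line breaks at block-level tags."""

    def __init__(self):
        super().__init__()
        self.lines = [[]]  # each inner list holds the fragments of one line

    def _new_line(self):
        # Start a new line only if the current one already has content.
        if self.lines[-1]:
            self.lines.append([])

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._new_line()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._new_line()

    def handle_data(self, data):
        fragment = " ".join(data.split())  # collapse runs of whitespace
        if fragment:
            self.lines[-1].append(fragment)

    def get_text(self):
        return "\n".join(" ".join(line) for line in self.lines if line)


def html_to_text(html):
    parser = LineAwareExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


email_html = "<h1>Order shipped</h1><p>Hi Ana,<br>your order is <b>on its way</b> today.</p>"
print(html_to_text(email_html))
```

Inline tags such as b do not start a new line, so emphasized words stay inside their sentence.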

Advantages of This Approach

  • Lightweight: No need for external libraries; it's built on Python's native HTMLParser.
  • Ease of Use: Encapsulates the logic in a simple and reusable class.
  • Customizable: Easily extend the functionality to capture specific information like attributes or additional tag data.
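
As an illustration of the customization point, here is a small sketch that collects links as (text, href) pairs instead of mixing them into the text. LinkCollector and its attribute names are hypothetical; the handler methods it overrides are the standard HTMLParser ones.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers (link text, href) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (text, href) pairs
        self._href = None  # href of the <a> tag currently open, if any
        self._parts = []   # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize whitespace inside the link text.
            text = " ".join("".join(self._parts).split())
            self.links.append((text, self._href))
            self._href = None


collector = LinkCollector()
collector.feed(
    '<p>See the <a href="/docs">full <em>docs</em></a> or '
    '<a href="mailto:contact@example.com">email us</a>.</p>'
)
collector.close()
print(collector.links)
```

Anchors without an href are ignored in this sketch, since there is no target to record.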

## Limitations and Alternatives

While `HTMLParser` is simple and efficient, it has some limitations:

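- **Complex HTML**: It tolerates malformed markup but does not repair it or build a document tree, so unclosed or misnested tags reach your handlers exactly as written and you have to track document structure yourself.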
- **Limited Features**: It doesn't provide advanced parsing features like CSS selectors or DOM tree manipulation.
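
One concrete case where a basic extractor falls short on real pages: handle_data is also called for the contents of script and style elements, so JavaScript and CSS end up in the extracted text. A possible workaround, sketched below under the assumption that skipping these two tags is enough for your input, is to track whether the parser is currently inside one of them. VisibleTextExtractor and SKIPPED_TAGS are names invented for this example.

```python
from html.parser import HTMLParser

# Assumption: these are the only elements whose content should be dropped.
SKIPPED_TAGS = {"script", "style"}


class VisibleTextExtractor(HTMLParser):
    """Hypothetical extractor that ignores text inside <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.text = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        fragment = data.strip()
        if fragment and self._skip_depth == 0:
            self.text.append(fragment)

    def get_text(self):
        return " ".join(self.text)


page = (
    "<style>p { color: red; }</style>"
    "<p>Hello</p>"
    "<script>var x = 1;</script>"
    "<p>world</p>"
)
parser = VisibleTextExtractor()
parser.feed(page)
parser.close()
print(parser.get_text())
```

The same depth-counter idea can be reused for any other element you want to exclude.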

### Alternatives

If you need more robust features, consider using these libraries:

- **BeautifulSoup**: Excellent for complex HTML parsing and manipulation.
- **lxml**: Known for its speed and support for both XML and HTML parsing.

Conclusion

With this solution, you can extract plain text from HTML in just a few lines of code. Whether you're working on a personal project or a professional task, this approach is well suited to lightweight HTML cleaning and analysis.

If your use case involves more complex or malformed HTML, consider using libraries like BeautifulSoup or lxml for enhanced functionality.

Feel free to try this code in your projects and share your experiences. Happy coding! 🚀
