DEV Community

Sm0ke


HTML Parser - Extract HTML information with ease

Hello Coder,

This article presents a few practical code snippets for extracting and processing HTML information with an HTML parser written in Python, using the BeautifulSoup (BS4) library. The following topics are covered:

  • ✅ Load the HTML
  • ✅ Scan the file for assets: images, JavaScript files, CSS files
  • ✅ Change the path of an existing asset
  • ✅ Update existing elements: change the src attribute of an image
  • ✅ Locate an element based on the id
  • ✅ Remove an element from the DOM tree
  • ✅ Process an existing component: remove hardcoded text
  • ✅ Save the processed HTML to a file

What is an HTML Parser

According to Wikipedia, parsing or syntactic analysis is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. In this article, HTML parsing means loading the HTML, extracting and processing the relevant information (head title, page assets, main sections) and, later on, saving the processed file.


Parser Environment

The code uses BeautifulSoup, a well-known HTML parsing library written in Python. To start coding, we need a few packages installed on our system.

$ pip install ipython # the console where we execute the code
$ pip install requests # a library to pull the entire HTML page
$ pip install beautifulsoup4 # the real magic is here (imported as bs4)

Load the HTML content

The file is loaded like any other file, and its content is passed to a BeautifulSoup object:

from bs4 import BeautifulSoup as bs

# Load the HTML content; the with-block closes the file automatically
with open('index.html', 'r') as html_file:
    html_content = html_file.read()

# Initialize the BS object
soup = bs(html_content, 'html.parser')
# At this point, we can interact with the HTML
# elements stored in memory using all helpers offered by the BS library
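As a quick check that the load worked, the soup object can be queried right away. The sketch below parses a small inline HTML string (invented here as a stand-in for index.html) and reads the page title and the first heading:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the content of index.html
sample_html = """
<html>
  <head><title>My Page</title></head>
  <body><h1>Welcome</h1><p>Hello</p></body>
</html>
"""

soup_demo = BeautifulSoup(sample_html, 'html.parser')

# soup.title is a shortcut for the first <title> tag
page_title = soup_demo.title.string

# find() returns the first matching tag, or None when nothing matches
first_heading = soup_demo.find('h1').get_text()

print(page_title)
print(first_heading)
```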

Parse the HTML for assets

At this point, we have the DOM tree loaded in the BeautifulSoup object. Let's scan the DOM tree for JavaScript files, i.e. the script nodes:

...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...

The code snippet that locates the JavaScript files has only a few lines. find_all returns a list-like collection of tag objects, and we can read or mutate each script node with ease:

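# recursive=False looks only at the direct children of <body>;
# src=True skips inline scripts, which have no src attribute
for script in soup.body.find_all('script', src=True, recursive=False):

   # Print the src attribute
   print(' JS source = ' + script['src'])

   # Print the type attribute (optional in HTML5, so default to '')
   print(' JS type = ' + script.get('type', ''))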
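Scripts can also live in the head or inside nested elements, and inline scripts carry no src attribute at all. A minimal sketch of a whole-document scan, using an invented HTML sample and the attribute filter of find_all:

```python
from bs4 import BeautifulSoup

# Invented sample: one script in <head>, one inline, one nested in a div
sample_html = """
<html><head><script src="js/head.js"></script></head>
<body>
  <script type="text/javascript" src="js/bootstrap.js"></script>
  <script>console.log('inline');</script>
  <div><script src="js/custom.js"></script></div>
</body></html>
"""

soup_demo = BeautifulSoup(sample_html, 'html.parser')

# src=True keeps only <script> tags that carry a src attribute;
# find_all searches the whole tree by default (recursive=True)
js_files = [s['src'] for s in soup_demo.find_all('script', src=True)]

print(js_files)
```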

In a similar way, we can select and process the CSS nodes:

...
<link rel="stylesheet" href="css/bootstrap.min.css">
<link rel="stylesheet" href="css/app.css">
...

And the code:

for link in soup.find_all('link'):

   # Print the href attribute
   print(' CSS file = ' + link['href'])
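Not every link tag points to a stylesheet (icons use the same tag), so it can help to check the rel attribute as well. The sketch below uses an invented sample; note that BeautifulSoup's HTML parsers return rel as a list of values rather than a plain string:

```python
from bs4 import BeautifulSoup

# Invented sample with one non-stylesheet link
sample_html = """
<head>
  <link rel="icon" href="favicon.ico">
  <link rel="stylesheet" href="css/bootstrap.min.css">
  <link rel="stylesheet" href="css/app.css">
</head>
"""

soup_demo = BeautifulSoup(sample_html, 'html.parser')

css_files = [
    link['href']
    for link in soup_demo.find_all('link', href=True)
    # rel is parsed as a list, e.g. ['stylesheet']
    if 'stylesheet' in link.get('rel', [])
]

print(css_files)
```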

Parse the HTML for images

In this code snippet, we mutate the node and change the src attribute of each image node:

...
<img src="images/pic01.jpg" alt="Bred Pitt">
...

And the code:

for img in soup.body.find_all('img'):

   # Print the path
   print(' IMG src = ' + img['src'])

   img_path = img['src']
   img_file = img_path.split('/')[-1] # extract the last segment, aka the image file
   img['src'] = '/assets/img/' + img_file
   # the new path is set
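The same idea generalizes to other asset types. The helper below is a hypothetical sketch (the function name and its parameters are made up for this article, not part of BeautifulSoup) that points an attribute at a new directory while keeping the file name:

```python
import posixpath
from bs4 import BeautifulSoup

def rewrite_asset_paths(soup, tag_name, attr, new_dir):
    """Point every tag_name[attr] at new_dir, keeping the file name.

    Hypothetical helper for illustration; returns the number of tags changed.
    """
    changed = 0
    for tag in soup.find_all(tag_name, attrs={attr: True}):
        file_name = posixpath.basename(tag[attr])
        if file_name:  # skip values that end with '/'
            tag[attr] = posixpath.join(new_dir, file_name)
            changed += 1
    return changed

# Invented sample markup
sample_html = '<body><img src="images/pic01.jpg"><img src="images/team/pic02.png"></body>'
soup_demo = BeautifulSoup(sample_html, 'html.parser')

count = rewrite_asset_paths(soup_demo, 'img', 'src', '/assets/img')
new_paths = [img['src'] for img in soup_demo.find_all('img')]
print(count, new_paths)
```

The same call with 'script' and 'src', or 'link' and 'href', would cover JavaScript and CSS files.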

Locate an element based on the ID

This can be achieved by a single line of code. Let's imagine that we have an element (div or span) with the id 1234:

...
<div id="1234" class="handsome">
Some text
</div>


And the code:

mydiv = soup.find("div", {"id": "1234"})

print(mydiv) 

# delete the element
mydiv.decompose()

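Two details are worth keeping in mind here: find returns None when nothing matches, and decompose destroys the element. When the element is still needed after removal, extract detaches it from the tree and returns it instead. A small sketch with an invented sample:

```python
from bs4 import BeautifulSoup

sample_html = '<body><div id="1234" class="handsome">Some text</div><p>Keep me</p></body>'
soup_demo = BeautifulSoup(sample_html, 'html.parser')

# id can also be passed as a keyword argument
mydiv = soup_demo.find(id='1234')

removed = None
if mydiv is not None:          # guard against a missing element
    removed = mydiv.extract()  # detach from the tree but keep the object

# A lookup that matches nothing returns None instead of raising
missing = soup_demo.find(id='does-not-exist')

print(removed)
print(soup_demo)
```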

Remove the hard-coded text

This code snippet is useful for extracting components and translating them to different template engines. Let's imagine that we have this simple component:

<div id="1234" class="cool">
   <span>Html Parsing</span>
   <span>the practical guide</span> 
</div>

If we want to use this component in PHP, the component becomes:

<div id="1234" class="cool">
   <span><?php echo $title ?></span>
   <span><?php echo $info ?></span> 
</div>

Or, for Jinja2 (a Python template engine):

<div id="1234" class="cool">
   <span>{{ title }}</span>
   <span>{{ info }}</span> 
</div>

To avoid the manual work, we can use a code snippet that automatically replaces the hardcoded text and prepares the component for a specific template engine:

from bs4 import NavigableString

# locate the div
mydiv = soup.find("div", {"id": "1234"})

print(mydiv) # print before processing

# iterate over a snapshot of the div's descendants,
# because the loop modifies the tree
for tag in list(mydiv.descendants):

   # NavigableString is the text inside the tag,
   # not the tag itself
   if not isinstance(tag, NavigableString):

      print( 'Found tag = ' + tag.name + ' -> ' + tag.text )
      # this will print:
      # Found tag = span -> Html Parsing
      # Found tag = span -> the practical guide

      # .text is read-only; assign to .string to replace the text.
      # Use ONE of the two lines below (every span receives
      # the same placeholder here).

      # replace the text for PHP
      tag.string = '<?php echo $title ?>'

      # replace the text for Jinja
      # tag.string = '{{ title }}'
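The target components shown earlier use a different variable per span (title, then info). One way to get there, sketched below under the assumption that placeholders are assigned in document order, is to pair each span with a name from a list. The list of names and the sample markup are illustrative only. The sketch also shows why the output formatter matters for PHP: the default formatter escapes the angle brackets, while formatter=None writes the strings as stored.

```python
from bs4 import BeautifulSoup

sample_html = """<div id="1234" class="cool">
   <span>Html Parsing</span>
   <span>the practical guide</span>
</div>"""

soup_demo = BeautifulSoup(sample_html, 'html.parser')
component = soup_demo.find('div', {'id': '1234'})

# Assumed placeholder names, one per <span>, in document order
names = ['title', 'info']

for span, name in zip(component.find_all('span'), names):
    # Assigning to .string replaces the tag's text content
    span.string = '<?php echo $' + name + ' ?>'

# Default output escapes '<' and '>' inside text nodes
escaped_component = str(component)

# formatter=None writes the strings exactly as stored (no entity escaping)
php_component = component.decode(formatter=None)

print(php_component)
```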

To use the component, we can save the component to a file:


# mydiv is the processed component
# php_component is the string representation;
# formatter=None keeps '<?php ... ?>' from being escaped to '&lt;?php ... ?&gt;'
php_component = mydiv.prettify(formatter=None)

with open('component.php', 'w') as file:
    file.write(php_component)

At this point, the div has been saved as a standalone component, with the hard-coded text removed, and is ready to be used in a PHP or Python project.


Save the new HTML

Now we have the mutated DOM in a BeautifulSoup object, in memory. To save it, we call prettify() and write the result to a new HTML file.


new_dom_content = soup.prettify(formatter="html")

# the with-block closes the file automatically
with open('index_parsed.html', 'w') as file:
    file.write(new_dom_content)
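When the page contains non-ASCII characters, it may be worth passing an explicit encoding when writing the file. A sketch, writing to a temporary directory so that nothing in the project is overwritten (the sample markup is invented):

```python
import os
import tempfile
from bs4 import BeautifulSoup

soup_demo = BeautifulSoup('<p>Café</p>', 'html.parser')

out_path = os.path.join(tempfile.mkdtemp(), 'index_parsed.html')

# str(soup) keeps the original whitespace; prettify() re-indents the markup
with open(out_path, 'w', encoding='utf-8') as out_file:
    out_file.write(str(soup_demo))

# Read the file back to confirm the content survived the round trip
with open(out_path, 'r', encoding='utf-8') as in_file:
    saved = in_file.read()

print(saved)
```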

HTML Parser - Use Cases

I use HTML parsing quite a lot, especially for tasks where manual work is involved:

  • process HTML themes to be used in a new project
  • extract hard-coded texts and extract components
  • translate flat HTML themes to Jinja, Mustache or PUG templates

From time to time, I publish free samples in this public repository.

Thank you! Btw, my (nick) name is Sm0ke and I'm also pretty active on Twitter.

Latest comments (4)

me-suzy

Hello, does anyone have a solution for how to parse multiple content of HTML pages to other HTML pages with the same link address?

please see this link:

stackoverflow.com/questions/661012...

Chris Achard

Wow, BeautifulSoup makes that super easy! Do you ever find edge cases where it doesn't work well at all? Or does it manage to handle most sites that you've tried? Thanks!

Sm0ke • Edited

Hello @chris ,
Based on my experience, BS was failing when I didn't respect the syntax or something similar. I remember a dummy case when I initialized the BS object using the lxml parser and the saved HTML always had a closing tag:

Sample: <meta ...></meta>
It was my fault all the way :). Now I'm using html.parser to construct the BS objects.
Thank you for your interest.

Chris Achard

Ah, makes sense. Thanks!