
HTML Parser - Extract HTML information with ease

Sm0ke ・5 min read

Hello Coder,

This article presents a few practical code snippets that extract and process HTML information using an HTML parser written in Python on top of the BS4 (BeautifulSoup) library. The following topics will be covered:

  • Load the HTML
  • Scan the file for assets: images, JavaScript files, CSS files
  • Change the path of an existing asset
  • Update existing elements: change the src attribute of an image
  • Locate an element based on the id
  • Remove an element from the DOM tree
  • Process an existing component: remove hardcoded text
  • Save the processed HTML to a file

Want to learn more about this topic? Access the AppSeed platform for more articles related to HTML parsing and developer tools. Thank you!


What is an HTML Parser

According to Wikipedia, parsing or syntactic analysis is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. In this article, HTML parsing means loading the HTML, extracting and processing the relevant information (head title, page assets, main sections) and, later on, saving the processed file.


Parser Environment

The code uses BeautifulSoup, the well-known HTML parsing library for Python. To start coding, we need a few modules installed on our system.

$ pip install ipython # the console where we execute the code
$ pip install requests # a library to pull the entire HTML page
$ pip install beautifulsoup4 # the real magic is here (imported as bs4)

Load the HTML content

The file is loaded like any other text file, and its content is passed to a BeautifulSoup object.

from bs4 import BeautifulSoup as bs

# Load the HTML content; the with block closes the file for us
with open('index.html', 'r') as html_file:
    html_content = html_file.read()

# Initialize the BS object
soup = bs(html_content, 'html.parser')
# At this point, we can interact with the HTML
# elements stored in memory using all helpers offered by BS library
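
The loading step can also be wrapped in a small helper. The sketch below is only an illustration: `load_soup` is a hypothetical name, the file is assumed to be UTF-8, and the demo writes a throwaway page to a temporary directory as a stand-in for a real index.html.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def load_soup(path):
    """Read an HTML file (assumed UTF-8) and return a BeautifulSoup object."""
    html_content = Path(path).read_text(encoding="utf-8")
    return BeautifulSoup(html_content, "html.parser")


# Demo: create a tiny page in a temporary directory, then load it
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "index.html"
    demo_file.write_text(
        "<html><head><title>Demo</title></head><body></body></html>",
        encoding="utf-8",
    )
    soup = load_soup(demo_file)

print(soup.title.string)
```

Once the soup is built, the file on disk is no longer needed; everything that follows works on the in-memory tree.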

Parse the HTML for assets

At this point, we have the DOM tree loaded in the BeautifulSoup object. Let's scan the DOM tree for Javascript files, the script nodes:

...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...

The code snippet that locates the Javascript files has only a few lines. The BS library returns a list of objects, and we can read or mutate each script node with ease. With recursive=False, only the script nodes that are direct children of body are matched:

for script in soup.body.find_all('script', recursive=False):

   # Print the src attribute
   print(' JS source = ' + script['src'])

   # Print the type attribute
   print(' JS type = ' + script['type'])   
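
Real pages often contain inline script nodes without a src attribute, and script['src'] raises a KeyError on those. A minimal sketch of a more defensive scan, using made-up sample markup:

```python
from bs4 import BeautifulSoup

# Sample markup for illustration; the last script is inline and has no src
html = """
<body>
  <script type="text/javascript" src="js/bootstrap.js"></script>
  <script type="text/javascript" src="js/custom.js"></script>
  <script>console.log("inline");</script>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# src=True keeps only the <script> tags that carry a src attribute
js_files = [script["src"] for script in soup.find_all("script", src=True)]

# .get() returns None instead of raising KeyError when the attribute is missing
js_types = [script.get("type") for script in soup.find_all("script")]

print(js_files)
print(js_types)
```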

In a similar way, we can select and process the CSS nodes:

...
<link rel="stylesheet" href="css/bootstrap.min.css">
<link rel="stylesheet" href="css/app.css">
...

And the code:

for link in soup.find_all('link'):

   # Print the href attribute
   print(' CSS file = ' + link['href'])
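
Not every link node is a stylesheet (favicons use the same tag). The sketch below, with made-up sample markup, keeps only the links whose rel attribute contains "stylesheet". It relies on BeautifulSoup returning rel as a list of values when an HTML parser is used.

```python
from bs4 import BeautifulSoup

# Sample markup for illustration; the favicon link should be skipped
html = """
<head>
  <link rel="icon" href="favicon.ico">
  <link rel="stylesheet" href="css/bootstrap.min.css">
  <link rel="stylesheet" href="css/app.css">
</head>
"""
soup = BeautifulSoup(html, "html.parser")

css_files = [
    link["href"]
    for link in soup.find_all("link", href=True)
    if "stylesheet" in link.get("rel", [])  # rel is a list such as ['stylesheet']
]

print(css_files)
```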


Parse the HTML for images

In this code snippet, we will mutate the node and change the src attribute of the image node:

...
<img src="images/pic01.jpg" alt="Brad Pitt">
...

And the code:

for img in soup.body.find_all('img'):

   # Print the path
   print(' IMG src = ' + img['src'])

   img_path = img['src']
   img_file = img_path.split('/')[-1] # extract the last segment, aka image file
   img['src'] = '/assets/img/' + img_file
   # the new path is set
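
The same idea can be packaged as a reusable function. In this sketch, `rebase_images` and its `new_dir` parameter are hypothetical names chosen for illustration, and images without a src attribute are left untouched.

```python
from bs4 import BeautifulSoup


def rebase_images(soup, new_dir="/assets/img/"):
    """Point every <img src=...> at new_dir, keeping only the file name."""
    for img in soup.find_all("img", src=True):
        file_name = img["src"].split("/")[-1]  # last path segment
        img["src"] = new_dir + file_name
    return soup


soup = BeautifulSoup(
    '<body><img src="images/pic01.jpg" alt="Sample"><img alt="no source"></body>',
    "html.parser",
)
rebase_images(soup)

new_paths = [img.get("src") for img in soup.find_all("img")]
print(new_paths)
```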

Locate an element based on the ID

This can be achieved with a single line of code. Let's imagine that we have an element (div or span) with the id 1234:

...
<div id="1234" class="handsome">
Some text
</div>

And the code:

mydiv = soup.find("div", {"id": "1234"})

print(mydiv) 

# delete the element
mydiv.decompose()
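
Note that find() returns None when nothing matches, so calling decompose() on the result can fail on pages that lack the element. A minimal sketch with a guard, using sample markup made up for the example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body><div id="1234" class="handsome">Some text</div><p>Keep me</p></body>',
    "html.parser",
)

mydiv = soup.find(id="1234")              # matches any tag with this id
missing = soup.find(id="does-not-exist")  # no match, so this is None

# Guard before deleting: decompose() removes the node and its children
if mydiv is not None:
    mydiv.decompose()

print(soup)
```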


Remove the hard-coded text

This code snippet is useful for components extraction and translation to different template engines. Let's imagine that we have this simple component:

<div id="1234" class="cool">
   <span>Html Parsing</span>
   <span>the practical guide</span> 
</div>

If we want to use this component in Php, the component becomes:

<div id="1234" class="cool">
   <span><?php echo $title ?></span>
   <span><?php echo $info ?></span> 
</div>

Or for Jinja2 (a Python template engine):

<div id="1234" class="cool">
   <span>{{ title }}</span>
   <span>{{ info }}</span> 
</div>

To avoid the manual work, we can use a code snippet that automatically replaces the hard-coded texts and prepares the component for a specific template engine:

from bs4 import NavigableString

# locate the div
mydiv = soup.find("div", {"id": "1234"})

print(mydiv) # print before processing

# iterate on div elements (list() takes a snapshot, since we edit the tree)
for tag in list(mydiv.descendants):

   # NavigableString is the text inside the tag,
   # not the tag itself
   if not isinstance(tag, NavigableString):

      print( 'Found tag = ' + tag.name + ' -> ' + tag.text )
      # this will print:
      # Found tag = span -> Html Parsing
      # Found tag = span -> the practical guide

      # replace the text for Php
      # (.text is read-only, so we assign to .string)
      tag.string = '<?php echo $title ?>'

      # or replace the text for Jinja
      # tag.string = '{{ title }}'
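
To give each span its own variable, as in the target templates shown earlier, the spans can be paired with a list of names. This is a sketch only: the `variables` list is a hypothetical mapping, applied in document order.

```python
from bs4 import BeautifulSoup

html = """
<div id="1234" class="cool">
   <span>Html Parsing</span>
   <span>the practical guide</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
mydiv = soup.find("div", {"id": "1234"})

# Hypothetical variable names, one per <span>, in document order
variables = ["title", "info"]

for span, name in zip(mydiv.find_all("span"), variables):
    # Assigning to .string replaces the text inside the tag
    span.string = "{{ " + name + " }}"  # Jinja2 placeholder

placeholders = [span.string for span in mydiv.find_all("span")]
print(placeholders)
```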

To use the component, we can save it to a file:


# mydiv is the processed component
# php_component is the string representation
# formatter=None writes the text as-is, so '<?php ... ?>' is not HTML-escaped
php_component = mydiv.prettify(formatter=None)

with open('component.php', 'w') as file:
    file.write(php_component)

At this point, the original div is extracted from the DOM, with hard-coded texts removed, and ready to be used in a Php or Python project.
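
One detail worth checking when the replacement text contains < or > (as the Php tags do): the output formatter decides whether those characters are escaped. A minimal sketch comparing two formatter values on a made-up one-span component:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><span>Html Parsing</span></div>", "html.parser")
soup.span.string = "<?php echo $title ?>"

# formatter="html" converts '<' and '>' inside text into HTML entities
escaped = soup.div.prettify(formatter="html")

# formatter=None leaves text exactly as it was assigned
raw = soup.div.prettify(formatter=None)

print(escaped)
print(raw)
```

Jinja placeholders such as {{ title }} contain no special HTML characters, so they come out the same with either setting.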


Save the new HTML

Now we have the mutated DOM in a BeautifulSoup object, in memory. To save it, we call prettify() and write the returned string to a new HTML file.


new_dom_content = soup.prettify(formatter="html")

with open('index_parsed.html', 'w') as file:
    file.write(new_dom_content)
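
A quick way to confirm that the saved file still carries the changes is to parse it again. The sketch below uses a hypothetical `save_soup` helper and a temporary directory instead of the real project folder.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def save_soup(soup, path):
    """Write the prettified DOM to path as UTF-8 text."""
    Path(path).write_text(soup.prettify(formatter="html"), encoding="utf-8")


soup = BeautifulSoup(
    '<html><body><img src="/assets/img/pic01.jpg"></body></html>',
    "html.parser",
)

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "index_parsed.html"
    save_soup(soup, out_file)
    # Parse the saved file again to check that the change survived
    reloaded = BeautifulSoup(out_file.read_text(encoding="utf-8"), "html.parser")

print(reloaded.img["src"])
```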


HTML Parser - Use Cases

I'm using HTML parsing quite a lot, especially for tasks where manual work is involved:

  • process HTML themes to be used in a new project
  • extract hard-coded texts and extract components
  • translate flat HTML themes to Jinja, Mustache or PUG templates

From time to time, I'm publishing free samples in this public repository.


Thank You!


Discussion


Wow, BeautifulSoup makes that super easy! Do you ever find edge cases where it doesn't work well at all? Or does it manage to handle most sites that you've tried? Thanks!

 

Hello @chris ,
Based on my experience, BS was failing when I didn't respect the syntax or something similar. I remember a dummy case where I initialized the BS object using the lxml parser and the saved HTML always had a closing tag:

Sample: <meta ...></meta>
It was my fault all the way :). Now I'm using html.parser to construct the BS objects.
Thank you for your interest.