DEV Community

Sm0ke


HTML Parser - Developer Tools

Hello Coders,

In this article, I will present a simple HTML parser that I use to integrate HTML themes much faster into legacy apps coded in different technologies. When a customer requests a new UI for an app, the manual processing can take some time, so I decided to automate part of the flow. Using the tool, I'm able to update the design in less than two hours for a simple website with two or three pages.

Thanks for reading! - Content provided by App Generator.


Main Feature

The tool converts flat HTML to production-ready components for different engines: PUG, Jinja2, Blade, Mustache, and core PHP.


Note: the tool is not open-source, but I will consider releasing a light version as an open-source project in the future. In the HTML Parser public repository, I will publish processed HTML themes converted to PUG, Jinja, and Blade to be used by anyone.


Technologies

The HTML parser tool is developed in Python 3 on top of the BeautifulSoup library, and runs as an interactive console. I was able to use the tool for real projects after three months of R&D work.


HTML Parser features

  • Normalize the HTML file to load the assets from standard directories ( /assets/ [ img, js, css ] ), making the integration with webpack-related tools much easier
  • Edit / traverse the HTML tree
  • Edit attributes like anchor HREF, span texts, remove elements, edit class names
  • Extract components for production use for various engines like PUG, Jinja2, Blade
  • Migrate legacy Bootstrap layouts to Bulma and Tailwind CSS frameworks
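The editing features listed above map onto a handful of BeautifulSoup calls. The snippet below is only a minimal sketch, not the code of the tool itself; the sample markup and the `edit_fragment` name are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration
SAMPLE = """
<div class="col-md-6 old">
  <a href="index.html">Home</a>
  <span>Old text</span>
  <p class="ad">Remove me</p>
</div>
"""

def edit_fragment(html):
    soup = BeautifulSoup(html, "html.parser")

    # Edit an attribute: point the anchor to a new location
    soup.a["href"] = "/home"

    # Edit the text of a span
    soup.span.string = "New text"

    # Remove an element (and its children) from the tree
    soup.find("p", class_="ad").decompose()

    # Edit class names: 'class' is multi-valued, so it is exposed as a list
    div = soup.div
    div["class"] = [c for c in div["class"] if c != "old"] + ["new"]

    return soup

edited = edit_fragment(SAMPLE)
print(edited.prettify())
```

Attribute access like `soup.a` returns the first matching tag only; for a real page, `find_all` with a filter is usually the safer choice.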

HTML Parser Implementation

To process the HTML tree, we first need to load the whole file. The BeautifulSoup constructor accepts the string to parse and the name of the desired parser, and loads the document into memory.

Load HTML in memory

from bs4 import BeautifulSoup as bs

# read_file is a small helper that returns the file content as a string
html_content = read_file('index.html')
soup = bs(html_content, 'html.parser')
# At this point, we can interact with the HTML
# elements stored in memory using all helpers offered by the BS library
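The `read_file` helper is not listed in this article. A minimal sketch, assuming it simply reads a UTF-8 text file and returns its content (the encoding and the temporary demo file are assumptions made for this example):

```python
import os
import tempfile

from bs4 import BeautifulSoup as bs

def read_file(path, encoding="utf-8"):
    """Return the content of a text file as a string (assumed behaviour)."""
    with open(path, encoding=encoding) as f:
        return f.read()

# Demo: write a tiny page to a temporary file, then load it into memory
demo_path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("<html><head><title>Demo</title></head><body></body></html>")

soup = bs(read_file(demo_path), "html.parser")
print(soup.title.string)
```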

The BeautifulSoup library supports more than one parser (e.g. lxml, xml, html5lib). The differences between them become clear on HTML documents that are not well-formed. For instance, lxml will add missing closing tags for all elements. For more information, please see the dedicated section of the documentation on this topic.
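A quick way to see the differences is to feed the same broken markup to each parser and compare the results. This is a small sketch; the `BROKEN` sample and the `parse_with` helper are made up for the demo. Since lxml and html5lib are optional packages, the code skips a parser that is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed: the <li> and <p> tags are never closed
BROKEN = "<ul><li>One<li>Two</ul><p>Unclosed paragraph"

def parse_with(parser):
    """Return the repaired markup, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(BROKEN, parser))
    except FeatureNotFound:
        return None

results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, markup in results.items():
    print(f"{name:12} -> {markup if markup is not None else 'not installed'}")
```

The output usually differs from one parser to another, so it is worth pinning a single parser for the whole project.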

Parse Head section

To select the whole HEAD node and interact with its elements, we need just a few lines of code:

header = soup.find('head')

# If we want to change the title

header.title.string.replace_with('My new title') 
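The one-liner above assumes the page already has a non-empty title; on a page without one, `header.title` is `None` and the call fails. A slightly more defensive sketch (the `set_title` name is hypothetical, not part of BeautifulSoup):

```python
from bs4 import BeautifulSoup

def set_title(soup, new_title):
    """Set the page title, creating the <title> tag if it is missing."""
    head = soup.find("head")
    if head is None:
        raise ValueError("document has no <head> element")
    if head.title is None:
        head.append(soup.new_tag("title"))
    # Assigning to .string replaces whatever the tag contained before
    head.title.string = new_title
    return soup

soup = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
set_title(soup, "My new title")
print(soup.head)
```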

Parse HTML for JS Scripts

JavaScript files are referenced in the HTML using script nodes:

...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...

To scan the HTML soup for script tags, we can use the find_all helper:

# src=True skips inline scripts, which have no 'src' attribute
for script in soup.body.find_all('script', src=True, recursive=False):

   # Print the path
   print(' JS source = ' + script['src'])

   # Update (normalize) the path
   js_path = script['src']
   js_file = js_path.split('/')[-1] # select the last segment
   script['src'] = '/assets/js/' + js_file

Parse HTML for Images

Using the same technique as for JS files, we can normalize the images to be loaded from a standard directory.

for img in soup.body.find_all('img', src=True):

   # Print the path
   print(' IMG src = ' + img['src'])

   # Update (normalize) the path
   img_path = img['src']
   img_file = img_path.split('/')[-1] # select the last segment
   img['src'] = '/assets/img/' + img_file
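The JS and image loops share the same pattern, so they can be folded into one helper that also covers stylesheets. This is a sketch under a few assumptions of mine: the `ASSET_RULES` table and the `normalize_assets` name are hypothetical, and remote (CDN) or inline assets are left untouched.

```python
import posixpath

from bs4 import BeautifulSoup

# (tag name, attribute holding the path, target directory): an assumed mapping
ASSET_RULES = [
    ("script", "src", "/assets/js/"),
    ("img", "src", "/assets/img/"),
    ("link", "href", "/assets/css/"),
]

def normalize_assets(soup):
    """Rewrite local asset paths so they load from the /assets/ directories."""
    for tag_name, attr, target_dir in ASSET_RULES:
        for tag in soup.find_all(tag_name):
            path = tag.get(attr)  # None for inline scripts
            if not path or path.startswith(("http://", "https://", "//", "data:")):
                continue  # skip inline, remote and embedded assets
            if tag_name == "link" and "stylesheet" not in (tag.get("rel") or []):
                continue  # only stylesheets go to /assets/css/
            tag[attr] = target_dir + posixpath.basename(path)
    return soup

html = """
<html><head><link rel="stylesheet" href="css/main.css"></head>
<body>
<img src="images/pic01.jpg">
<script src="js/bootstrap.js"></script>
<script src="https://cdn.example.com/lib.js"></script>
<script>console.log('inline');</script>
</body></html>
"""
soup = normalize_assets(BeautifulSoup(html, "html.parser"))
print(soup)
```

Keeping the rules in a table makes it easy to add more asset types later (fonts, for instance) without touching the loop.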

Save the HTML

All our changes are made in memory. To make these changes permanent we need to extract the string representation of our processed HTML from BS, and dump it into a file for later usage:

processed_html = soup.prettify(formatter="html")
# The with block closes the file automatically, even if write() fails
with open('index2.html', 'w') as f:
    f.write(processed_html)

Real life sample

The sample is a simple navigation bar from the Stellar theme by HTML5 UP, extracted from this file

Pug version

nav#nav
  ul
    li
      a.active.newclass(href='https://appseed.us/html-parser').

        Introduction

    li
      a(href='#first').

        First Section

    li
      a(href='#second').

        Second Section

    li
      a(href='#cta').

        Get Started

PHP version

<nav id="nav">
 <ul>
  <li>
   <a class="active newclass" href="https://appseed.us/html-parser">
    <?php echo $var_1?>
   </a>
  </li>
  <li>
   <a href="#first">
    <?php echo $var_2?>
   </a>
  </li>
  <li>
   <a href="#second">
    <?php echo $var_3?>
   </a>
  </li>
  <li>
   <a href="#cta">
    <?php echo $var_4?>
   </a>
  </li>
 </ul>
</nav>
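To give an idea of how hard-coded text can be turned into template variables like the `$var_1` ... `$var_4` above, here is a rough sketch. It is not the code of the tool: the `extract_component` name is hypothetical, the sample is a shortened navigation bar, and it emits Jinja2-style placeholders instead of PHP tags because `{{ ... }}` is not escaped when BeautifulSoup serializes the tree.

```python
from bs4 import BeautifulSoup

# Shortened navigation bar, used only for this illustration
NAV_HTML = """
<nav id="nav">
  <ul>
    <li><a class="active" href="#intro">Introduction</a></li>
    <li><a href="#first">First Section</a></li>
  </ul>
</nav>
"""

def extract_component(html, tag_name="a"):
    """Replace the text of each matching tag with a placeholder.

    Returns the template string and a dict holding the original texts.
    """
    soup = BeautifulSoup(html, "html.parser")
    context = {}
    for index, tag in enumerate(soup.find_all(tag_name), start=1):
        name = f"var_{index}"
        context[name] = tag.get_text(strip=True)
        tag.string = "{{ " + name + " }}"
    return str(soup), context

template, context = extract_component(NAV_HTML)
print(template)
print(context)
```

The returned `context` dictionary can then be passed to the template engine to render the original texts back, or replaced with new content.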

Projects built with this tool

All are open-source, with live DEMO.


Resources


Thank you!

Oldest comments (5)

Scott Simontis

Hey Sm0ke,

Thank you for sharing this article with the community, but is there any chance you can share a little more? I know that the proprietary licensing prevents you from releasing the code, but could you discuss some of the algorithms that were used or perhaps use pseudocode to demonstrate how certain sections work? Otherwise, I am afraid your article might violate section 11 of the community's Terms of Use:

Users must make a good-faith effort to share content that is on-topic, of high-quality, and is not designed primarily for the purposes of promotion or creating backlinks.

Sm0ke

Hello @ssimontis ,

I will add more information regarding tool architecture & use.
As I mentioned the tool will provide some free assets to developers:

  • a light version to be used by anyone
  • processed themes released under a permissive license translated to Jinja, PUG .. etc.

Until then, I will add more information regarding the tool algorithms,
to make it more useful to the audience.

Thank you!

Scott Simontis

Thank you, that's awesome!

Sm0ke

Hello @ssimontis ,
As promised, I've added more information regarding the HTML parser internals. Tell me if you find the updates useful, or suggest more topics to be added.
Happy parsing!

Scott Simontis

Thank you so much for doing that! I found it very useful and I appreciate the time you put into it! Actually hoping I can play with parsing later today, one coding assessment left to complete for interviews...