Sm0ke

Posted on Jul 23, 2019 • Edited on Oct 14, 2021

HTML Parser - Developer Tools

#htmlparser #tools #templates #appseed

Hello Coders,

In this article, I present a simple HTML parser that I use to integrate HTML themes much faster into legacy apps coded in different technologies. When a customer requests a new UI for an app, the manual processing can take some time, so I decided to automate part of the flow. Using the tool, I can update the design of a simple website with 2-3 pages in less than 2 hours.

Thanks for reading! - Content provided by App Generator.

Main Feature

The tool converts flat HTML to production-ready components for different engines: PUG, Jinja2, Blade, Mustache, and core PHP.

Note: the tool is not open-source, but I will consider releasing a light version as an open-source project in the future. In the HTML Parser public repository, I will publish processed HTML themes converted to PUG, Jinja, and Blade to be used by anyone.

Technologies

The HTML Parser tool is written in Python 3 on top of the BeautifulSoup library and runs as an interactive console. I was able to use the tool on real projects after about 3 months of R&D work.

HTML Parser features

Normalize the HTML file to load the assets from standard directories ( /assets/ [ img, js, css ] ), which makes the integration with webpack-related tools much easier
Edit / traverse the HTML tree
Edit attributes like anchor HREF, span texts, remove elements, edit class names
Extract components for production use for various engines like PUG, Jinja2, Blade
Migrate legacy Bootstrap layouts to Bulma and Tailwind CSS frameworks
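
The editing features listed above map onto a handful of BeautifulSoup calls. The sketch below is not the tool's code, only a minimal illustration on an invented HTML fragment; the class rename (`btn` to `button`) is an arbitrary example, not the tool's actual Bootstrap-to-Bulma mapping.

```python
from bs4 import BeautifulSoup

# Invented fragment, used only for this illustration
html = """
<div class="container">
  <a class="btn btn-primary" href="old.html">Read <span>more</span></a>
  <p class="obsolete">Remove me</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Edit an attribute: point the anchor at a new URL
link = soup.find("a")
link["href"] = "/pages/new.html"

# Edit the text of a span
link.span.string = "the docs"

# Edit class names: bs4 exposes "class" as a list of strings
link["class"] = ["button" if name == "btn" else name for name in link["class"]]

# Remove an element (and its children) from the tree
soup.find("p", class_="obsolete").decompose()

print(soup.prettify())
```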

HTML Parser Implementation

To process the HTML tree, we first need to load the whole file. The BeautifulSoup constructor accepts the string to parse and the name of the desired parser, and builds the tree in memory.

Load HTML in memory

from bs4 import BeautifulSoup as bs

# Read the file content as a string
with open('index.html', encoding='utf-8') as f:
    html_content = f.read()
soup = bs(html_content, 'html.parser')
# At this point, we can interact with the HTML
# elements stored in memory using all helpers offered by the BS library

The BeautifulSoup library supports more than one parser (e.g. lxml, xml, html5lib); the differences between them become clear on HTML documents that are not well-formed. For instance, lxml will add missing closing tags for all elements. For more information, see the section of the BeautifulSoup documentation dedicated to this topic.
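
Because lxml and html5lib are optional installs, a script that prefers one of them may want a fallback. The sketch below assumes a hypothetical helper named `make_soup` (not part of BeautifulSoup or of the tool); it relies on the `FeatureNotFound` exception that bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib")):
    """Parse markup with the first installed parser from `preferred`,
    falling back to the built-in html.parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    return BeautifulSoup(markup, "html.parser")

# Markup with unclosed <li> tags: the resulting tree depends on the parser used
broken = "<ul><li>One<li>Two</ul>"
soup = make_soup(broken)
print(soup.prettify())
```

Printing the tree for the same broken input under each installed parser is a quick way to see how they differ before committing to one.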

Parse Head section

Selecting the whole HEAD node and interacting with its elements takes just a few lines of code:

header = soup.find('head')

# If we want to change the title

header.title.string.replace_with('My new title')
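
The snippet above assumes the document already has a head with a non-empty title; on an empty or missing title, `header.title.string` is `None` and the call fails. A more defensive variant could look like the sketch below, where `set_title` is a hypothetical helper name introduced for this example.

```python
from bs4 import BeautifulSoup

def set_title(soup, new_title):
    """Set the <title> text, creating <head> or <title> when they are missing."""
    head = soup.find("head")
    if head is None:
        head = soup.new_tag("head")
        # Put <head> first inside <html> when present, else at the top of the document
        parent = soup.find("html")
        if parent is None:
            parent = soup
        parent.insert(0, head)
    if head.title is None:
        head.append(soup.new_tag("title"))
    # Assigning to .string replaces whatever the tag contained before
    head.title.string = new_title
    return soup

page = BeautifulSoup("<html><body><p>No head here</p></body></html>", "html.parser")
set_title(page, "My new title")
print(page)
```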

Parse HTML for JS Scripts

JavaScript files are referenced in the HTML through script nodes:

...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...

To scan the HTML soup for script tags, we can use the find_all helper:

# src=True skips inline scripts, which have no src attribute
for script in soup.body.find_all('script', src=True, recursive=False):

   # Print the path
   print(' JS source = ' + script['src'])

   # Update (normalize) the path
   js_path = script['src']
   js_file = js_path.split('/')[-1] # select the last segment
   script['src'] = '/assets/js/' + js_file
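
The loop above rewrites every path it sees, including scripts loaded from a CDN. A slightly more careful normalization step could leave remote URLs untouched. The sketch below uses only the standard library; `normalize_asset_path` is a hypothetical helper, and keeping the query string is a design assumption, not something the tool is known to do.

```python
import posixpath
from urllib.parse import urlsplit

def normalize_asset_path(src, kind):
    """Map a local asset path to /assets/<kind>/<file name>.

    External URLs (http://, https://, //cdn...) and data: URIs are returned unchanged.
    """
    parts = urlsplit(src)
    if parts.scheme or parts.netloc:
        return src  # remote or inline resource: leave it alone
    file_name = posixpath.basename(parts.path)
    if not file_name:
        return src  # nothing that looks like a file name
    suffix = "?" + parts.query if parts.query else ""
    return "/assets/" + kind + "/" + file_name + suffix

for path in ["js/bootstrap.js", "../js/custom.js?v=3", "https://cdn.example.com/jquery.min.js"]:
    print(path, "->", normalize_asset_path(path, "js"))
```

Inside the loop, the assignment would then become `script['src'] = normalize_asset_path(script['src'], 'js')`.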

Parse HTML for Images

Using the same technique as for JS files, we can normalize the Images to be loaded from a standard directory.

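# src=True skips img tags that have no src attribute
for img in soup.body.find_all('img', src=True):

   # Print the path
   print(' IMG src = ' + img['src'])

   # Update (normalize) the path
   img_path = img['src']
   img_file = img_path.split('/')[-1] # select the last segment
   img['src'] = '/assets/img/' + img_file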
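
The article shows scripts and images; stylesheets can be handled the same way. The sketch below is an illustration on an invented page, not the tool's code. It selects only `link` tags whose `rel` contains `stylesheet`, so that icons and other links are left alone.

```python
from bs4 import BeautifulSoup

# Invented page, used only for this illustration
html = """
<html><head>
  <link rel="stylesheet" href="css/main.css">
  <link rel="icon" href="favicon.ico">
</head><body></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True skips link tags without an href attribute
for link in soup.find_all("link", rel="stylesheet", href=True):
    css_file = link["href"].split("/")[-1]  # keep only the file name
    link["href"] = "/assets/css/" + css_file

print(soup.head)
```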

Save the HTML

All our changes are made in memory. To make these changes permanent we need to extract the string representation of our processed HTML from BS, and dump it into a file for later usage:

processed_html = soup.prettify(formatter="html")
# The with block closes the file even if the write fails
with open('index2.html', 'w', encoding='utf-8') as f:
    f.write(processed_html)

Real life sample

The sample is a simple navigation bar extracted from the Stellar HTML5Up theme.

Index file: original version and normalized version
The JSON descriptor is generated by the HTML parser tool and encapsulates the assets and resources used by the HTML files
Navigation component
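
The descriptor format used by the tool is not shown in this article. As a rough idea of what such a file might hold, the sketch below collects the asset paths of a page into a dictionary and dumps it as JSON; the function name `describe_assets` and the keys (`page`, `css`, `js`, `img`) are assumptions made for this example.

```python
import json
from bs4 import BeautifulSoup

def describe_assets(html, page_name):
    """Collect the asset paths referenced by a page into a plain dict."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "page": page_name,
        "css": [l["href"] for l in soup.find_all("link", rel="stylesheet", href=True)],
        "js": [s["src"] for s in soup.find_all("script", src=True)],
        "img": [i["src"] for i in soup.find_all("img", src=True)],
    }

# Invented page, used only for this illustration
sample = """
<html><head><link rel="stylesheet" href="/assets/css/main.css"></head>
<body>
  <img src="/assets/img/logo.png">
  <script src="/assets/js/main.js"></script>
</body></html>
"""

descriptor = describe_assets(sample, "index.html")
print(json.dumps(descriptor, indent=2))
```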

Pug version

nav#nav
  ul
    li
      a.active.newclass(href='https://appseed.us/html-parser').

        Introduction

    li
      a(href='#first').

        First Section

    li
      a(href='#second').

        Second Section

    li
      a(href='#cta').

        Get Started

PHP version

<nav id="nav">
 <ul>
  <li>
   <a class="active newclass" href="https://appseed.us/html-parser">
    <?php echo $var_1?>
   </a>
  </li>
  <li>
   <a href="#first">
    <?php echo $var_2?>
   </a>
  </li>
  <li>
   <a href="#second">
    <?php echo $var_3?>
   </a>
  </li>
  <li>
   <a href="#cta">
    <?php echo $var_4?>
   </a>
  </li>
 </ul>
</nav>
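
Both versions above replace the link texts with template variables. The tool's own extraction code is not public, so the sketch below only illustrates the idea with BeautifulSoup. It emits Jinja2-style placeholders, and `extract_component` is a hypothetical helper that returns the template plus the original texts keyed by variable name.

```python
from bs4 import BeautifulSoup

def extract_component(html, element_id):
    """Return (template, context) for the element with the given id.

    Link texts are replaced by Jinja2-style placeholders; `context` maps
    each variable name to the text it replaced.
    """
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(id=element_id)
    if node is None:
        raise ValueError("no element with id " + repr(element_id))
    context = {}
    for index, anchor in enumerate(node.find_all("a"), start=1):
        name = "var_" + str(index)
        context[name] = anchor.get_text(strip=True)
        anchor.string = "{{ " + name + " }}"  # replaces the anchor's content
    return str(node), context

nav_html = (
    '<nav id="nav"><ul>'
    '<li><a href="#first">First Section</a></li>'
    '<li><a href="#cta">Get Started</a></li>'
    '</ul></nav>'
)
template, context = extract_component(nav_html, "nav")
print(template)
print(context)
```

Rendering the template with `context` in a Jinja2 environment would reproduce the original link texts, which makes the extracted component easy to check against its source.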

Projects built with this tool

All are open-source, with live DEMO.

JAMstack Fractal - HTML5Up design coded in JAMstack pattern
JAMstack BigPicture - HTML5Up design coded in JAMstack pattern
JAMstack Landed - HTML5Up Landed design coded in JAMstack pattern
Flask Dashboard Material Design - Admin Dashboard with Material Design
Flask Dashboard NowUI - Admin Dashboard with NowUI Design
Flask Dashboard Black - Open-Source Admin Panel
Flask Dashboard Argon - Open-Source Admin Panel
Flask Dashboard Light - Open-Source Admin Panel

Resources

HTML Parser - How to use Python BS4 to work less
Developer Tools - Open-Source HTML Parser - related article
HTML Parser - used by the AppSeed App Generator to parse flat HTML
BeautifulSoup Html Parser documentation
HTML Parser sources - the official public repository
HTML Parser provided by AppSeed
HTML Parser - Convert HTML to Jinja2 and Php components - related blog article
Video presentation HTML parsing and components extraction

Thank you!

Oldest comments (5)

Scott Simontis • Jul 24 '19

Hey Sm0ke,

Thank you for sharing this article with the community, but is there any chance you can share a little more? I know that the proprietary licensing prevents you from releasing the code, but could you discuss some of the algorithms that were used or perhaps use pseudocode to demonstrate how certain sections work? Otherwise, I am afraid your article might violate section 11 of the community's Terms of Use:

Users must make a good-faith effort to share content that is on-topic, of high-quality, and is not designed primarily for the purposes of promotion or creating backlinks.

Sm0ke • Jul 24 '19

Hello @ssimontis ,

I will add more information regarding tool architecture & use.
As I mentioned the tool will provide some free assets to developers:

a light version to be used by anyone
processed themes released under a permissive license translated to Jinja, PUG .. etc.

Until then, I will add more information regarding the tool algorithms,
to make it more useful to the audience.

Thank you!

Scott Simontis • Jul 24 '19

Thank you, that's awesome!

Sm0ke • Jul 29 '19

Hello @ssimontis ,
As promised, I've added more information regarding the HTML parser internals. Tell me if you find the updates useful, or suggest more topics to be added.
Happy parsing!

Scott Simontis • Jul 29 '19

Thank you so much for doing that! I found it very useful and I appreciate the time you put into it! Actually hoping I can play with parsing later today, one coding assessment left to complete for interviews...