In this article, I present a simple HTML parser that I use to integrate HTML themes much faster into legacy apps coded in different technologies. When a customer requests a new UI for their app, doing the work manually can take some time, so I decided to automate part of the flow. Using the tool, I can update the design of a simple website with two or three pages in less than two hours.
Note: the tool is not open-source, but I will consider releasing a light version as an open-source project in the future. In the HTML Parser public repository, I will publish processed HTML themes converted to PUG, Jinja, and Blade to be used by anyone.
HTML Parser features
- Normalize the HTML file to load assets from standard directories ( /assets/ [ img, js, css ] ), making integration with webpack-related tools much easier
- Edit / traverse the HTML tree
- Edit attributes like anchor HREF, span texts, remove elements, edit class names
- Extract components for production use for various engines like PUG, Jinja2, Blade
- Migrate legacy Bootstrap layouts to Bulma and Tailwind CSS frameworks
HTML Parser Implementation
To process the HTML tree, we first need to load the whole file. The BeautifulSoup constructor accepts the string to parse and the name of the desired parser, and loads the resulting tree into memory.
    from bs4 import BeautifulSoup as bs

    def read_file(path):
        # Return the file content as a string
        with open(path, encoding='utf-8') as f:
            return f.read()

    html_content = read_file('index.html')
    soup = bs(html_content, 'html.parser')

    # At this point, we can interact with the HTML elements
    # stored in memory using the helpers offered by the BS library
The BeautifulSoup library supports more than one parser (e.g. lxml, xml, html5lib); the differences between them become clear on HTML documents that are not well-formed. For instance, lxml will add missing closing tags for all elements. For more information, please see the dedicated section of the documentation on this topic.
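To see how the parsers differ, the same invalid fragment can be fed to each of them. The sketch below is a minimal illustration, not part of the tool: the `parse_with` helper is a hypothetical name, and it assumes that lxml and html5lib may or may not be installed, so a missing parser is simply reported as `None`.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_with(markup, parsers=('html.parser', 'lxml', 'html5lib')):
    """Parse the same markup with each parser and collect the serialized trees."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The parser is optional and not installed in this environment
            results[name] = None
    return results

# An invalid fragment: an unclosed <a> followed by a stray </p>
for name, tree in parse_with('<a></p>').items():
    print(name, '->', tree)
```

Running this on a few broken snippets taken from a real theme is a quick way to decide which parser to rely on before processing the whole file.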
To select the whole HEAD node and interact with its elements, we need just a few lines of code:
    header = soup.find('head')

    # If we want to change the title
    header.title.string.replace_with('My new title')
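The same handful of BeautifulSoup calls covers the other edits listed in the features: changing an anchor HREF, replacing a span text, renaming classes, and removing elements. The sketch below is only an illustration under stated assumptions: `apply_edits` is a hypothetical helper, and `CLASS_MAP` is a made-up two-entry Bootstrap-to-Bulma mapping, not the one used by the tool.

```python
from bs4 import BeautifulSoup

# Hypothetical class mapping, for illustration only
CLASS_MAP = {'btn': 'button', 'btn-primary': 'is-primary'}

def apply_edits(soup):
    # Edit an attribute: rewrite the href of the first anchor
    first_link = soup.find('a')
    first_link['href'] = 'https://appseed.us/html-parser'

    # Edit the text inside the first span
    soup.find('span').string = 'Updated text'

    # Edit class names on every tag that has a class attribute
    for tag in soup.find_all(class_=True):
        tag['class'] = [CLASS_MAP.get(name, name) for name in tag['class']]

    # Remove elements: drop every iframe from the tree
    for frame in soup.find_all('iframe'):
        frame.decompose()
    return soup

html = ('<body><a class="btn btn-primary" href="#">Go</a>'
        '<span>Old</span><iframe src="x"></iframe></body>')
print(apply_edits(BeautifulSoup(html, 'html.parser')))
```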
To scan the HTML soup for script tags, we can use the find_all helper:
    for script in soup.body.find_all('script', recursive=False):
        # Skip inline scripts, which have no src attribute
        if not script.has_attr('src'):
            continue
        # Print the path
        print(' JS source = ' + script['src'])
        # Update (normalize) the path
        js_path = script['src']
        js_file = js_path.split('/')[-1]  # select the last segment
        script['src'] = '/assets/js/' + js_file
Using the same technique as for the JS files, we can normalize the images so they are loaded from a standard directory.
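    for img in soup.body.find_all('img'):
        # Skip images without a src attribute
        if not img.has_attr('src'):
            continue
        # Print the path
        print(' IMG src = ' + img['src'])
        img_path = img['src']
        img_file = img_path.split('/')[-1]  # select the last segment
        img['src'] = '/assets/img/' + img_file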
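The two loops follow the same pattern, so they can be folded into one helper that also covers stylesheets, the third asset type in the /assets/ [ img, js, css ] layout. This is a minimal sketch rather than the tool's actual code: `normalize_assets` and `to_standard_path` are hypothetical names, and it assumes that keeping only the last path segment is enough, as in the loops above.

```python
from bs4 import BeautifulSoup

def to_standard_path(url, target_dir):
    """Keep only the file name and prefix it with the standard directory."""
    return target_dir + url.split('/')[-1]

def normalize_assets(soup):
    # Scripts and images both reference their file through "src"
    for script in soup.find_all('script', src=True):
        script['src'] = to_standard_path(script['src'], '/assets/js/')
    for img in soup.find_all('img', src=True):
        img['src'] = to_standard_path(img['src'], '/assets/img/')
    # Stylesheets are <link> tags whose rel attribute contains "stylesheet"
    for link in soup.find_all('link', href=True):
        if 'stylesheet' in link.get('rel', []):
            link['href'] = to_standard_path(link['href'], '/assets/css/')
    return soup

html = ('<html><head><link rel="stylesheet" href="css/vendor/main.css"></head>'
        '<body><img src="images/pic01.jpg">'
        '<script src="js/lib/jquery.min.js"></script></body></html>')
print(normalize_assets(BeautifulSoup(html, 'html.parser')))
```

Passing `src=True` or `href=True` to find_all restricts the search to tags that actually carry the attribute, so inline scripts are left untouched.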
All our changes are made in memory. To make them permanent, we need to extract the string representation of the processed HTML from BS and dump it into a file for later use:
    processed_html = soup.prettify(formatter="html")

    # The with block closes the file even if the write fails
    with open('index2.html', 'w', encoding='utf-8') as f:
        f.write(processed_html)
- Index file: original version and normalized version
- JSON descriptor: generated by the HTML parser tool, it describes the assets and resources used by the HTML files
- Navigation component
    nav#nav
      ul
        li
          a.active.newclass(href='https://appseed.us/html-parser').
            Introduction
        li
          a(href='#first').
            First Section
        li
          a(href='#second').
            Second Section
        li
          a(href='#cta').
            Get Started
    <nav id="nav">
      <ul>
        <li>
          <a class="active newclass" href="https://appseed.us/html-parser">
            <?php echo $var_1?>
          </a>
        </li>
        <li>
          <a href="#first">
            <?php echo $var_2?>
          </a>
        </li>
        <li>
          <a href="#second">
            <?php echo $var_3?>
          </a>
        </li>
        <li>
          <a href="#cta">
            <?php echo $var_4?>
          </a>
        </li>
      </ul>
    </nav>
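A component like the one above can be approximated with a few BeautifulSoup calls: locate the node, replace each link text with a numbered placeholder, and keep the original texts aside. The sketch below is an assumption about how such an extraction could work, not the tool's implementation; `extract_component` is a hypothetical helper, and it emits Jinja2-style `{{ var_N }}` placeholders instead of the PHP ones, since BeautifulSoup escapes angle brackets in text nodes by default.

```python
from bs4 import BeautifulSoup

def extract_component(soup, tag_name, element_id):
    """Return the component markup with placeholders, plus the original texts."""
    node = soup.find(tag_name, id=element_id)
    variables = {}
    for index, anchor in enumerate(node.find_all('a'), start=1):
        name = 'var_%d' % index
        # Remember the original text, then swap in the placeholder
        variables[name] = anchor.get_text(strip=True)
        anchor.string = '{{ ' + name + ' }}'
    return str(node), variables

html = ('<nav id="nav"><ul>'
        '<li><a href="#first">First Section</a></li>'
        '<li><a href="#second">Second Section</a></li>'
        '</ul></nav>')
template, variables = extract_component(BeautifulSoup(html, 'html.parser'), 'nav', 'nav')
print(template)
print(variables)
```

Keeping the texts in a separate dictionary means the same template can be rendered later with different labels, which is the point of extracting the component in the first place.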
The projects listed below are all open-source and come with a live demo.
- JAMstack Fractal - HTML5Up design coded in JAMstack pattern
- JAMstack BigPicture - HTML5Up design coded in JAMstack pattern
- JAMstack Landed - HTML5Up Landed design coded in JAMstack pattern
- Flask Dashboard Material Design - Admin Dashboard with Material Design
- Flask Dashboard NowUI - Admin Dashboard with NowUI Design
- Flask Dashboard Black - Open-Source Admin Panel
- Flask Dashboard Argon - Open-Source Admin Panel
- Flask Dashboard Light - Open-Source Admin Panel
- HTML Parser - How to use Python BS4 to work less
- Developer Tools - Open-Source HTML Parser - related article
- HTML Parser - used by the AppSeed App Generator to parse flat HTML
- BeautifulSoup Html Parser documentation
- HTML Parser sources - the official public repository
- HTML Parser provided by AppSeed
- HTML Parser - Convert HTML to Jinja2 and Php components - related blog article
- Video presentation HTML parsing and components extraction