Hello Coders,
This article presents a simple, open-source tool that I use to statically analyze HTML files for missing assets and broken links before using the files in real projects. The HTML Parser is essentially a Python 3 wrapper over Beautiful Soup, the popular open-source library for parsing HTML and XML files. The source code is available on GitHub, released under a EULA license.
Thank you! Content provided by AppSeed - App Generator.
Features
- Open-Source - can be also used for eLearning
- Works with directories - all HTML files are scanned
- Detects missing assets (JS, CSS, images) for each page
- Detects broken links and suggests the right path
- Acceptable execution time - 100 pages processed in under 1 minute
- Html Parser - source code
- Sample Output - captured from a real project
- EULA License - free for solo-developers, small companies, startUps, and NGOs
To use the tool we need to specify two things:
- The folder where HTML files are saved
- The assets folder - the parent directory for all JS, CSS, and image files
Once this simple setup is in place, we can run the script in the terminal:
$ python ./check-assets.py
HTML Parser - The Relevant Parts
To scan and correlate the information, the tool uses a few structures to save and reuse the relevant information and also perform simple operations over detected HTML files.
How it works
- Define a map where the key is the file name
- Associate a data structure with each file, where the relevant information is stored and updated
- Scan each HTML file for assets and links
- Validate the information for each file by checking the disk, and save the missing assets per file
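The steps above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: `PageInfo`, `find_missing`, and the idea that a `/static/assets/` URL prefix maps onto a local assets folder are assumptions made for the example.

```python
import os
import tempfile

class PageInfo:
    """Hypothetical per-file record (the tool's own class is TMPL)."""
    def __init__(self, file_name):
        self.file = file_name
        self.assets = []  # asset paths referenced by the page
        self.err = []     # referenced assets that are missing on disk

def find_missing(pages, assets_root, url_prefix="/static/assets/"):
    """Fill page.err with every referenced asset not found under assets_root."""
    for page in pages.values():
        for asset in page.assets:
            # Map the URL-style path onto the local assets folder (assumed layout)
            if asset.startswith(url_prefix):
                relative = asset[len(url_prefix):]
            else:
                relative = asset.lstrip("/")
            local_path = os.path.join(assets_root, *relative.split("/"))
            if not os.path.isfile(local_path):
                page.err.append(asset)
    # Return only the pages that have at least one missing asset
    return [page for page in pages.values() if page.err]

# Demo with a throwaway assets folder that contains a single CSS file
assets_root = tempfile.mkdtemp()
os.makedirs(os.path.join(assets_root, "css"))
open(os.path.join(assets_root, "css", "style.css"), "w").close()

# The map: file name -> record
pages = {"index.html": PageInfo("index.html")}
pages["index.html"].assets = [
    "/static/assets/css/style.css",
    "/static/assets/css/style-ERROR.css",
]

pages_with_errors = find_missing(pages, assets_root)
for page in pages_with_errors:
    print(page.file, "->", page.err)
```

Keeping one record per file in a dictionary makes the final report a simple loop over the records that have a non-empty error list.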
HTML Parser - Source Code
The relevant functions and code chunks are below. If something relevant is missing, feel free to ask for it in the comments section:
Read files from a directory
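from os import walk

def get_files( aPath ):
    # Collect the names of the files found directly inside aPath
    FILES_LIST = []
    for (root, dirs, files) in walk( aPath ):
        FILES_LIST.extend( files )
        break  # stop after the top-level directory, sub-folders are not scanned
    return FILES_LIST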
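Note that `get_files` returns every file in the top-level folder, whatever its extension. If the folder also holds non-HTML files or nested pages, a variant along these lines could help. This is only a sketch: `get_html_files` and its `recursive` flag are hypothetical names and are not part of the tool.

```python
import os
import tempfile

def get_html_files(a_path, recursive=False):
    """Return HTML file paths relative to a_path, sorted for stable output."""
    found = []
    for root, dirs, files in os.walk(a_path):
        for name in files:
            if name.lower().endswith((".html", ".htm")):
                full_path = os.path.join(root, name)
                found.append(os.path.relpath(full_path, a_path))
        if not recursive:
            break  # stop after the top-level directory, like get_files
    return sorted(found)

# Demo on a temporary folder with one nested page and one non-HTML file
demo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo, "pages"))
for rel in ("index.html", "notes.txt", os.path.join("pages", "about.html")):
    open(os.path.join(demo, rel), "w").close()

print(get_html_files(demo))
print(get_html_files(demo, recursive=True))
```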
The structure/class to save the information for each file
class TMPL:
    # constructor
    def __init__(self, aFile=''):
        self.file = aFile
        self.title = ''
        self.css = []        # All CSS Files
        self.js = []         # All JS Files
        self.img = []        # All Images
        self.links = []      # All Links
        self.err = []        # used to report missing assets
        self.err_links = []  # used to report broken links
    # Used to have a string representation
    def __repr__(self):
        return "" + self.file + ' some other info'
Initiate Beautiful Soup object for each file
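import htmlmin
from bs4 import BeautifulSoup as bs
def get_bs( aFile ):
    # file_load is a helper from the tool (not shown here), expected to return the file content as a string
    minified = htmlmin.minify( file_load( aFile ), remove_empty_space=True)
    return bs(minified,'html.parser')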
Scan each file for Links and assets
The results are injected into associated structures for each file.
# BS object is constructed and available for queries
soup = get_bs( FULL_PATH )
# Scan for CSS files
tmpl.css = get_css( soup )
# Scan for JS files
tmpl.js = get_js( soup )
...
Links and images are scanned in the same way using simple helpers.
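The helpers themselves are not listed in this article. As a rough idea of what they might look like, here is a minimal sketch built on Beautiful Soup's `find_all`. The bodies below are assumptions, and `get_img` and `get_links` are hypothetical names. Only `get_css` and `get_js` appear in the snippet above.

```python
from bs4 import BeautifulSoup

def get_css(soup):
    """Return the href of every <link rel="stylesheet"> element."""
    return [tag["href"] for tag in soup.find_all("link", rel="stylesheet", href=True)]

def get_js(soup):
    """Return the src of every <script> element that loads an external file."""
    return [tag["src"] for tag in soup.find_all("script", src=True)]

def get_img(soup):
    """Return the src of every <img> element."""
    return [tag["src"] for tag in soup.find_all("img", src=True)]

def get_links(soup):
    """Return the href of every <a> element."""
    return [tag["href"] for tag in soup.find_all("a", href=True)]

html = """
<html><head>
  <link rel="stylesheet" href="/static/assets/css/style.css">
  <script src="/static/assets/js/app.js"></script>
  <script>console.log("inline scripts have no src and are skipped");</script>
</head><body>
  <img src="/static/assets/images/logo.svg">
  <a href="index.html">Home</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(get_css(soup))
print(get_js(soup))
print(get_img(soup))
print(get_links(soup))
```

In the sample output a favicon appears in the CSS list, which suggests the original helper may collect every `<link>` tag rather than stylesheets only. The sketch filters on `rel="stylesheet"`, so adjust the filter if that behavior is wanted.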
Once the information is saved, we can traverse the DOM using BS objects and perform mutations over elements.
HTML Parser - Sample output
To see the output produced for a real project, open the sample file saved in the public repository: check assets - output
(env) PS > python.exe .\check-assets.py
Files (2)
['apps-calendar.html', 'index.html']
***** ***** *****
PROCESSING --> apps-calendar.html | files (1) remaining
PROCESSING --> index.html | files (0) remaining
PROCESSING --> apps-calendar.html
ERR - Missing Asset -> /static/assets/css/classic-horizontal/style-ERROR.css
ERR - Missing Asset -> /static/assets/images/logo-mini-ERROR.svg
PROCESSING --> index.html
ERR - Missing Asset -> /static/assets/images/favicon-ERROR.png
|
|- apps-calendar.html
| |
| |--- CSS: 6 file(s)
| | /static/assets/vendors/mdi/css/materialdesignicons.min.css
| | /static/assets/vendors/css/vendor.bundle.base.css
| | /static/assets/vendors/fullcalendar/fullcalendar.min.css
| | /static/assets/css/classic-horizontal/style.css
| | /static/assets/css/classic-horizontal/style-ERROR.css
| | /static/assets/images/favicon.png
|
...
Pages with errors: 2
|
|- apps-calendar.html
| | | /static/assets/css/classic-horizontal/style-ERROR.css
| | | /static/assets/images/logo-mini-ERROR.svg
|
|- index.html
| | | /static/assets/images/favicon-ERROR.png
The tool can be easily extended to LIVE websites using the existing core. In case any of you find it useful, feel free to suggest features in the comments section or push a PR on Github.
Thank you! - For more resources, please access:
- Beautiful Soup - the official docs
- AppSeed - for more tools and starters
Btw, my (nick) name is Sm0ke and I'm also pretty active on Twitter.