What is an HTML Parser
According to Wikipedia, parsing (or syntactic analysis) is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. In this article, HTML parsing means crawling the HTML code and extracting and processing relevant information such as the head title, page assets, and main sections.
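To make the definition concrete, here is a minimal sketch that uses Python's built-in `html.parser` module rather than the BeautifulSoup approach used in the rest of this article. The class name `TitleAndAssets` and the sample markup are made up for illustration. The parser walks the markup and reports start tags, end tags, and text, and the class keeps only what it cares about.

```python
from html.parser import HTMLParser


class TitleAndAssets(HTMLParser):
    """Collects the page title and script sources while the markup is parsed."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
        elif tag == 'script':
            # attrs is a list of (name, value) pairs
            src = dict(attrs).get('src')
            if src:
                self.scripts.append(src)

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # keep text only while we are inside <title>
        if self.in_title:
            self.title += data


# hypothetical sample markup, not the real page
sample_html = """<html><head><title>Sample page</title></head>
<body><script src="/static/app.js"></script></body></html>"""

parser = TitleAndAssets()
parser.feed(sample_html)
print(parser.title)
print(parser.scripts)
```

BeautifulSoup does this bookkeeping for us and exposes the result as a navigable tree, which is why the snippets below are much shorter.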
To execute the sample code, we need a Python environment and a few useful dependencies:
    $ pip install ipython         # the console where we execute the code
    $ pip install requests        # a library to pull the entire HTML page
    $ pip install beautifulsoup4  # the real magic is here (imported as bs4)
If all goes well, we can start coding. Please type ipython to start the interactive Python console:
    # import libraries
    import requests
    from bs4 import BeautifulSoup as bs

    # define the URL to crawl & parse
    # feel free to change this URL with your own app
    app_url = 'https://flask-bulma-css.appseed.us/'

    # crawling the page. This might take a few seconds
    page = requests.get(app_url)

    # to check the crawl status, just type:
    page
    <Response [200]>  # all good

    # to print the page contents type:
    page.content
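Outside the interactive console, nobody is looking at the response object, so a script should check the status itself. Below is a minimal sketch, assuming a helper named `fetch_html` (hypothetical, not part of the original code). The `get` parameter is there only so the function can be tried without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as bytes, raising if the request failed.

    `get` defaults to requests.get; it is a parameter only so that a
    stand-in can be passed during tests (an assumption of this sketch).
    """
    response = get(url, timeout=timeout)
    # raises requests.HTTPError for 4xx / 5xx status codes
    response.raise_for_status()
    return response.content
```

A call such as `fetch_html(app_url)` would then either return the HTML or fail loudly instead of handing an error page to the parser.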
At this point we have the page content. Let's inject the HTML into BeautifulSoup and get some information from the remote page.
    soup = bs(page.content, 'html.parser')

    # print the entire page head
    soup.head

    # print only the title
    soup.head.title
    <title>Flask Bulma CSS - BulmaPlay Open-Source App </title>
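The title lookup above returns a tag object, not a plain string. Here is a small sketch on an inline HTML sample (an assumption, so it runs without network access) showing how to get the text and what happens when a page has no title.

```python
from bs4 import BeautifulSoup

# inline sample standing in for the downloaded page
sample_html = """
<html>
  <head><title>Flask Bulma CSS - BulmaPlay Open-Source App </title></head>
  <body></body>
</html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

title_tag = soup.head.title                   # a Tag object
title_text = title_tag.get_text(strip=True)   # plain text, whitespace trimmed
print(title_text)

# a document without a <title> yields None, so check before using it
missing = BeautifulSoup('<p>no head here</p>', 'html.parser').title
print(missing)
```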
To check the accuracy of the result, we can compare it with the page source in the browser. Next, let's list the JavaScript files loaded directly by the page body:
    # the code
    for script in soup.body.find_all('script', recursive=False):
        print(' Js = ' + script['src'])

    # the output
     Js = /static/assets/js/jquery.min.js
     Js = /static/assets/js/jquery.lazy.min.js
     Js = /static/assets/js/slick.min.js
     Js = /static/assets/js/scrollreveal.min.js
     Js = /static/assets/js/jquery.waypoints.min.js
     Js = /static/assets/js/jquery.waypoints-sticky.min.js
     Js = /static/assets/js/jquery.counterup.min.js
     Js = https://cdnjs.cloudflare.com/ajax/libs/modernizr/2.8.3/modernizr.min.js
     Js = /static/assets/js/app.js
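The loop above works because every script on that page has a `src` attribute. On other pages, an inline script would make `script['src']` raise a `KeyError`. A hedged sketch on a made-up sample shows a more forgiving variant that uses `.get()`.

```python
from bs4 import BeautifulSoup

# made-up sample: one nested script, one direct script, one inline script
sample_html = """
<body>
  <div><script src="/static/nested.js"></script></div>
  <script src="/static/assets/js/app.js"></script>
  <script>console.log('inline');</script>
</body>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# recursive=False limits the search to direct children of <body>
direct_scripts = soup.body.find_all('script', recursive=False)

# .get() returns None instead of raising when the attribute is missing
sources = [s.get('src') for s in direct_scripts if s.get('src')]
inline_count = sum(1 for s in direct_scripts if not s.get('src'))

print(sources)
print(inline_count)
```

The nested script is skipped because of `recursive=False`, and the inline script is counted rather than crashing the loop.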
Let's print the content of the app.js file:
    # app_url is initialized a few lines above:
    # app_url = 'https://flask-bulma-css.appseed.us/'
    # strip the trailing slash to avoid a double slash in the final URL
    app_js = requests.get(app_url.rstrip('/') + '/static/assets/js/app.js')

    # to check the status, just type the name of the object
    app_js
    <Response [200]>  # all good, let's print the content of the remote file

    app_js.content
    # some unminified js code will be listed here.
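Building asset URLs by hand is easy to get slightly wrong: when the base ends with a slash and the path starts with one, plain concatenation yields a double slash. As an alternative sketch, the standard library's `urljoin` resolves a path against a base URL and leaves absolute URLs (such as the CDN script listed earlier) untouched.

```python
from urllib.parse import urljoin

app_url = 'https://flask-bulma-css.appseed.us/'

# plain concatenation keeps both slashes
naive = app_url + '/static/assets/js/app.js'

# urljoin resolves the path against the base URL
joined = urljoin(app_url, '/static/assets/js/app.js')

# an absolute URL is returned unchanged
cdn = urljoin(
    app_url,
    'https://cdnjs.cloudflare.com/ajax/libs/modernizr/2.8.3/modernizr.min.js',
)

print(naive)
print(joined)
print(cdn)
```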
Let's print the first-level elements of the page body:
    # the code
    for elem in soup.body.children:
        if elem.name:  # we need this check: text nodes (e.g. whitespace) have no name
            print(' -> elem ' + elem.name)

    # the output
     -> elem div
     -> elem section
     -> elem section
     -> elem section
     -> elem section
     -> elem section
     -> elem section
     -> elem footer
     -> elem div
     -> elem div
     -> elem div
     -> elem script
     -> elem script
     -> elem script
     -> elem script
     -> elem script
     -> elem script
     -> elem script
     -> elem script
     -> elem script
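The name check is needed because `.children` also yields the text between tags, including the newlines used for indentation. A small sketch on an inline sample (an assumption, not the real page) makes that visible and shows `find_all(True, recursive=False)` as one alternative that returns direct child tags only.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# tiny sample with newlines between the tags
sample_html = "<body>\n<div>one</div>\n<section>two</section>\n</body>"
soup = BeautifulSoup(sample_html, 'html.parser')

children = list(soup.body.children)

# whitespace between tags shows up as NavigableString nodes with name None
text_nodes = [c for c in children if isinstance(c, NavigableString)]
tag_names = [c.name for c in children if isinstance(c, Tag)]

# find_all(True) matches every tag; recursive=False keeps it to direct children
same_names = [t.name for t in soup.body.find_all(True, recursive=False)]

print(tag_names)
print(same_names)
```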
Let's print the footer:
    soup.footer

    # to have a nice print of elements, we can use BS prettify() helper
    # using prettify(), the output is nicely indented
    print(soup.footer.prettify())

    # the output
    <footer class="footer footer-dark">
     <div class="container">
      <div class="columns">
       <div class="column">
        <div class="footer-logo">
         <img alt="Footer Logo for BulmaPlay - JAMStack Bulma CSS Web App." src="/static/assets/images/logos/bulmaplay-logo.png"/>
        </div>
        ....
        </div>
       </div>
      </div>
     </div>
    </footer>
And for the last code snippet, let's print the anchors found in the footer section:
    # the code
    for elem in soup.body.footer.find_all('a'):
        print(' footer href = ' + elem['href'])

    # the output
     footer href = https://bulma.io
     footer href = https://github.com/app-generator/flask-bulma-css
     footer href = https://appseed.us/apps/bulma-css?flask-bulma-css
     footer href = https://blog.appseed.us/tag/bulma-css
     footer href = https://absurd.design/
     footer href = https://github.com/cssninjaStudio/fresh
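As with scripts, an anchor without an `href` would break the loop above. The sketch below, on a made-up footer, filters with `href=True` and then groups the links by host using the standard library. The sample markup is an assumption; only the first two URLs come from the output above.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# made-up footer: two real links and one anchor without href
sample_html = """
<footer>
  <a href="https://bulma.io">Bulma</a>
  <a href="https://github.com/app-generator/flask-bulma-css">Sources</a>
  <a name="top">anchor without href</a>
</footer>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# href=True keeps only anchors that actually carry an href attribute
links = [a['href'] for a in soup.footer.find_all('a', href=True)]

# unique hosts referenced by the footer, sorted for stable output
hosts = sorted({urlparse(link).netloc for link in links})

print(links)
print(hosts)
```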
- Developer Tools - Open-Source HTML Parser - related article
- HTML Parser - Extract HTML information with ease - A few practical code snippets to extract and process HTML information
- HTML Parser - How to use Python BS4 to work less
- HTML Parser - used by the AppSeed App Generator to parse flat HTML
- BeautifulSoup HTML parser documentation
- HTML Parser sources - the official public repository
- HTML Parser provided by AppSeed
- HTML Parser - Convert HTML to Jinja2 and PHP components - related blog article
- Video presentation HTML parsing and components extraction
Useful? AMA in the comments. Thank you & happy HTML parsing!