What is an HTML Parser
According to Wikipedia, parsing (or syntactic analysis) is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. In this article, HTML parsing means walking through the HTML code and extracting and processing relevant information such as the head title, page assets, and main sections.
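To make the definition concrete, here is a minimal sketch that uses only Python's standard-library `html.parser` module (not BeautifulSoup, which the rest of this article relies on). The `PageSummary` class and the sample HTML string are hypothetical, invented for illustration only.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the page title and the src of every <script> tag."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.scripts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'script':
            src = dict(attrs).get('src')  # None for inline scripts
            if src:
                self.scripts.append(src)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Text nodes are only kept while we are inside <title>
        if self._in_title:
            self.title += data


# Hypothetical page, small enough to check by eye
sample_html = (
    '<html><head><title>Demo page</title></head>'
    '<body><script src="/static/app.js"></script></body></html>'
)

summary = PageSummary()
summary.feed(sample_html)
print(summary.title)
print(summary.scripts)
```

BeautifulSoup wraps this kind of event-driven parsing behind a navigable tree, which is why the snippets in this article are much shorter.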
A note: All code snippets listed on this page are used in production for a real-life project presented in a previous article: HTML Parser - Developer Tools
To execute the sample code, we need a Python environment and a few useful dependencies:
    $ pip install ipython          # the console where we execute the code
    $ pip install requests         # a library to pull the entire HTML page
    $ pip install beautifulsoup4   # the real magic is here (imported as bs4)
If all goes well, we can start coding. Please type ipython to start the interactive Python console:
    # import libraries
    import requests
    from bs4 import BeautifulSoup as bs

    # define the URL to crawl & parse
    # feel free to change this URL with your own app
    app_url = 'https://flask-bulma-css.appseed.us/'

    # crawling the page. This might take a few seconds
    page = requests.get( app_url )

    # to check the crawl status, just type:
    page
    <Response [200]> # all good

    # to print the page contents type:
    page.content
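Outside an interactive console, nobody is watching the response status, so it can help to fail loudly on errors. The following is a minimal sketch under that assumption; the `fetch_html` helper and its `http` parameter are hypothetical names, not part of the original project.

```python
import requests


def fetch_html(url, timeout=10, http=requests):
    """Return the decoded HTML of `url`, raising on HTTP errors.

    `http` is any object with a requests-style get() method; it defaults
    to the requests module itself and can be replaced in tests.
    """
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


# Example call (needs network access):
# html = fetch_html('https://flask-bulma-css.appseed.us/')
```

The explicit `timeout` matters because `requests.get` waits indefinitely by default when the server never answers.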
At this point we have the page content. Let's load the HTML into BeautifulSoup and extract some information from the remote page.
    soup = bs(page.content, 'html.parser')

    # print the entire page head
    soup.head

    # print only the title
    soup.head.title
    <title>Flask Bulma CSS - BulmaPlay Open-Source App </title>
To check the accuracy of the result, we can look at the source of the page.
    # the code: list the scripts that are direct children of <body>
    for script in soup.body.find_all('script', recursive=False):
        print(' Js = ' + script['src'])

    # the output
    Js = /static/assets/js/jquery.min.js
    Js = /static/assets/js/jquery.lazy.min.js
    Js = /static/assets/js/slick.min.js
    Js = /static/assets/js/scrollreveal.min.js
    Js = /static/assets/js/jquery.waypoints.min.js
    Js = /static/assets/js/jquery.waypoints-sticky.min.js
    Js = /static/assets/js/jquery.counterup.min.js
    Js = https://cdnjs.cloudflare.com/ajax/libs/modernizr/2.8.3/modernizr.min.js
    Js = /static/assets/js/app.js
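On other pages, some script tags are inline and have no src attribute, in which case `script['src']` raises a `KeyError`. Here is a hedged sketch that skips those and resolves relative paths to absolute URLs. The sample HTML and `base_url` are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = 'https://example.com/'  # hypothetical base URL
sample_html = '''
<body>
  <script src="/static/assets/js/app.js"></script>
  <script>console.log("inline");</script>
  <script src="https://cdn.example.com/lib.min.js"></script>
</body>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

script_urls = []
for script in soup.body.find_all('script', recursive=False):
    src = script.get('src')  # None for inline scripts, no KeyError
    if src:
        # Relative paths are resolved against base_url;
        # absolute URLs pass through unchanged
        script_urls.append(urljoin(base_url, src))

for url in script_urls:
    print(' Js = ' + url)
```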
Let's print the content of the app.js file:
    from urllib.parse import urljoin

    # app_url is initialized a few lines above:
    # app_url = 'https://flask-bulma-css.appseed.us/'
    # urljoin avoids the double slash that plain concatenation would produce
    app_js = requests.get(urljoin(app_url, '/static/assets/js/app.js'))

    # to check the status, just type the name of the object
    app_js
    <Response [200]> # all good

    # let's print the content of the remote file
    app_js.content # some unminified js code will be listed here.
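Building asset URLs by string concatenation is easy to get wrong when the base URL already ends with a slash. A small sketch with the standard-library `urljoin` shows the difference; the `naive` and `joined` variable names are only for illustration.

```python
from urllib.parse import urljoin

app_url = 'https://flask-bulma-css.appseed.us/'
asset_path = '/static/assets/js/app.js'

naive = app_url + asset_path            # keeps both slashes
joined = urljoin(app_url, asset_path)   # resolves the path against the base

print(naive)   # https://flask-bulma-css.appseed.us//static/assets/js/app.js
print(joined)  # https://flask-bulma-css.appseed.us/static/assets/js/app.js
```

Many servers tolerate the doubled slash, but it is not guaranteed, so resolving the path is the safer habit.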
Let's print the level-one elements of the page body:
    # the code
    for elem in soup.body.children:
        if elem.name: # we need this check, some elements don't have a name
            print( ' -> elem ' + elem.name )

    # the output
    -> elem div
    -> elem section
    -> elem section
    -> elem section
    -> elem section
    -> elem section
    -> elem section
    -> elem footer
    -> elem div
    -> elem div
    -> elem div
    -> elem script
    -> elem script
    -> elem script
    -> elem script
    -> elem script
    -> elem script
    -> elem script
    -> elem script
    -> elem script
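The name check is needed because the whitespace between tags is also returned by `.children`, as text nodes whose name is `None`. The sketch below, run on a made-up HTML fragment, shows this and summarizes the top-level tags with a `Counter`.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical fragment, for illustration only
sample_html = '''<body>
  <div>header</div>
  <section>one</section>
  <section>two</section>
  <footer>end</footer>
</body>'''

soup = BeautifulSoup(sample_html, 'html.parser')

# Text nodes (here: the whitespace between tags) report a name of None
names = [child.name for child in soup.body.children]

# Keep only real tags and count them by name
tag_counts = Counter(name for name in names if name)

print(names)
print(tag_counts)
```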
Let's print the footer:
    soup.footer

    # to have a nice print of elements, we can use the BS prettify() helper
    # using prettify(), the output is nicely indented
    print(soup.footer.prettify())

    # the output
    <footer class="footer footer-dark">
     <div class="container">
      <div class="columns">
       <div class="column">
        <div class="footer-logo">
         <img alt="Footer Logo for BulmaPlay - JAMStack Bulma CSS Web App." src="/static/assets/images/logos/bulmaplay-logo.png"/>
        </div>
        ....
        </div>
       </div>
      </div>
     </div>
    </footer>
For the last code snippet, let's print the links referenced in the footer section:
    # the code
    for elem in soup.body.footer.find_all('a'):
        print(' footer href = ' + elem['href'])

    # the output
    footer href = https://bulma.io
    footer href = https://github.com/app-generator/flask-bulma-css
    footer href = https://appseed.us/apps/bulma-css?flask-bulma-css
    footer href = https://blog.appseed.us/tag/bulma-css
    footer href = https://absurd.design/
    footer href = https://github.com/cssninjaStudio/fresh
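Anchors without an href attribute, such as named anchors, would make `elem['href']` fail. One possible variation, sketched here on a made-up footer, is to ask BeautifulSoup only for anchors that carry the attribute by passing `href=True` to `find_all`.

```python
from bs4 import BeautifulSoup

# Hypothetical footer, for illustration only
sample_html = '''
<footer>
  <a href="https://bulma.io">Bulma</a>
  <a name="top">anchor without href</a>
  <a href="https://absurd.design/">Illustrations</a>
</footer>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

# href=True keeps only <a> tags that have an href attribute
footer_links = [a['href'] for a in soup.footer.find_all('a', href=True)]

for href in footer_links:
    print(' footer href = ' + href)
```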
- BeautifulSoup Html Parser documentation
- HTML Parser sources - the official public repository
- HTML Parser provided by AppSeed
- HTML Parser - Convert HTML to Jinja2 and Php components - related blog article
- Video presentation HTML parsing and components extraction
Useful? AMA in the comments. Thank you & happy parsing!