Sm0ke

Posted on Jul 31, 2019 • Edited on Feb 26, 2021

HTML Parser - Extract information from a LIVE website

#htmlparser #tools #python #appseed

Hello Coder,

In this article, I will present a short list of code snippets for extracting information from a live website. The code is written in Python on top of the BeautifulSoup HTML parsing library.

Thank you! Content provided by AppSeed - App Generator.

What is an HTML Parser

According to Wikipedia, parsing or syntactic analysis is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. In this article, HTML parsing means walking through the HTML code to extract and process relevant information such as the head title, page assets, and main sections.
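
To make the idea concrete, here is a minimal sketch that uses only Python's standard-library html.parser module (not BeautifulSoup, which the rest of the article uses). The sample HTML string and the AssetCollector class name are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample page, used only for illustration
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <section>Hello</section>
    <script src="/static/app.js"></script>
  </body>
</html>
"""


class AssetCollector(HTMLParser):
    """Collect the page title and the script sources while parsing."""

    def __init__(self):
        super().__init__()
        self.title = ''
        self.scripts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True
        elif tag == 'script':
            # attrs is a list of (name, value) pairs
            src = dict(attrs).get('src')
            if src:
                self.scripts.append(src)

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # keep only the text found inside <title> ... </title>
        if self._in_title:
            self.title += data


collector = AssetCollector()
collector.feed(SAMPLE_HTML)
print(collector.title)
print(collector.scripts)
```

Writing handlers by hand works, but it gets verbose quickly, which is the main reason to reach for a higher-level library.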

Set up the environment

To execute the sample code, we need a Python environment and a few useful dependencies:

$ pip install ipython # the console where we execute the code
$ pip install requests # a library to pull the entire HTML page
$ pip install beautifulsoup4 # the real magic is here

If all goes well, we can start coding. Please type ipython to start the interactive Python console:

# import libraries
import requests
from bs4 import BeautifulSoup as bs

# define the URL to crawl & parse
# feel free to change this URL with your own app
app_url = 'https://flask-bulma-css.appseed.us/'

# crawling the page. This might take a few seconds
page = requests.get( app_url )

# to check the crawl status, just type:
page
<Response [200]> # all good

# to print the page contents type:
page.content
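
The status check above is done by eye in the console. In a script, the same check could be automated. The sketch below is one possible way to do it: fetch_html is a hypothetical helper name (not part of requests), and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    fetch_html is a hypothetical helper name, not part of requests.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # raise an HTTPError for 4xx / 5xx status codes
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        print('request failed:', err)
        return None
    return response.content
```

Calling fetch_html(app_url) would then return the same bytes as page.content, or None when the site cannot be reached.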

At this point we have the page content. Let's load the HTML into BeautifulSoup and get some information from the remote page.

soup = bs(page.content, 'html.parser')

# print the entire page head
soup.head

# print only the title
soup.head.title
<title>Flask Bulma CSS - BulmaPlay Open-Source App </title>

To check the accuracy of the result, we can compare it with the page source.
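
Note that soup.head.title returns the whole tag. If only the text is needed, get_text() can drop the markup. The sketch below runs on a small inline HTML string (an assumption, so it works without network access) that reuses the title shown above.

```python
from bs4 import BeautifulSoup as bs

# Inline stand-in for the downloaded page (illustration only)
html = (
    '<html><head>'
    '<title>Flask Bulma CSS - BulmaPlay Open-Source App </title>'
    '</head><body></body></html>'
)
soup = bs(html, 'html.parser')

title_tag = soup.head.title                   # the <title> tag itself
title_text = title_tag.get_text(strip=True)   # text only, whitespace trimmed

print(title_text)
```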

Where to go from here?

Using the BeautifulSoup library, we can easily navigate the DOM. For instance, let's print the JavaScript files used by the HTML file, using just a few lines of code:


# the code
# src=True skips inline <script> tags, which have no src attribute
for script in soup.body.find_all('script', src=True, recursive=False):
    print(' Js = ' + script['src'])

# the output
 Js = /static/assets/js/jquery.min.js
 Js = /static/assets/js/jquery.lazy.min.js
 Js = /static/assets/js/slick.min.js
 Js = /static/assets/js/scrollreveal.min.js
 Js = /static/assets/js/jquery.waypoints.min.js
 Js = /static/assets/js/jquery.waypoints-sticky.min.js
 Js = /static/assets/js/jquery.counterup.min.js
 Js = https://cdnjs.cloudflare.com/ajax/libs/modernizr/2.8.3/modernizr.min.js
 Js = /static/assets/js/app.js
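
Most of the paths printed above are relative to the site root, so they cannot be downloaded as they are. A minimal sketch of how they could be turned into absolute URLs with the standard-library urljoin function (the script_paths and absolute_urls names are made up for illustration):

```python
from urllib.parse import urljoin

app_url = 'https://flask-bulma-css.appseed.us/'

# two of the paths printed above: one relative, one already absolute
script_paths = [
    '/static/assets/js/app.js',
    'https://cdnjs.cloudflare.com/ajax/libs/modernizr/2.8.3/modernizr.min.js',
]

# urljoin resolves relative paths against app_url and leaves
# absolute URLs untouched
absolute_urls = [urljoin(app_url, path) for path in script_paths]

for url in absolute_urls:
    print(' Js = ' + url)
```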

Let's print the content of the app.js file:


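# app_url is initialized a few lines above:
# app_url = 'https://flask-bulma-css.appseed.us/'
# urljoin avoids the double slash that plain string concatenation would produce
from urllib.parse import urljoin
app_js = requests.get(urljoin(app_url, '/static/assets/js/app.js'))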

# to check the status, just type the name of the object
app_js
<Response [200]> # all good, let's print the content of the remote file

app_js.content
# some unminified js code will be listed here.

Let's print level one elements from the page body:


# the code
for elem in soup.body.children:
    if elem.name:  # skip text nodes (e.g. whitespace), which have no tag name
        print(' -> elem ' + elem.name)

# the output
 -> elem div
 -> elem section
 -> elem section
 -> elem section
 -> elem section
 -> elem section
 -> elem section
 -> elem footer
 -> elem div
 -> elem div
 -> elem div
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
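
A long listing like the one above can be condensed into a count per tag name. Here is a rough sketch using collections.Counter on a small inline HTML string (an assumption, so the snippet runs offline). It filters with isinstance(elem, Tag), a slightly different check than the elem.name test used above.

```python
from collections import Counter

from bs4 import BeautifulSoup as bs
from bs4 import Tag

# Inline stand-in for the page body (illustration only)
html = (
    '<body><div></div> <section></section> <section></section> '
    '<footer></footer> <script></script></body>'
)
soup = bs(html, 'html.parser')

# keep only real tags; the whitespace between them is parsed as text nodes
tag_counts = Counter(
    elem.name for elem in soup.body.children if isinstance(elem, Tag)
)

for name, count in tag_counts.items():
    print(' -> ' + name + ' x ' + str(count))
```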

Let's print the footer:

soup.footer

# for a nicer print of the elements, we can use BeautifulSoup's prettify() helper
# with prettify(), the output is nicely indented

print(soup.footer.prettify())

# the output
<footer class="footer footer-dark">
 <div class="container">
  <div class="columns">
   <div class="column">
    <div class="footer-logo">
     <img alt="Footer Logo for BulmaPlay - JAMStack Bulma CSS Web App." src="/static/assets/images/logos/bulmaplay-logo.png"/>
    </div>
....
    </div>
   </div>
  </div>
 </div>
</footer>

And the last code snippet: let's print the links referenced in the footer section:


# the code
# href=True skips anchors that have no href attribute
for elem in soup.body.footer.find_all('a', href=True):
    print(' footer href = ' + elem['href'])

# the output
 footer href = https://bulma.io
 footer href = https://github.com/app-generator/flask-bulma-css
 footer href = https://appseed.us/apps/bulma-css?flask-bulma-css
 footer href = https://blog.appseed.us/tag/bulma-css
 footer href = https://absurd.design/
 footer href = https://github.com/cssninjaStudio/fresh
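
Once the links are extracted, they can be processed like any other list of strings. As one possible follow-up, this sketch groups the footer links printed above by domain, using only the standard library (the footer_links and domains names are made up for illustration):

```python
from collections import Counter
from urllib.parse import urlparse

# the footer links printed above
footer_links = [
    'https://bulma.io',
    'https://github.com/app-generator/flask-bulma-css',
    'https://appseed.us/apps/bulma-css?flask-bulma-css',
    'https://blog.appseed.us/tag/bulma-css',
    'https://absurd.design/',
    'https://github.com/cssninjaStudio/fresh',
]

# netloc is the host part of the URL, e.g. 'github.com'
domains = Counter(urlparse(link).netloc for link in footer_links)

for domain, count in domains.most_common():
    print(' ' + domain + ' -> ' + str(count))
```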

Developer Tools - Open-Source HTML Parser - related article
HTML Parser - Extract HTML information with ease - A few practical code snippets to extract and process HTML information

Other Parsing Resources

HTML Parser - How to use Python BS4 to work less
HTML Parser - used by the AppSeed App Generator to parse flat HTML
BeautifulSoup Html Parser documentation
HTML Parser sources - the official public repository
HTML Parser provided by AppSeed
HTML Parser - Convert HTML to Jinja2 and Php components - related blog article
Video presentation HTML parsing and components extraction

Useful? AMA in the comments. Thank you & happy HTML parsing!

Top comments (3)

Azad Husen • Sep 25 '19

Can I convert any HTML website to mobile responsive?

Areahints • Aug 1 '19

Awesome guide!

I made a simple scraper that writes the page to HTML, so I have an offline copy of the site. Will your tuts work with the offline copy?

Sm0ke • Aug 1 '19

Hello, thank you!
Yes, you can load the HTML from a file, instead of crawling. Please take a look at this article:

dev.to/sm0ke/html-parser-developer...

The relevant code snippet:


# read_file returns the file content as a string
html_content = read_file('index.html')
soup = bs(html_content, 'html.parser')
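
The read_file helper is not defined in the snippet above. A minimal sketch of what it could look like, assuming it only needs to return the text of a UTF-8 encoded file (the encoding default is my assumption):

```python
from pathlib import Path


def read_file(path, encoding='utf-8'):
    """Return the content of a text file as a string."""
    return Path(path).read_text(encoding=encoding)
```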

Happy parsing!
.. <('_')> ..
