HTML Parser - Extract information from a LIVE website

Sm0ke ・4 min read

Hello Coder,

In this article, I will present a short list of code snippets for extracting information from a live website. The code is written in Python on top of the BeautifulSoup HTML parsing library.


What is an HTML Parser

According to Wikipedia, parsing, or syntactic analysis, is the process of analyzing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. In this article, HTML parsing means walking through the HTML code of a page and extracting relevant information such as the head title, the page assets, and the main sections.
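
To make the idea concrete, here is a minimal sketch that uses only Python's standard-library html.parser module (not BeautifulSoup, which the rest of this article relies on) to pull the title out of an HTML string. The TitleParser class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found inside the <title> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        # called for every opening tag the parser meets
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # text may arrive in several chunks, so we append
        if self.in_title:
            self.title += data

# hypothetical sample page
sample_html = '<html><head><title>Sample page</title></head><body><p>Hi</p></body></html>'

parser = TitleParser()
parser.feed(sample_html)
print(parser.title)
```

The parser walks the markup token by token and fires a callback for each tag or text chunk. BeautifulSoup hides this event handling behind a tree that we can query directly.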


Want to learn more about this topic? Access the AppSeed platform for more articles related to HTML parsing and developer tools. Thank you!


Set up the environment

To execute the sample code, we need a Python environment and a few useful dependencies:

$ pip install ipython # the console where we execute the code
$ pip install requests # a library to pull the entire HTML page
$ pip install beautifulsoup4 # the real magic is here (imported as bs4)

If all goes well, we can start coding. Please type ipython to start the interactive Python console:

# import libraries
import requests
from bs4 import BeautifulSoup as bs

# define the URL to crawl & parse
# feel free to change this URL with your own app
app_url = 'https://flask-bulma-css.appseed.us/'

# crawling the page. This might take a few seconds
page = requests.get( app_url )

# to check the crawl status, just type:
page
<Response [200]> # all good

# to print the page contents type:
page.content 
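
Outside an interactive console, it is usually worth failing early when the download goes wrong. The sketch below wraps the request in a small helper. The names fetch_html and get are hypothetical (they do not come from the requests library), and the 10-second timeout is an arbitrary choice.

```python
import requests

def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, raising on HTTP errors."""
    # `get` is injectable only so the helper can be tried without network access
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx replies
    return response.text         # decoded str, unlike the raw bytes in .content

# usage (needs network access):
# html = fetch_html('https://flask-bulma-css.appseed.us/')
```

The returned string can be passed to BeautifulSoup in the same way as page.content.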

At this point we have the page content. Let's pass the HTML to BeautifulSoup and get some information from the remote page.

soup = bs(page.content, 'html.parser')

# print the entire page head
soup.head

# print only the title
soup.head.title
<title>Flask Bulma CSS - BulmaPlay Open-Source App </title>

To verify the result, we can compare it with the source of the page as shown in the browser.
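
The same approach works for other parts of the head. Here is a small sketch that runs on an inline HTML string, so it can be tried offline. The markup, the description text and the stylesheet path are invented for the example.

```python
from bs4 import BeautifulSoup

# hypothetical markup standing in for a downloaded page
sample_html = """
<html>
  <head>
    <title> Sample App </title>
    <meta name="description" content="A demo page">
    <link rel="stylesheet" href="/static/app.css">
  </head>
  <body></body>
</html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# get_text(strip=True) returns the title text without the tag or the padding
title = soup.head.title.get_text(strip=True)

# find() returns None when the tag is missing, so guard before indexing
meta = soup.head.find('meta', attrs={'name': 'description'})
description = meta['content'] if meta else None

# collect every stylesheet referenced in the head
stylesheets = [link['href'] for link in soup.head.find_all('link', rel='stylesheet')]

print(title)
print(description)
print(stylesheets)
```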


Where to go from here?

Using the BeautifulSoup library, we can easily navigate the DOM. For instance, let's print the JavaScript files loaded by the page, using just a few lines of code:


# the code (src=True skips inline scripts that have no src attribute)
for script in soup.body.find_all('script', src=True, recursive=False):
    print(' Js = ' + script['src'])

# the output
 Js = /static/assets/js/jquery.min.js
 Js = /static/assets/js/jquery.lazy.min.js
 Js = /static/assets/js/slick.min.js
 Js = /static/assets/js/scrollreveal.min.js
 Js = /static/assets/js/jquery.waypoints.min.js
 Js = /static/assets/js/jquery.waypoints-sticky.min.js
 Js = /static/assets/js/jquery.counterup.min.js
 Js = https://cdnjs.cloudflare.com/ajax/libs/modernizr/2.8.3/modernizr.min.js
 Js = /static/assets/js/app.js
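
Most of these paths are relative to the site root, so they need to be combined with the page URL before they can be downloaded. A possible sketch, using urljoin from the standard library on a made-up body that mixes a relative, an absolute and an inline script:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'https://flask-bulma-css.appseed.us/'

# hypothetical body with one relative, one absolute and one inline script
sample_html = """
<body>
  <script src="/static/assets/js/app.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/modernizr/2.8.3/modernizr.min.js"></script>
  <script>console.log('inline');</script>
</body>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# urljoin leaves absolute URLs untouched and resolves relative ones
script_urls = [
    urljoin(base_url, script['src'])
    for script in soup.body.find_all('script', src=True)
]

for url in script_urls:
    print(url)
```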

Let's print the content of the app.js file:


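# app_url is initialized a few lines above:
# app_url = 'https://flask-bulma-css.appseed.us/'
# urljoin avoids the double slash that plain string concatenation would produce
from urllib.parse import urljoin
app_js = requests.get(urljoin(app_url, '/static/assets/js/app.js'))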

# to check the status, just type the name of the object
app_js
<Response [200]> # all good, let's print the content of the remote file

app_js.content
# some unminified js code will be listed here. 

Let's print the first-level elements (direct children) of the page body:


# the code
for elem in soup.body.children:
    if elem.name: # text nodes have no tag name, so we skip them
        print(' -> elem ' + elem.name)

# the output
 -> elem div
 -> elem section
 -> elem section
 -> elem section
 -> elem section
 -> elem section
 -> elem section
 -> elem footer
 -> elem div
 -> elem div
 -> elem div
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
 -> elem script
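
For long pages, a summary per tag name can be easier to read than the full list. The sketch below counts the direct children of a made-up body with collections.Counter. It is one possible way to do it, not part of the original snippets.

```python
from collections import Counter
from bs4 import BeautifulSoup

# hypothetical body, much shorter than the real page
sample_html = """
<body>
  <div></div>
  <section></section>
  <section></section>
  <footer></footer>
  <script></script>
</body>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# find_all(True, recursive=False) returns only the direct child tags,
# so the whitespace strings between them are skipped automatically
tag_counts = Counter(child.name for child in soup.body.find_all(True, recursive=False))

for name, count in tag_counts.items():
    print(f' -> {name}: {count}')
```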

Let's print the footer:

soup.footer

# for a readable printout, we can use the prettify() helper
# with prettify(), the output is nicely indented

print(soup.footer.prettify())

# the output
<footer class="footer footer-dark">
 <div class="container">
  <div class="columns">
   <div class="column">
    <div class="footer-logo">
     <img alt="Footer Logo for BulmaPlay - JAMStack Bulma CSS Web App." src="/static/assets/images/logos/bulmaplay-logo.png"/>
    </div>
....
    </div>
   </div>
  </div>
 </div>
</footer>


For the last code snippet, let's print the links referenced in the footer section:


# the code (href=True skips anchors that have no href attribute)
for elem in soup.body.footer.find_all('a', href=True):
    print(' footer href = ' + elem['href'])

# the output
 footer href = https://bulma.io
 footer href = https://github.com/app-generator/flask-bulma-css
 footer href = https://appseed.us/apps/bulma-css?flask-bulma-css
 footer href = https://blog.appseed.us/tag/bulma-css
 footer href = https://absurd.design/
 footer href = https://github.com/cssninjaStudio/fresh
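
Once the links are collected, a common next step is to separate the ones that stay on the site from the ones that leave it. A rough sketch, assuming a made-up footer (the /about link is hypothetical) and comparing host names with urlparse:

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

base_url = 'https://flask-bulma-css.appseed.us/'

# hypothetical footer: one external link, one internal link, one anchor without href
sample_html = """
<footer>
  <a href="https://bulma.io">Bulma</a>
  <a href="/about">About</a>
  <a id="top">no href here</a>
</footer>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
base_host = urlparse(base_url).netloc

internal, external = [], []
for anchor in soup.footer.find_all('a', href=True):
    url = urljoin(base_url, anchor['href'])  # resolve relative links first
    if urlparse(url).netloc == base_host:
        internal.append(url)
    else:
        external.append(url)

print('internal:', internal)
print('external:', external)
```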


Useful? AMA in the comments. Thank you & happy HTML parsing!

Discussion

Can I convert any HTML website to mobile responsive?

 

Awesome guide!

I made a simple scraper that writes the page to HTML, so I have an offline copy of the site. Will your tuts work with the offline copy?

 

Hello, thank you!
Yes, you can load the HTML from a file, instead of crawling. Please take a look at this article:

dev.to/sm0ke/html-parser-developer...

The relevant code snippet:


# read the file content as a string (plain open() is used here in place of a read_file helper)
with open('index.html', encoding='utf-8') as f:
    html_content = f.read()
soup = bs(html_content, 'html.parser')

Happy parsing!
.. <('_')> ..